Research IT

Processing UK Biobank Datasets

When faced with a large dataset from UK Biobank, Alex Casson’s research group needed an easy and quick way to process it. Christopher Beach explains how they tackled this issue with some help from Research IT.


As one of the follow-on studies complementing its initial data release, UK Biobank collected activity data from over 100,000 participants, each of whom wore a wrist-worn accelerometer continuously for seven days while going about their usual activities. This dataset enables a detailed and objective understanding of how physical activity affects health, rather than relying on imprecise self-reported measures. It consists of high-resolution acceleration measurements in three axes, which amounts to 25 TB of compressed data and requires substantial resources to process.

The accelerometer files are in a compressed proprietary binary format, and the data needs to be calibrated (to account for variations in local gravity), resampled (due to variations in accelerometer sampling rate) and filtered (to remove gravity and sensor noise). Researchers at the University of Oxford have developed code that performs each of these three tasks, but it was not developed to enable processing of the raw data, nor does it make it easy for researchers to develop their own algorithms.
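To make the three steps concrete, here is a minimal sketch in plain Python. It is not the Oxford code or our adaptation of it: the function names, the per-axis scale and offset, the linear interpolation and the moving-average smoother are all assumptions chosen for illustration. Gravity is removed here by subtracting 1 g from the vector magnitude and clipping at zero, which is one commonly used approach rather than necessarily the one used in either codebase.

```python
import math


def calibrate(values, scale, offset):
    """Apply a hypothetical per-axis calibration: offset + scale * raw."""
    return [offset + scale * v for v in values]


def resample_linear(times, values, rate_hz):
    """Linearly interpolate irregularly timed samples onto a uniform grid.

    Assumes `times` is strictly increasing and has at least two samples.
    """
    step = 1.0 / rate_hz
    n_out = int(math.floor((times[-1] - times[0]) * rate_hz + 1e-9)) + 1
    out = []
    j = 0
    for i in range(n_out):
        t = times[0] + i * step
        # Advance to the input interval [times[j], times[j + 1]] containing t.
        while j < len(times) - 2 and times[j + 1] < t:
            j += 1
        t0, t1 = times[j], times[j + 1]
        w = (t - t0) / (t1 - t0)
        out.append(values[j] + w * (values[j + 1] - values[j]))
    return out


def remove_gravity(x, y, z):
    """Vector magnitude minus 1 g, clipped at zero (inputs in units of g)."""
    return [max(math.sqrt(a * a + b * b + c * c) - 1.0, 0.0)
            for a, b, c in zip(x, y, z)]


def moving_average(signal, window):
    """Trailing moving average, a simple stand-in for a noise filter."""
    out = []
    for i in range(len(signal)):
        chunk = signal[max(0, i - window + 1):i + 1]
        out.append(sum(chunk) / len(chunk))
    return out
```

In this sketch each axis would be calibrated and resampled separately, and the three resampled axes would then be combined by `remove_gravity` before smoothing.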

We have adapted the Oxford code to allow processing of the raw (but resampled) data in Python, giving access to the full, rich data rather than limited summary statistics or 5 s epochs containing averaged information. Depending on your analysis, access to this raw data may be essential for developing algorithms that could not be built on the summary data alone.
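A toy example shows why averaged epochs can hide detail that the raw signal retains. The function name, the 1 Hz sampling rate and the two signals below are invented purely for illustration and are not taken from the dataset.

```python
def epoch_means(signal, rate_hz, epoch_s=5.0):
    """Average a uniformly sampled signal over consecutive epochs.

    Any partial epoch at the end of the signal is dropped.
    """
    n = int(round(rate_hz * epoch_s))
    return [sum(signal[i:i + n]) / n
            for i in range(0, len(signal) - n + 1, n)]


# Two made-up 5 s signals sampled at 1 Hz: steady movement and a short burst.
steady = [1.0, 1.0, 1.0, 1.0, 1.0]
burst = [0.0, 0.0, 5.0, 0.0, 0.0]

# Both collapse to the same 5 s epoch value, although their peaks differ.
steady_epochs = epoch_means(steady, rate_hz=1)
burst_epochs = epoch_means(burst, rate_hz=1)
```

An algorithm that depends on peak acceleration or on the timing of individual movements could tell these two signals apart only with the raw data.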

By working with Research IT, particularly with help from Pen Richardson and George Leaver at a drop-in session, we have written scripts to allow this code to be run on the high-performance computing infrastructure at the University of Manchester (the CSF3). This allows a far greater throughput than running on your own PC, as multiple records can be processed simultaneously. Further, you may find that your computer lacks the resources (due to limited RAM) to process these large files, a problem which is easily resolved on the CSF3.
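One common way to process many records at once on a cluster is a job array, where each task handles one file. The sketch below shows that pattern only; it is not our submission script. The helper names and file names are hypothetical, and it assumes the scheduler exposes the task index through `SGE_TASK_ID` or `SLURM_ARRAY_TASK_ID`, so check your own system's documentation for the variable it uses.

```python
import os


def current_task_id(environ=os.environ):
    """Return the array task index set by the scheduler, or None.

    Assumes SGE_TASK_ID (Grid Engine) or SLURM_ARRAY_TASK_ID (Slurm).
    Non-numeric values, such as "undefined" outside an array job, are ignored.
    """
    for name in ("SGE_TASK_ID", "SLURM_ARRAY_TASK_ID"):
        value = environ.get(name)
        if value and value.isdigit():
            return int(value)
    return None


def record_for_task(record_paths, task_id):
    """Map a 1-based task index to one record, using a stable sorted order."""
    if not 1 <= task_id <= len(record_paths):
        raise IndexError(f"task {task_id} has no record to process")
    return sorted(record_paths)[task_id - 1]
```

Each array task would call `current_task_id()`, pick its file with `record_for_task(...)` and process only that record, so the scheduler can run as many tasks side by side as resources allow.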

This work will enable researchers at UoM to use the CSF3 to process the full dataset. It will also help those at other institutions with similar high-performance computing infrastructure, and those who want to process the full data (rather than summary statistics) on their own computer, to make full use of this UK Biobank dataset.

We have used this data to convert measures of physical activity into estimates of energy-harvesting potential, to determine whether our daily activities produce enough energy to power wearable devices without battery recharging. We are currently writing up this work, which we aim to publish soon, so keep an eye out!

The code and our documentation are available on GitHub.

If you have an issue or problem you think we could help with, please get in touch or come along to one of our virtual drop-in sessions.