Research IT

Developing an open source environmental data-set and tools

Douglas Lowe and Ann Gledson, Research Software Engineers from Research IT, recently presented an online poster on their work for an Alan Turing Institute-funded project led by David Topping (Department of Earth and Environmental Sciences) and Caroline Jay (School of Computer Science).

Quantifying the detrimental impacts of environmental factors such as air quality and weather has become a global priority for both researchers and policy makers. In this project (Understanding the relationship between human health and the environment), our main aim has been to study the relationship between environmental data and self-reported allergy symptom data from the Britain Breathing project. Unfortunately, as with so many data analysis projects, we soon discovered that the primary challenge was obtaining, cleaning and preprocessing the required data, as the current systems and methodologies that support these tasks are far from perfect.

For our analysis, the data had to be extracted from three data-sets: air quality measurements from the Automatic Urban and Rural Network (AURN); weather and pollen measurements from the Medical and Environmental Data Mash-up Infrastructure (MEDMI); and modelled forecast data generated using the European Monitoring and Evaluation Programme (EMEP) model. Once obtained, the data required considerable further work: detecting and removing duplicates and unrealistic measurements, imputing missing values as accurately as possible, and augmenting data for regions with sparse coverage.
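
The cleaning steps described above can be illustrated with a minimal sketch. It is not the project's actual code: the function names, the plausible-value bounds and the use of simple linear interpolation are assumptions chosen for illustration, and the sketch assumes one row per day for a single site.

```python
from typing import Dict, List, Optional, Tuple

# (ISO date, measured value, or None if the value is missing)
Reading = Tuple[str, Optional[float]]


def remove_duplicates(readings: List[Reading]) -> List[Reading]:
    """Keep the first reading seen for each date, returned in date order."""
    first_seen: Dict[str, Optional[float]] = {}
    for date, value in readings:
        if date not in first_seen:
            first_seen[date] = value
    return sorted(first_seen.items())


def mask_unrealistic(readings: List[Reading], lower: float, upper: float) -> List[Reading]:
    """Treat values outside [lower, upper] as missing (hypothetical bounds)."""
    return [
        (date, value if value is not None and lower <= value <= upper else None)
        for date, value in readings
    ]


def impute_linear(readings: List[Reading]) -> List[Reading]:
    """Fill interior gaps by linear interpolation between known neighbours.

    Assumes evenly spaced (daily) rows; leading and trailing gaps stay None.
    """
    values = [value for _, value in readings]
    known = [i for i, value in enumerate(values) if value is not None]
    filled = list(values)
    for left, right in zip(known, known[1:]):
        step = (values[right] - values[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = values[left] + step * (i - left)
    return [(date, value) for (date, _), value in zip(readings, filled)]


# Illustrative values only, not real measurements.
raw = [
    ("2016-01-01", 10.0),
    ("2016-01-01", 10.0),    # duplicate row
    ("2016-01-02", None),    # missing value
    ("2016-01-03", -999.0),  # unrealistic sentinel value
    ("2016-01-04", 16.0),
]
cleaned = impute_linear(mask_unrealistic(remove_duplicates(raw), 0.0, 500.0))
print(cleaned)
```

In practice the choice of imputation method matters a great deal; linear interpolation is used here only because it is easy to follow.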

We were keen to open up the resulting work to other researchers and to spare them from repeating these arduous processes (or abandoning the work altogether for lack of time or skills). We are therefore publishing our data-sets and data extraction/cleaning tools, with a focus on making them as accessible and adaptable as possible, so that researchers can get on with evaluating the impact of environmental data rather than becoming bogged down in pre-processing. Importantly, the emphasis on a transparent data processing and visualisation methodology enables researchers to judge the usefulness, or otherwise, of any technique used for mapping single-site measurements to represent a specific geographical region.

A poster detailing this work was well received at the online Environmental Intelligence 2020 conference and has already generated interest in, and downloads of, our data-sets and tools. The data-sets that we have made available include daily air quality (NO2, NOX, SO2, O3, PM10, PM2.5), pollen (e.g. ambrosia, urtica) and weather (temperature, pressure and relative humidity) readings from AURN and MEDMI for the United Kingdom, for the years 2016 to 2019 inclusive. The cleaned data-set is currently available online and an imputed version is soon to be added. In addition, we will soon be publishing a data-set of UK regions with estimated environmental measurements for regions that lack sensors, derived from measurement data in surrounding regions.
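
As a rough illustration of estimating a value for a region without sensors from its surrounding regions, the sketch below averages the readings of neighbouring regions. This is a simplified assumption for illustration rather than the method used by the region estimator tool-set; the function name, region labels and neighbour map are all hypothetical.

```python
from statistics import mean
from typing import Dict, List, Optional


def estimate_region(
    region: str,
    measurements: Dict[str, float],
    neighbours: Dict[str, List[str]],
) -> Optional[float]:
    """Return the known value for a region, or the mean of its neighbours' values.

    Returns None when neither the region nor any neighbour has a measurement.
    """
    if region in measurements:
        return measurements[region]
    nearby = [measurements[n] for n in neighbours.get(region, []) if n in measurements]
    return mean(nearby) if nearby else None


# Illustrative values only: regions A and B have sensors, C and D do not.
measurements = {"A": 20.0, "B": 30.0}
neighbours = {"C": ["A", "B", "D"], "D": ["C"]}

estimate = estimate_region("C", measurements, neighbours)
print(estimate)
```

Displaying such estimates alongside known values, as the visualisation app does, is one way to judge whether a given estimation technique is fit for purpose.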


The tool-sets: The tools used to download and process the above measurement data-sets are available in a GitHub repository belonging to the University of Manchester Research IT group.

The region estimator tool-set is also available on GitHub, alongside the tools above.

Visualisation app: A data visualisation app that serves as a use-case for the above data-sets and tools; it displays estimated regional measurements alongside known values.

Published paper: Reani, M., Lowe, D., Gledson, A. et al. UK daily meteorology, air quality, and pollen measurements for 2016–2019, with estimates for missing data. Sci Data 9, 43 (2022).