Dealing with missing data
One of the key challenges faced by researchers when using both survey and administrative data is data missingness. This refers to a situation where some values or observations are not present in a dataset, either because they were not collected or because they were lost or incomplete. It is important to properly address data missingness as it can significantly undermine the validity of empirical analysis and the estimates produced. Any estimation of population needs, the impact of an intervention, or the relationship between two variables can be severely affected by missing data.
According to UK Research and Innovation (2020), around 2,000 publications utilised UK administrative data between 2017 and 2019, covering a wide range of topics. The Office for National Statistics (ONS) is currently running a programme to transform its population and migration statistics, aiming to make greater use of administrative data for more frequent and timely outputs.
To support this effort, Alma Economics was asked by the ONS to conduct a Rapid Evidence Assessment of recent academic literature on data missingness. Our team identified and analysed the most prevalent forms of missingness and their implications, as well as the benefits and drawbacks of different methods that can be applied to address the issue. Our report lays out how to identify methods suitable for different types of datasets and causes of missingness, as well as the level of available information.
There are three main mechanisms of missingness: missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR). When data is missing completely at random, the probability that a value is missing is unrelated to any observed or unobserved information. The observed population is then a random sub-sample of the total population, and the missing information is less likely to bias any estimates produced, although the smaller sample reduces their precision.
When data is missing at random (MAR), the probability that a value is missing depends only on observed information. For example, such a case would occur when the probability of health status data being missing depends on the income of individuals, and income is a fully observed variable.
When data is missing not at random (MNAR), the probability of a data point being missing depends on the value of the information that is itself missing. For instance, this would occur if individuals with lower incomes tend to report their income less often than those with higher incomes.
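The three mechanisms can be made concrete with a small simulation. The sketch below is illustrative only and is not taken from the report: the population, the variable names (income, health), the 30% missingness rate and the logistic missingness probabilities are all assumptions chosen for the example. It compares the mean of the observed health values with the true mean under each mechanism.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hypothetical population (assumed for illustration): income is fully
# observed, and health status is partly driven by income.
income = rng.normal(0.0, 1.0, n)
health = 0.5 * income + rng.normal(0.0, 1.0, n)


def sigmoid(x):
    """Logistic function, used here to turn a score into a probability."""
    return 1.0 / (1.0 + np.exp(-x))


# MCAR: every health value has the same 30% chance of being missing.
miss_mcar = rng.random(n) < 0.3
# MAR: missingness in health depends only on the observed income.
miss_mar = rng.random(n) < sigmoid(-1.0 - 1.5 * income)
# MNAR: missingness in health depends on the health value itself.
miss_mnar = rng.random(n) < sigmoid(-1.0 - 1.5 * health)

true_mean = health.mean()
observed_means = {
    "MCAR": health[~miss_mcar].mean(),
    "MAR": health[~miss_mar].mean(),
    "MNAR": health[~miss_mnar].mean(),
}

print(f"true mean: {true_mean:.3f}")
for mechanism, observed_mean in observed_means.items():
    bias = observed_mean - true_mean
    print(f"{mechanism}: observed mean {observed_mean:.3f}, bias {bias:+.3f}")
```

In this setup the MCAR mean stays close to the true mean, while the MAR and MNAR means are pushed upwards because low values are more likely to be missing. The difference between the two is that under MAR the bias can be corrected using the observed income variable, whereas under MNAR the observed data alone do not identify the correction.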
A wide range of approaches to impute the values of missing data have been developed. These methods aim to replace missing values with calculated or estimated ones. They rely on the available data, the researchers’ knowledge of the population and the data collection exercise, as well as the missingness mechanism. Imputation methods can be as simple as replacing a missing value with the mean of the observed values but can also include regression analysis, maximum likelihood estimation, multiple imputation, and machine learning algorithms.
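As a minimal sketch of how two of these methods differ, the example below applies mean imputation and regression imputation to simulated data. It is not the report's example code: the data-generating process, the variable names and the MAR missingness rule (health is more often missing for low incomes) are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50_000

# Hypothetical data (assumed for illustration): income is fully observed,
# health is partly driven by income and is missing at random given income.
income = rng.normal(0.0, 1.0, n)
health = 0.5 * income + rng.normal(0.0, 1.0, n)
missing = rng.random(n) < 1.0 / (1.0 + np.exp(1.0 + 1.5 * income))
health_obs = np.where(missing, np.nan, health)

# Mean imputation: replace each missing value with the observed mean.
mean_imputed = np.where(missing, np.nanmean(health_obs), health_obs)

# Regression imputation: fit health on income using complete cases only,
# then predict the missing health values from the observed income.
slope, intercept = np.polyfit(income[~missing], health_obs[~missing], 1)
reg_imputed = np.where(missing, intercept + slope * income, health_obs)

true_mean = health.mean()
print(f"true mean:             {true_mean:.3f}")
print(f"mean imputation:       {mean_imputed.mean():.3f}")
print(f"regression imputation: {reg_imputed.mean():.3f}")
print(f"true variance:         {health.var():.3f}")
print(f"variance after mean imputation: {mean_imputed.var():.3f}")
```

Under these assumptions, mean imputation leaves the estimated mean biased and shrinks the variance, because every imputed value sits at the same point. Regression imputation recovers the mean because it uses the variable that drives missingness, but it still understates variability since the imputed values carry no residual noise. Multiple imputation is one way to reflect that uncertainty.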
The choice of method is affected by many different factors. For example, the purpose of the research is of critical importance. In some cases, simplicity and transparency should be prioritised (e.g., when imputed data is published by a statistical authority), while in other cases, such as academic research, the impact of a method on causal inference may be the priority.
Our team provided a comprehensive overview of the key methods to handle data missingness, together with a discussion of their main advantages and disadvantages. We also developed a web-based interactive Evidence Map and example Python code that illustrates how some of the methods discussed can be applied in practice.
Our full report can be found here.