March 19, 2024

How to deal with missing data?

In this article, we explore a common challenge in data analysis: managing missing data. Incomplete data can jeopardize the reliability of statistical analyses and compromise the quality of the results. Rather than simply replacing or deleting missing values, our avatar solution handles them in a way that preserves both privacy and the quality of analyses. We illustrate the effectiveness of the method through a concrete example.

Managing missing data

In data analysis, missing data is often a major challenge. Incomplete data can compromise the reliability and accuracy of statistical results. At Octopize, we understand the importance of managing these missing values intelligently when anonymizing data. Rather than simply deleting or replacing them, which would result in a real loss of information, we have incorporated a method that handles missing data efficiently.

Why is the data missing and why should we be vigilant?

Missing data can occur for a variety of reasons, such as input errors, incomplete questionnaires, or technical faults. Its presence can reduce the accuracy of analyses. It is therefore necessary to understand where the missing data comes from and to handle it according to the type of missing value.

There are three main types of missing data:

Missing completely at random (MCAR): In this case, the probability of a value being missing does not depend on any value in the data set, observed or not. In other words, there is no systematic reason behind the missing data, and the probability of absence is the same for all observations.

EXAMPLE: Each participant in a survey rolls a die before the income question and refuses to answer if a 6 comes up.

Missing at random (MAR): The probability of a value being missing depends on other observed variables, but not on the missing value itself once those variables are taken into account.

EXAMPLE: Older people are more likely not to report their salary. The fact that a value is missing therefore carries information for this variable, and anonymization must preserve the missingness pattern of MAR variables. See the example below for more details.

Missing not at random (MNAR): In this case, the probability of a value being missing in a variable depends on the value of that variable itself. The missing data cannot be inferred from the other information in the data set, but it is not missing by chance either.

EXAMPLE: An individual's salary is missing because it is high. In other words, the wealthiest people tend not to answer the salary question. MNAR data can be the most difficult to deal with because it is linked to values that were not observed.
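To make the three mechanisms concrete, here is a minimal Python sketch that simulates them on invented data. The variable names, the salary formula, and the probabilities (0.1 and 0.4) are assumptions chosen for illustration only; they do not come from the avatar method or from the data set used later in this article.

```python
import random


def simulate_missingness(n=10_000, seed=0):
    """Build an invented survey sample with one missingness flag per mechanism."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        age = rng.randint(20, 70)
        # Invented relationship: salary grows with age, plus noise.
        salary = 20_000 + 600 * (age - 20) + rng.gauss(0, 5_000)
        rows.append({
            "age": age,
            "salary": salary,
            # MCAR: a die roll, unrelated to any value in the data.
            "mcar_missing": rng.randint(1, 6) == 6,
            # MAR: depends only on the observed variable age.
            "mar_missing": rng.random() < (0.4 if age >= 50 else 0.1),
            # MNAR: depends on the salary itself.
            "mnar_missing": rng.random() < (0.4 if salary >= 45_000 else 0.1),
        })
    return rows


def missing_rate(rows, flag, keep):
    """Share of rows with `flag` set, among the rows selected by `keep`."""
    group = [r for r in rows if keep(r)]
    return sum(r[flag] for r in group) / len(group)


def is_older(row):
    return row["age"] >= 50


def is_younger(row):
    return row["age"] < 50


sample = simulate_missingness()

for flag in ("mcar_missing", "mar_missing"):
    print(flag,
          round(missing_rate(sample, flag, is_older), 3),
          round(missing_rate(sample, flag, is_younger), 3))

# Under MNAR, the salaries that remain observed are biased downwards.
true_mean = sum(r["salary"] for r in sample) / len(sample)
observed = [r["salary"] for r in sample if not r["mnar_missing"]]
observed_mean = sum(observed) / len(observed)
print(round(true_mean), round(observed_mean))
```

In this simulation, the MCAR missing rate is roughly the same in both age groups, the MAR rate is clearly higher for people aged 50 and over, and under MNAR the mean of the observed salaries underestimates the true mean.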

Dealing with missing data during anonymization with the avatar method

With the avatar solution, one approach to anonymizing data that contains missing values is to impute them prior to anonymization. Imputation estimates each missing value from the most similar records in the data set (nearest-neighbor imputation). This preserves the structure of the data and the relationships between variables while guaranteeing a sufficient level of privacy. However, this method leads to a loss of information, especially when the data is missing not at random (MNAR) or missing at random (MAR), because the fact that a value was missing is itself informative and disappears once the value is imputed.
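As an illustration of nearest-neighbor imputation in general, the sketch below fills a missing salary with the mean salary of the k most similar complete records. It is not the avatar implementation: the function name knn_impute, the column names, the Euclidean distance on unscaled features, and the toy records are all assumptions made for this example.

```python
import math


def knn_impute(rows, target, features, k=3):
    """Return a copy of `rows` where missing `target` values are replaced by
    the mean `target` of the k nearest complete rows (Euclidean distance)."""
    complete = [r for r in rows if r[target] is not None]
    imputed = []
    for row in rows:
        new_row = dict(row)
        if row[target] is None:
            point = [row[f] for f in features]
            # Sort complete rows by distance to the incomplete one.
            # Note: features are not rescaled here; real data usually needs it.
            neighbors = sorted(
                complete,
                key=lambda c: math.dist(point, [c[f] for f in features]),
            )[:k]
            new_row[target] = sum(n[target] for n in neighbors) / len(neighbors)
        imputed.append(new_row)
    return imputed


# Invented toy records; None marks a missing salary.
records = [
    {"age": 25, "years_of_study": 3, "salary": 30_000},
    {"age": 27, "years_of_study": 3, "salary": 32_000},
    {"age": 52, "years_of_study": 8, "salary": 70_000},
    {"age": 55, "years_of_study": 8, "salary": 74_000},
    {"age": 54, "years_of_study": 8, "salary": None},
]

filled = knn_impute(records, "salary", ["age", "years_of_study"], k=2)
print(filled[-1]["salary"])  # mean salary of the two closest records
```

Once the value is filled in, nothing in the output indicates that it was originally missing, which is the loss of information described above.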

To overcome this, we have developed another approach in the avatar method that processes missing data more effectively. It lets the anonymization process generate a synthetic data set that itself includes missing data. The process takes place in two steps. First, a new column is created to indicate whether each value is missing (True) or present (False). Second, the missing values in the original column are imputed. This gives us a complete data set during anonymization, which is needed in particular when projecting the data into a multidimensional space.

After anonymization, we remove the values flagged as missing (True) in order to obtain an anonymized data set that contains missing data. The avatars generated have missing data with the same structure and the same relationships as in the original data. The information carried by the missing data is thus retained throughout the anonymization process.
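The flag, impute, anonymize, and re-mask sequence described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the avatar code: the function names are hypothetical, the median stands in for nearest-neighbor imputation, and anonymize_placeholder is a stub that only copies the rows where a real run would generate avatars (including a synthetic flag column).

```python
from statistics import median


def add_flag_and_impute(rows, column):
    """Step 1: add a True/False missing flag. Step 2: impute the column."""
    observed = [r[column] for r in rows if r[column] is not None]
    fill = median(observed)  # stand-in for nearest-neighbor imputation
    flag = f"{column}_missing"
    return [
        {**r, flag: r[column] is None,
         column: fill if r[column] is None else r[column]}
        for r in rows
    ]


def anonymize_placeholder(rows):
    """Hypothetical stub: a real pipeline would anonymize the complete data here."""
    return [dict(r) for r in rows]


def restore_missing(rows, column):
    """After anonymization: blank out flagged values and drop the flag column."""
    flag = f"{column}_missing"
    restored = []
    for r in rows:
        out = {key: value for key, value in r.items() if key != flag}
        if r[flag]:
            out[column] = None
        restored.append(out)
    return restored


# Invented toy records; None marks a missing salary.
original = [
    {"age": 31, "salary": 35_000},
    {"age": 58, "salary": None},
    {"age": 44, "salary": 48_000},
]

prepared = add_flag_and_impute(original, "salary")   # complete data plus flag
synthetic = anonymize_placeholder(prepared)          # stub anonymization
result = restore_missing(synthetic, "salary")        # missing values are back
print(result)
```

Because the flag travels through anonymization as an ordinary column, its relationship with the other variables can be carried over to the synthetic records before the flagged values are blanked out again.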

Example of anonymizing missing data

To illustrate the anonymization of missing data, we use a synthetic data set where the salary variable contains missing data.

*Table 1: Extract from the data set; the salary variable contains missing values*

To understand the data, we plotted a few graphs (Charts 1 and 2). They show that age and the number of years of study are related to the salary variable.

Furthermore, the age distribution of people who answered the salary question differs from that of people who did not (Chart 4).
The same holds for the distribution of the number of years of study (Chart 3).

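People who did not answer the salary question tend to be older and to have more years of study. Since both variables are also related to salary, it can be assumed that the more people earn, the less likely they are to answer the salary question. Because this missingness can be explained by observed variables (age and years of study), we treat it as MAR.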
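The comparison between respondents and non-respondents can be reproduced numerically with a small helper such as the one below. It is only a sketch: the function name, the column names, and the toy records are invented for illustration and are not the data set shown in the charts.

```python
from statistics import mean


def compare_by_missingness(rows, target, covariate):
    """Mean of `covariate` for rows where `target` is missing vs. observed."""
    missing = [r[covariate] for r in rows if r[target] is None]
    observed = [r[covariate] for r in rows if r[target] is not None]
    return {"missing": mean(missing), "observed": mean(observed)}


# Invented toy records; None marks an unanswered salary question.
survey = [
    {"age": 28, "years_of_study": 2, "salary": 31_000},
    {"age": 35, "years_of_study": 3, "salary": 38_000},
    {"age": 41, "years_of_study": 5, "salary": 45_000},
    {"age": 57, "years_of_study": 8, "salary": None},
    {"age": 62, "years_of_study": 7, "salary": None},
]

age_summary = compare_by_missingness(survey, "salary", "age")
study_summary = compare_by_missingness(survey, "salary", "years_of_study")
print(age_summary)
print(study_summary)
```

A clear gap between the two groups for an observed variable suggests that the missingness is related to that variable. Running the same helper on the original data and on the anonymized data is one simple way to check that the pattern has been preserved.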

Will the information carried by the missing data be retained during anonymization with the avatar solution?

Here is an excerpt of the dataset after anonymization:

*Table 2: Excerpt from the anonymized dataset*

We can create the same graphs as before on the avatar data (Charts 5 and 6). We reach the same conclusion: age and the number of years of study are linked to the salary variable.

Do the age and number of years of study distributions vary if the salary is missing or not?

To answer this question, we carried out the same analyses as in the first part and obtained the following distributions:

We observe that the distributions of age and of the number of years of study vary depending on whether the salary is missing. Older people with more years of study tend not to answer the salary question.

We can therefore conclude that anonymizing the data preserves the structure of the missing data in the dataset.

Conclusion

It is important to understand and analyze why data is missing (MCAR, MAR, or MNAR) in order to choose the anonymization method that best suits your needs. In this example, we have shown that the avatar method preserves the information carried by the missing data. The avatar method provides high-quality anonymized data, maintaining both the accuracy of the analyses and the privacy of the data. To learn more about how the avatar method works, we invite you to consult the documentation.

Editors: Lucie Raimbault & Julien Petot