April 20, 2023

How to assess the utility of synthetic data?

There are numerous ways to assess whether synthetic data preserves utility. Here we present univariate, bivariate, and multivariate analyses to assess the information retained by synthetic, anonymous avatar data. We show that avatar data maintains the utility of the original data.

The use of synthetic data is becoming increasingly popular for data analysis and machine learning. By generating new data that mimics the statistical properties of the original data without copying it, synthetic data can be used to exploit the potential of data without compromising the privacy of individuals.

However, to ensure that synthetic data is useful and effective, it is important to assess its utility. In this article, we'll look at how to assess the utility of synthetic data and ensure that it can be used effectively for analysis and modeling.

To assess the level of information retained in synthetic data, we use utility measures that assess two aspects: coherence at the individual level and coherence at the population level.

Coherence at the individual level means that each synthetic record respects all the logical rules that apply to the data. This criterion depends on the data set and will not be developed in this article.

Coherence at the population level means that there is statistical similarity between the original data and the synthetic data. We assess this similarity at three levels:

Comparing the distributions of variables (univariate analysis)
Comparing dependencies between variables (bivariate analysis)
Comparing general data information (multivariate analysis)

In this article, we will describe how to assess the retention of statistical information at the population level. This analysis is global and not specific to the use case. For specific use cases, it is recommended to compare the original and synthetic data based on the analysis of interest.

There are as many possibilities for evaluating utility as there are possible analyses. Here, we will focus on a sample evaluation method.

Comparing the distributions of variables

For each variable in a data set, we compare the distribution of that variable in the original data set (in gray) and in the synthetic data set (in green). The Hellinger distance between the two distributions gives a score between 0 and 1: 0 means that the two distributions are identical, while 1 means that they do not overlap at all (they have no common values).
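To make the score concrete, here is a minimal sketch of a Hellinger distance computed on one numeric variable. It is an illustration only, not the avatar software's implementation: the function name `hellinger_distance`, the choice of 20 shared histogram bins, and the randomly generated columns are all assumptions made for this example.

```python
import numpy as np

def hellinger_distance(original, synthetic, bins=20):
    """Hellinger distance between two numeric samples, using shared histogram bins."""
    original = np.asarray(original, dtype=float)
    synthetic = np.asarray(synthetic, dtype=float)
    lo = min(original.min(), synthetic.min())
    hi = max(original.max(), synthetic.max())
    if lo == hi:
        # Both samples hold the same single value: the distributions are identical.
        return 0.0
    # Shared bin edges, so that the two histograms are directly comparable.
    edges = np.linspace(lo, hi, bins + 1)
    p, _ = np.histogram(original, bins=edges)
    q, _ = np.histogram(synthetic, bins=edges)
    p = p / p.sum()
    q = q / q.sum()
    return float(np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)))

# Hypothetical data: a made-up numeric column and a stand-in for its synthetic version.
rng = np.random.default_rng(0)
original = rng.normal(50, 10, size=1000)
synthetic = rng.normal(50.5, 10, size=1000)
print(round(hellinger_distance(original, synthetic), 3))
```

The result depends on the binning, so the same bin count should be used for every variable being compared. For a categorical variable, the histograms would simply be replaced by the category frequencies.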

In the figure below, we can see small Hellinger distances, which reveal that the Avatar data distributions are similar to the original distributions.

In other cases, we can also use statistical tests such as the Kolmogorov-Smirnov test (for continuous variables) or the chi-square test (for categorical variables) to assess whether the original and avatar samples are drawn from the same distribution.
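As an illustration of the first of these tests, the sketch below computes the two-sample Kolmogorov-Smirnov statistic, which is the largest gap between the two empirical cumulative distribution functions. The function name `ks_statistic` and the small samples are assumptions made for this example. The sketch returns only the statistic; in practice, `scipy.stats.ks_2samp` and `scipy.stats.chi2_contingency` also return a p-value.

```python
import numpy as np

def ks_statistic(original, synthetic):
    """Two-sample Kolmogorov-Smirnov statistic (largest gap between empirical CDFs)."""
    x = np.sort(np.asarray(original, dtype=float))
    y = np.sort(np.asarray(synthetic, dtype=float))
    # The gap between two step functions can only change at an observed value,
    # so it is enough to evaluate both empirical CDFs at every observation.
    grid = np.concatenate([x, y])
    cdf_x = np.searchsorted(x, grid, side="right") / x.size
    cdf_y = np.searchsorted(y, grid, side="right") / y.size
    return float(np.max(np.abs(cdf_x - cdf_y)))

# Hypothetical samples: the second one is shifted by two units.
print(ks_statistic([1, 2, 3, 4], [3, 4, 5, 6]))
```

A statistic close to 0 means the two empirical distributions are nearly indistinguishable, while a value of 1 means the samples do not overlap.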

Comparing dependencies between variables

Evaluating the distributions of variables is not enough. If we generate synthetic data by drawing each variable independently, the distributions will be preserved but the correlations between the variables will be destroyed. Such synthetic data would be of little use for analyses or modeling tasks that depend on these correlations. Therefore, in addition to comparing distributions, it is also important to compare the dependencies or relationships between variables. The Pearson correlation coefficient is generally used to assess the linear relationship between numerical variables.
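The argument above can be checked on a toy example. In the sketch below, which uses assumed names and randomly generated data, each column of a correlated data set is shuffled independently. Every column keeps exactly the same values, so its distribution is unchanged, but the correlation between the columns disappears. The helper `correlation_gap` is a hypothetical summary, defined here as the largest absolute difference between the two Pearson correlation matrices.

```python
import numpy as np

def correlation_gap(original, synthetic):
    """Largest absolute difference between the two Pearson correlation matrices."""
    corr_original = np.corrcoef(original, rowvar=False)
    corr_synthetic = np.corrcoef(synthetic, rowvar=False)
    return float(np.max(np.abs(corr_original - corr_synthetic)))

# Hypothetical data: two numeric variables with a correlation of about 0.8.
rng = np.random.default_rng(0)
original = rng.multivariate_normal([0, 0], [[1.0, 0.8], [0.8, 1.0]], size=1000)

# Naive "synthetic" data: each column is permuted independently of the other.
shuffled = np.column_stack(
    [rng.permutation(original[:, j]) for j in range(original.shape[1])]
)

print(np.corrcoef(original, rowvar=False).round(2))
print(np.corrcoef(shuffled, rowvar=False).round(2))
print(round(correlation_gap(original, shuffled), 2))
```

A univariate comparison would rate the shuffled data as perfect, which is why the bivariate check is needed as well.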

Here, we observe that the avatar data preserves the correlation matrix of the original data.

With this analysis, we understand that the Avatar method preserves dependencies between variables (bivariate analysis). Weak correlations remain weak after anonymization, while stronger correlations remain strong. Other measures, such as mutual information, could be calculated to assess the preservation of bivariate utility for categorical data.
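For categorical variables, a mutual information score can be sketched as follows. The function name `mutual_information` and the small example columns are assumptions made for illustration. The score is 0 when the two variables are independent and grows as one variable becomes more informative about the other.

```python
from collections import Counter
from math import log

def mutual_information(xs, ys):
    """Mutual information (in nats) between two categorical sequences of equal length."""
    if len(xs) != len(ys):
        raise ValueError("both sequences must have the same length")
    n = len(xs)
    joint = Counter(zip(xs, ys))
    count_x = Counter(xs)
    count_y = Counter(ys)
    total = 0.0
    for (x, y), count in joint.items():
        # p(x, y) * log(p(x, y) / (p(x) * p(y))), written with raw counts.
        total += (count / n) * log(count * n / (count_x[x] * count_y[y]))
    return total

# Hypothetical categorical columns.
smoker = ["yes", "yes", "no", "no"]
group = ["a", "a", "b", "b"]      # fully determined by `smoker`
other = ["u", "v", "u", "v"]      # independent of `smoker`
print(mutual_information(smoker, group), mutual_information(smoker, other))
```

To assess utility, the score would be computed for the same pair of variables in the original data and in the synthetic data, and the two values compared.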

Comparing general data information

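Preserving the general information contained in a dataset is one of the main goals of anonymization. To assess multidimensional utility, we can use factor analysis methods: principal component analysis (PCA) for numerical data, multiple correspondence analysis (MCA) for categorical data, and factor analysis of mixed data (FAMD) for a mix of both. These methods make it possible to study the links between numerous variables and individuals in a data set.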
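One common way to make such a comparison is to fit the factor model on the original data and project both data sets into the same space. The sketch below does this with a PCA on numerical data only. It is an assumption-laden illustration, not the avatar software's procedure: the function names `fit_pca` and `project` are hypothetical, and the "synthetic" data here is just the original data with a little added noise, used as a placeholder rather than as an anonymization method.

```python
import numpy as np

def fit_pca(data, n_components=2):
    """Fit a PCA on standardized data; return the mean, the scale, and the axes."""
    data = np.asarray(data, dtype=float)
    mean = data.mean(axis=0)
    scale = data.std(axis=0)
    scale[scale == 0] = 1.0  # leave constant columns unscaled
    standardized = (data - mean) / scale
    # The rows of vt are the principal axes, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(standardized, full_matrices=False)
    return mean, scale, vt[:n_components]

def project(data, mean, scale, components):
    """Project data into the space defined by a previously fitted PCA."""
    return ((np.asarray(data, dtype=float) - mean) / scale) @ components.T

# Hypothetical data: three correlated numeric variables.
rng = np.random.default_rng(0)
covariance = [[1.0, 0.7, 0.2], [0.7, 1.0, 0.1], [0.2, 0.1, 1.0]]
original = rng.multivariate_normal([0, 0, 0], covariance, size=500)
# Placeholder only: NOT an anonymization method.
synthetic = original + rng.normal(scale=0.1, size=original.shape)

mean, scale, components = fit_pca(original)
coords_original = project(original, mean, scale, components)
coords_synthetic = project(synthetic, mean, scale, components)
print(coords_original.var(axis=0).round(2), coords_synthetic.var(axis=0).round(2))
```

Plotting the two sets of coordinates on the same axes gives a scatter plot in which the original and synthetic data can be compared visually. Fitting the model on the original data only ensures that both data sets are described by the same axes.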

The visualization illustrates the similarity between the original data (in gray) and the avatar data (in green). In fact, we can see that the links between the variables and the groups in the dataset are preserved in the avatar dataset.

In the figure below, we see that the information on the variable “Preanti” is retained during anonymization.

In summary, it is important to ensure that synthetic data preserves the useful information in the data. This evaluation is done through the use of metrics. By ensuring that synthetic data is consistent at the individual and population levels, we can ensure that synthetic data can effectively replace original data for analysis and modeling purposes.

• Check out our technical documentation to see an example of an anonymization report that assesses the privacy and utility of avatar data.

• To find out more, read our scientific article published in npj Digital Medicine (a Nature Portfolio journal), which demonstrates that the avatar software preserves utility and privacy in two medical use cases.

Editors: Julien Petot & Alban-Félix Barreteau
