Synthetic vs anonymous data

“Synthetic” and “anonymous” are two terms that are often used interchangeably in discussions about data privacy. The two properties are not incompatible, but they are not the same thing. This article defines each of them and explains how they differ.

When personal data is reused for ethical purposes that are secondary to the original purpose of collection, the terms “anonymous data” and “synthetic data” are often used without distinction. However, these are two types of data with their own characteristics, and they should not be confused.

Definitions

The General Data Protection Regulation (GDPR) defines anonymous data as follows:

“information that does not relate to an identified or identifiable natural person,
or that has been made irreversibly anonymous.”

In other words, anonymous data is data that cannot be used to identify an individual, even when combined with other external data sources (for example, a voter register). This type of data is not subject to the data protection rules of the GDPR, as it is not considered personal data. When data is anonymous, the people it was collected from are protected from re-identification. This property allows anonymous data to be used for a variety of secondary purposes, such as research, statistical analysis, and marketing, because its use does not require the consent of the individuals concerned. However, it is important to note that the anonymization process must be carried out in accordance with the strict guidelines of the GDPR in order to ensure the protection of personal data. These guidelines are illustrated by the three criteria identified by the European Data Protection Board (EDPB, formerly the Article 29 Working Party, or G29):

  • Singling out (individualization)
  • Linkability (correlation)
  • Inference

See more details in this article.

Synthetic data: artificially generated data that mimics the characteristics of real data. It is created using algorithms and statistical models to simulate data that looks real without containing real personal information. Synthetic data is used for various purposes, in particular to train machine learning models and to test software applications or production environments. One of the main benefits of synthetic data is that it can be generated at scale, making it ideal for scenarios where real data is either expensive or difficult to obtain.
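
To make the idea concrete, here is a minimal sketch of a statistical generator, assuming a single numeric column that is roughly Gaussian. The function names and the sample ages are hypothetical, and real generators model many columns and their dependencies. It only illustrates the principle of fitting a model and then sampling from it.

```python
import random
import statistics


def fit_gaussian(values):
    """Estimate the mean and standard deviation of a numeric column."""
    return statistics.mean(values), statistics.stdev(values)


def generate_synthetic(values, n, seed=0):
    """Draw n artificial values from a Gaussian fitted to the real column."""
    mu, sigma = fit_gaussian(values)
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    return [rng.gauss(mu, sigma) for _ in range(n)]


# Hypothetical "real" ages, used only for illustration.
real_ages = [23, 35, 41, 29, 52, 47, 38, 31]

# The generator can produce far more rows than it was fitted on.
synthetic_ages = generate_synthetic(real_ages, n=1000)
```

The synthetic column follows the overall distribution of the real one, and it can be generated at any scale. Nothing in this sketch says anything about privacy, which is the subject of the next section.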

Synthetic vs anonymous data

Because synthetic data is artificially generated, it is tempting to assume that it is anonymous by default. The ability to share the generation method rather than the data itself seems to be an additional guarantee of privacy and a paradigm shift in the use of data.

However, generative models do not, by themselves, guarantee the confidentiality of their training data. Indeed, generative models can memorize specific details of the training data, including the presence of specific individuals or personal information, and incorporate that information into the synthetic data they generate. This weakness can be exploited in a membership inference attack, in which an attacker tries to determine whether a specific person's data was used to train a machine learning model. This can lead to serious privacy breaches, especially with sensitive data.
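
The sketch below illustrates this risk under strong simplifying assumptions. The "generator" is deliberately naive (it copies training rows and adds tiny noise, which stands in for a model that has memorized its training data), and the attacker uses a simple distance-to-closest-record test. All names, records, and thresholds are hypothetical, and real attacks are more elaborate.

```python
import random


def naive_generator(training, n, noise=0.01, seed=0):
    """A deliberately poor generator: it cycles through the training rows
    and adds tiny noise, so it effectively memorizes the training data."""
    rng = random.Random(seed)
    return [
        tuple(x + rng.uniform(-noise, noise) for x in training[i % len(training)])
        for i in range(n)
    ]


def distance_to_closest(record, synthetic):
    """Euclidean distance from a record to its nearest synthetic neighbor."""
    return min(
        sum((a - b) ** 2 for a, b in zip(record, row)) ** 0.5
        for row in synthetic
    )


def guess_membership(record, synthetic, threshold=0.1):
    """Attacker's guess: a very close synthetic neighbor suggests that
    the record was part of the training data."""
    return distance_to_closest(record, synthetic) < threshold


# Hypothetical (age, income) records.
training = [(34.0, 52000.0), (58.0, 87000.0), (41.0, 61000.0)]
outsider = (29.0, 45000.0)  # a person who was not in the training data

synthetic = naive_generator(training, n=50)

print(guess_membership(training[0], synthetic))  # True: the member is exposed
print(guess_membership(outsider, synthetic))     # False
```

The synthetic rows contain no real record verbatim, yet the attacker can still tell who was in the training set. This is why "artificially generated" does not imply "anonymous".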

In addition, anonymous data is not always synthetic. For example, some anonymization methods are based on the aggregation of real data. k-anonymity is probably the best known of these aggregation methods, with l-diversity and t-closeness as its refinements. These methods rely solely on aggregation and cannot be considered synthetic, since they only generalize the content of the data. So we have an example of data that is anonymous but not synthetic.
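
As an illustration, the following sketch measures k for a small table before and after generalization. The records, the column names, and the choice of age and zip code as quasi-identifiers are assumptions made for the example. The value k is the size of the smallest group of rows that share the same quasi-identifier values.

```python
from collections import Counter


def generalize_age(age, width=10):
    """Replace an exact age with a range, e.g. 23 -> '20-29'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def k_anonymity(rows, quasi_identifiers):
    """Return k: the size of the smallest group of rows that share
    the same values for all quasi-identifiers."""
    groups = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return min(groups.values())


# Hypothetical records, used only for illustration.
records = [
    {"age": 23, "zip": "44000", "diagnosis": "flu"},
    {"age": 27, "zip": "44000", "diagnosis": "asthma"},
    {"age": 34, "zip": "44100", "diagnosis": "flu"},
    {"age": 38, "zip": "44100", "diagnosis": "diabetes"},
]

# Generalization: ages become 10-year ranges, zip codes keep 3 digits.
generalized = [
    {**r, "age": generalize_age(r["age"]), "zip": r["zip"][:3] + "**"}
    for r in records
]

print(k_anonymity(records, ["age", "zip"]))      # 1: every row is unique
print(k_anonymity(generalized, ["age", "zip"]))  # 2: rows hide in pairs
```

Every value in the generalized table still comes from a real person. The data has been coarsened, not synthesized.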

However, it should be borne in mind that aggregated data is not always anonymous either. Let's imagine a data set containing the age of individuals. Naive aggregation into classes such as 0-49, 50-99, 100-149 would likely result in very few people in the third category, allowing for (too) easy identification.
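
This age example can be checked with a few lines of code. The sketch below assumes a hypothetical list of ages and a hypothetical minimum class size of 2, then flags the classes that contain too few people to hide anyone.

```python
from collections import Counter


def age_class(age, width=50):
    """Map an age to a fixed-width class, e.g. 101 -> '100-149'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def small_classes(ages, min_size=2):
    """Return the classes that contain fewer than min_size people."""
    counts = Counter(age_class(a) for a in ages)
    return [label for label, n in counts.items() if n < min_size]


# Hypothetical ages, used only for illustration.
ages = [12, 25, 33, 47, 51, 58, 64, 72, 79, 85, 93, 101]

print(small_classes(ages))  # ['100-149']: a single person, easy to identify
```

Checking class sizes after aggregation is a simple safeguard, although on its own it addresses only the singling-out risk.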

Let's try to explain the confusion

The reason why synthetic data is often confused with anonymous data could be that most, if not all, anonymization methods that don't rely on creating synthetic data have too many disadvantages to be effective. The problem can be a lack of privacy, a lack of utility, or both.

For example, an aggregation method not only reduces the utility of the data, it also changes the structure of the data. Therefore, the aggregated output cannot replace sensitive data in an existing pipeline. We recommend this article if you want to go deeper into the subject of existing anonymization methods.

It explains why, today, someone who wants to anonymize data will likely use a synthetic data generation method.

At Octopize, with our avatar software, we create synthetic and anonymous avatar data that looks like the original data but does not describe real individuals. Through metrics, we ensure that the EDPB criteria are respected while preserving as much of the data's utility as possible.

In summary, privacy cannot be taken for granted when working with synthetic data. Generating private synthetic data requires cutting-edge expertise, and some naive approaches tend to expose sensitive information. However, when done carefully, generating synthetic data that is also anonymous is now the most effective way to preserve maximum utility while maintaining privacy.

Interested in synthetic and anonymous data?
Contact us: contact@octopize.io

Editors: Gaël Russeil & Morgan Guillaudeux
