What is the difference between anonymization and pseudonymization?
The concept of anonymous data is surrounded by so many misunderstandings and false ideas that the term “anonymous” does not mean the same thing to everyone who uses it.
To restore some consensus, the Octopize team wanted to clarify the differences between pseudonymization and anonymization, two concepts that are often confused.
At first glance, the term “anonymization” evokes the idea of a mask, of concealment. It is tempting to imagine that anonymization simply amounts to masking an individual's directly identifying attributes (surname, first name, social security number). This shortcut is precisely the trap to avoid: masking these attributes is in fact pseudonymization.
Although similar at first glance, these two concepts involve major differences, both from a legal and from a security point of view.
What is pseudonymization?
According to the CNIL, pseudonymization is “a processing of personal data carried out in such a way that the data can no longer be attributed to a specific natural person without additional information”. It is one of the measures recommended by the GDPR to limit the risks associated with the processing of personal data.
But pseudonymization is not an anonymization method. It merely reduces the linkability of a data set to the original identity of a data subject, and is therefore a useful but not absolute security measure. In practice, pseudonymization consists of replacing the directly identifying attributes of a data set (surname, first name, etc.) with indirectly identifying ones (an alias, a sequential number, etc.), thus preventing the direct re-identification of individuals.
However, pseudonymization does not provide infallible protection, because an individual's identity can also be deduced from a combination of several pieces of information known as quasi-identifiers. In practice, pseudonymized data therefore remains indirectly re-identifiable by cross-referencing information: an individual's identity can be betrayed by one of their indirectly identifying characteristics. The transformation is reversible, which is why pseudonymized data is still considered personal data. To date, the most widely used pseudonymization techniques rely on secret-key cryptographic systems, hash functions, deterministic encryption, and tokenization.
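To make the idea concrete, here is a minimal Python sketch of one of these techniques, keyed hashing: the direct identifier is replaced by a pseudonym, but the rest of the record is untouched. The record, the key, and the field names are invented for illustration.

```python
# Minimal sketch of key-based pseudonymization (keyed hashing).
# The key and the record below are purely illustrative.
import hmac
import hashlib

SECRET_KEY = b"keep-this-key-out-of-the-dataset"  # hypothetical secret key

def pseudonymize(direct_identifier: str) -> str:
    """Replace a direct identifier with a keyed hash (a pseudonym)."""
    digest = hmac.new(SECRET_KEY, direct_identifier.encode(), hashlib.sha256)
    return digest.hexdigest()[:12]

record = {"name": "Alice Martin", "age": 34, "postal_code": "75011"}
pseudonymized = {**record, "name": pseudonymize(record["name"])}
# The direct identifier is gone, but age and postal_code remain:
# the record is pseudonymized, not anonymized.
```

Note that whoever holds the secret key can recompute the pseudonyms, and the remaining attributes are still indirectly identifying, which is precisely why this is not anonymization.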
The “AOL (America On Line) case” is a textbook illustration of the confusion between pseudonymization and anonymization. In 2006, a database containing twenty million search queries from more than 650,000 users over a three-month period was publicly released, with no privacy-preserving measure other than replacing each AOL user ID with a numerical attribute (pseudonymization).
Despite this treatment, the identity and location of some users were made public. Queries sent to a search engine, especially when they can be combined with other attributes such as IP addresses or configuration parameters, have a very high identification potential.
This incident is just one of many pitfalls showing that a pseudonymized data set is not anonymous: simply replacing the identity does not prevent an individual from being re-identified from quasi-identifying information (age, gender, postal code). In many cases, identifying an individual in a pseudonymized data set is as easy as in the original data (the “Who is this?” game).
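The cross-referencing attack described above can be sketched in a few lines of Python. All data here is invented: a pseudonymized record is linked back to a name using only the quasi-identifiers it shares with a hypothetical public source (for example an electoral roll).

```python
# Toy illustration (invented data) of re-identification by cross-referencing.
pseudonymized_rows = [
    {"pseudo": "a1f3", "age": 34, "gender": "F", "postal_code": "75011", "diagnosis": "asthma"},
    {"pseudo": "b7c9", "age": 52, "gender": "M", "postal_code": "69003", "diagnosis": "diabetes"},
]

# A hypothetical public data set sharing the same quasi-identifiers.
public_rows = [
    {"name": "Alice Martin", "age": 34, "gender": "F", "postal_code": "75011"},
    {"name": "Bernard Roche", "age": 52, "gender": "M", "postal_code": "69003"},
]

QUASI_IDENTIFIERS = ("age", "gender", "postal_code")

def reidentify(pseudo_row, public):
    """Return a name if exactly one public record matches on the quasi-identifiers."""
    matches = [p for p in public
               if all(p[q] == pseudo_row[q] for q in QUASI_IDENTIFIERS)]
    return matches[0]["name"] if len(matches) == 1 else None

# A unique match on (age, gender, postal_code) betrays the identity,
# and with it the sensitive attribute, despite the pseudonym.
```

A single unique combination of quasi-identifiers is enough: the pseudonym protected nothing once an external source with overlapping attributes existed, which is exactly what happened in the AOL case.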
What is the difference with anonymization?
Anonymization, on the other hand, consists of applying techniques that make it impossible, in practice, to re-identify the individuals behind the anonymized personal data. This treatment is irreversible, which means that anonymized data is no longer considered personal data and therefore falls outside the scope of the GDPR. To characterize anonymization, the European Data Protection Board (formerly G29) relies on the three criteria set out in Opinion 05/2014 (source at the bottom of the page):
- Individualization: the anonymized data must not make it possible to single out an individual. Even with all the quasi-identifying information relating to an individual, it must be impossible to distinguish that individual in the database once anonymized.
- Correlation: the anonymized data must not be re-identifiable by cross-referencing with other data sets. It must be impossible to link two data sets from different sources concerning the same individual; once anonymized, an individual's health data must not be linkable to their banking data on the basis of common information.
- Inference: the data must not allow additional information about an individual to be reasonably inferred. For example, it must be impossible to determine with certainty an individual's health status from the anonymized data.
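The first criterion, individualization, can be tested mechanically: count how many records share each combination of quasi-identifiers, and check that no group has size 1. The sketch below does this on an invented toy data set; the field names are assumptions, and a real assessment would of course cover all three criteria, not just this one.

```python
# Hedged sketch: checking the "individualization" criterion on a toy
# data set by grouping records on their quasi-identifiers.
from collections import Counter

rows = [
    {"age": 34, "gender": "F", "postal_code": "75011"},
    {"age": 34, "gender": "F", "postal_code": "75011"},
    {"age": 52, "gender": "M", "postal_code": "69003"},
]

def smallest_group(rows, quasi_identifiers=("age", "gender", "postal_code")):
    """Size of the smallest group of records sharing the same quasi-identifiers."""
    counts = Counter(tuple(r[q] for q in quasi_identifiers) for r in rows)
    return min(counts.values())

# smallest_group(rows) == 1 here: the third record is unique on its
# quasi-identifiers and can be singled out, so the individualization
# criterion is not met for this data set.
```

This is the intuition behind k-anonymity-style measures: every individual should hide in a group of at least k indistinguishable records.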
It is only when these three criteria are met that data is considered anonymous in the strict sense. The data then changes legal status: it is no longer considered personal data and falls outside the scope of the GDPR.
Our avatar solution
To date, there are several families of anonymization methods, which we will detail in our next article. For the most part, these methods protect data by degrading its quality, structure, or granularity, thus limiting its informative value after processing. The real challenge is to resolve the paradox between protecting each individual's data and exploiting it in the interest of all.
The avatar anonymization software, developed by Octopize, is a unique anonymization solution that resolves this paradox between protecting patients' personal data and sharing that data for its informative value. The avatar software, which was successfully evaluated by the CNIL, uses synthetic data to ensure, on the one hand, the privacy of the original data (and therefore its safe sharing) and, on the other hand, to preserve its informative value.
Sources:
- The “AOL” case (America On Line): https://rig.cs.luc.edu/~rig/ecs/probsolve/NYTonSearch.pdf
- European Data Protection Board (formerly G29), Opinion 05/2014: https://www.cnil.fr/sites/default/files/atoms/files/wp216_fr.pdf