How to measure the anonymity of a database?
In the era of Big Data, personal data is an essential raw material for research and for the operation of many companies. Despite its great value, however, using this type of data necessarily carries a risk of re-identification and of leaking sensitive information, even after prior pseudonymization (see article 1). For personal data, particularly sensitive data, the risk of re-identification can be seen as a betrayal of the trust of the individuals behind the data, especially when it is used without their clear and informed consent.
The entry into force of the General Data Protection Regulation (GDPR) in 2018, and the Data Protection Act before it, offered a response to this problem by initiating a change in the practices of collecting, processing and storing personal data. An independent advisory body specializing in privacy issues has also been set up: the European Data Protection Board (EDPB), formerly known as the G29. The opinions it has published now serve as references for European national authorities (such as the CNIL in France) in applying the GDPR.
The EDPB recognizes the potential of anonymization for deriving value from personal data while limiting the risks for the individuals it comes from. As a reminder, data is considered anonymous if re-identifying the original individuals is impossible: anonymization is therefore an irreversible process. However, the anonymization methods developed to meet this need are not infallible, and their effectiveness often depends on many parameters (see article 2). To use these methods optimally, the nature of anonymous data needs to be specified more precisely. In its Opinion 05/2014 on Anonymisation Techniques, the G29 identified three criteria for assessing whether re-identification is impossible, namely:
- Individualization
- Correlation
- Inference
1. Individualization: is it always possible to isolate an individual?
The individualization criterion corresponds to the scenario most favorable to an attacker, i.e. a person, malicious or not, seeking to re-identify an individual in a data set. To be considered anonymous, a data set must not allow an attacker to isolate a target individual. In practice, the more information an attacker has about the individual they are trying to isolate, the higher the chances of re-identification. Indeed, in a pseudonymized data set, i.e. one stripped of its direct identifiers, the remaining quasi-identifying attributes act, taken together, as a bar code for an individual's identity. The more prior information the attacker holds about the target, the more precise the query they can craft to try to isolate that individual. An example of an individualization attack is shown in Figure 1.

Figure 1: Re-identification of a patient by individualization in a data set based on two attributes (Age, Gender)
One characteristic of this type of attack is the increased exposure of individuals with unusual characteristics. For an attacker who only knows gender and height, it will be easier to isolate a woman who is 2 meters tall than a man who is 1.75 meters tall.
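The isolation test described above can be sketched with a toy example: express the attacker's prior knowledge as a query over quasi-identifiers and count how many records match. A single match means the individual is singled out. The table, attribute names and values below are invented for illustration, not taken from the article.

```python
# Hypothetical pseudonymized table: no direct identifiers,
# but quasi-identifiers (age, gender, height_cm) remain.
records = [
    {"age": 34, "gender": "F", "height_cm": 200, "diagnosis": "asthma"},
    {"age": 29, "gender": "M", "height_cm": 175, "diagnosis": "diabetes"},
    {"age": 41, "gender": "M", "height_cm": 175, "diagnosis": "flu"},
    {"age": 34, "gender": "F", "height_cm": 168, "diagnosis": "flu"},
]

def matching_records(table, **known_attributes):
    """Return the records consistent with the attacker's prior knowledge."""
    return [r for r in table
            if all(r.get(k) == v for k, v in known_attributes.items())]

# The 2 m tall woman is isolated: exactly one record matches,
# so her sensitive attribute (diagnosis) is exposed.
tall_woman = matching_records(records, gender="F", height_cm=200)
print(len(tall_woman))  # 1 -> singled out

# The 1.75 m man is not isolated: two candidates remain.
average_man = matching_records(records, gender="M", height_cm=175)
print(len(average_man))  # 2 -> no unique re-identification
```

This is essentially the intuition behind k-anonymity: a record is protected against individualization only if at least k > 1 records share its combination of quasi-identifiers.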
2. Correlation: is it always possible to link records relating to an individual together?
Correlation attacks are the most common scenario, so for data to be considered anonymous it is essential that it meets the correlation criterion. Between the democratization of Open Data and the many incidents involving personal data leaks, the amount of available data has never been greater. These databases, which sometimes contain directly identifying personal information, are so many opportunities for attackers to attempt re-identification by cross-referencing. In practice, correlation attacks use directly identifying databases that share information with the base under attack, as illustrated in Figure 2.

Figure 2: Illustration of a correlation attack. The directly identifying external base (top) is used to re-identify individuals in the attacked base (bottom). The correlation is based on common variables.
In the case of the tables in Figure 2, the attacker succeeds in re-identifying the 5 individuals in the pseudonymized base thanks to the two attributes the bases have in common. Moreover, the re-identification allows him to infer new sensitive information about the patients, namely the pathology affecting them. In this context, the more information the databases share, the higher the probability of re-identifying an individual by correlation.
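A correlation attack of the kind shown in Figure 2 can be sketched as a join between an identified external table and the pseudonymized one, using their shared attributes as the linkage key. All names, attributes and values here are illustrative assumptions, not data from the figure.

```python
# Hypothetical external base with direct identifiers (e.g. a leaked list).
external = [
    {"name": "Alice Martin", "age": 34, "zip": "44000"},
    {"name": "Bob Durand", "age": 29, "zip": "75011"},
]

# Hypothetical pseudonymized medical base sharing two attributes (age, zip).
pseudonymized = [
    {"age": 34, "zip": "44000", "pathology": "asthma"},
    {"age": 29, "zip": "75011", "pathology": "diabetes"},
    {"age": 52, "zip": "69002", "pathology": "flu"},
]

def link(external_table, target_table, keys):
    """Naive linkage attack: join the two tables on their common attributes."""
    matches = []
    for person in external_table:
        candidates = [r for r in target_table
                      if all(r[k] == person[k] for k in keys)]
        if len(candidates) == 1:  # unambiguous link -> re-identification
            matches.append((person["name"], candidates[0]["pathology"]))
    return matches

reidentified = link(external, pseudonymized, keys=("age", "zip"))
print(reidentified)
# [('Alice Martin', 'asthma'), ('Bob Durand', 'diabetes')]
```

Each successful link both re-identifies a record and discloses a sensitive attribute (the pathology), exactly the double harm described above.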
3. Inference: can we deduce information about an individual?
Finally, the third and last criterion identified by the EDPB is probably the most complex to assess: the inference criterion. For data to be considered anonymous, it must be impossible to deduce, with near certainty, new information about an individual. For example, if a data set contains the health status of individuals who took part in a clinical study, and all men over 65 in this cohort have lung cancer, then the health status of some participants can be inferred: it is enough to know a man over 65 who took part in the study to be able to state that he has lung cancer.
Inference attacks are particularly effective on groups of individuals who all share the same value of a sensitive attribute. If the inference succeeds, the disclosure of that attribute affects the entire identified group.
Together, these three criteria identified by the EDPB cover the majority of re-identification threats against data that has undergone protective processing. If all three are satisfied, the processing can be considered anonymization in the true sense of the term.
Do current techniques make it possible to satisfy all three criteria?
Randomization and generalization techniques each have advantages and disadvantages with respect to each criterion (see article 2). Figure 3 summarizes how well several anonymization techniques meet the three criteria; it is taken from the Opinion on anonymization techniques published by the former G29.

Figure 3: Strengths and weaknesses of the techniques considered
Clearly, none of these techniques satisfies all 3 criteria simultaneously. They must therefore be used with caution, each in its most favorable context of use. Beyond the methods evaluated, anonymous synthetic data appears to be a promising alternative for satisfying all 3 criteria. However, methodologies for producing synthetic data face the difficulty of providing proof of this protection. At present, synthetic data generation solutions rely on the principle of plausible deniability to argue that a record is protected: if a synthetic record happens to resemble an original one, the defense consists in pointing out that, under such circumstances, it is impossible to prove that the synthetic record is linked to an original record.

At Octopize, we have developed a unique methodology for producing anonymous synthetic data while quantifying, and providing proof of, the protection it offers. This evaluation relies on metrics developed specifically to measure how well the criteria are met, namely, as you will have guessed, individualization, correlation and inference. We will expand on metrics for evaluating the quality and security of synthetic data in another article.