What are the different anonymization techniques?
After differentiating the concepts of anonymization and pseudonymization in a previous article, it is important for the Octopize team to take stock of the various existing techniques for anonymizing personal data.
Anonymization techniques
Before discussing data anonymization, note that pseudonymization must first be carried out to remove any directly identifying attribute from the data set: this is an essential first security step. Anonymization techniques then handle the quasi-identifying attributes. By combining them with a prior pseudonymization step, we ensure that the direct identifiers are also dealt with, and thus protect all personal information related to an individual.
Second, as a reminder, anonymization consists in applying techniques in such a way as to make it impossible, in practice, to re-identify the individuals behind the anonymized personal data. The process is irreversible, which means that anonymized data are no longer considered personal data and therefore fall outside the scope of the GDPR.
To characterize anonymization, the EDPB (European Data Protection Board), successor to the G29 working group, set out 3 criteria to be met, namely:
- Individualization: is it still possible to isolate an individual?
- Correlation: is it still possible to link together records relating to the same individual?
- Inference: can information about an individual be deduced?
The EDPB then defines two main families of anonymization techniques, namely randomization and generalization.
Randomization is the process of changing attributes in a data set so that they are less accurate, while maintaining the overall distribution.
This family protects the data set against the risk of inference. Examples of randomization techniques include noise addition, permutation and differential privacy.
Example of randomization: permuting the dates of birth of individuals in order to alter the veracity of the information contained in a database.
Generalization consists in changing the scale of the attributes of data sets, or their order of magnitude, to ensure that they are common to a set of people.
This technique makes it possible to avoid the individualization of a data set. It also limits the possible correlations of the data set with others. Examples of generalization techniques include aggregation, k-anonymity, l-diversity and even t-closeness.
Example of generalization: in a file containing individuals' dates of birth, replacing this information with the year of birth alone.
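As a toy illustration of this generalization example (the record values below are invented for the sketch), a date of birth can be reduced to the year alone:

```python
from datetime import date

# Toy records (invented values), assumed to be already pseudonymized.
records = [
    {"pseudonym": "A1", "birth": date(1984, 3, 14)},
    {"pseudonym": "B2", "birth": date(1984, 11, 2)},
]

# Generalization: replace the full date of birth by the year alone,
# so that both individuals now share the same, coarser value.
generalized = [
    {"pseudonym": r["pseudonym"], "birth_year": r["birth"].year}
    for r in records
]

print(generalized)
```

After this step, the exact date can no longer single out either individual within the group sharing the same year.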
Each of these techniques addresses certain challenges, with its own share of advantages and disadvantages. We will therefore detail the operating principle of these various methods and explain, through concrete examples, the limits to which they are subject.
What technique should you use and why?
Each of the anonymization techniques may be appropriate, depending on the circumstances and context, to achieve the desired purpose without compromising the data subjects' right to privacy.
The randomization family:
1- The addition of noise:
Principle: changing the attributes of the data set to make them less accurate. Example: after anonymization by noise addition, the ages of the patients are changed by plus or minus 5 years.
Strengths:
- If noise addition is applied effectively, a third party will not be able to identify an individual nor will they be able to restore the data or otherwise discern how the data was changed.
- Relevant when attributes can have a significant negative effect on individuals.
- Maintains the overall distribution.
Weaknesses:
- The noise introduced alters the quality of the data, so the analyses carried out on the data set are less relevant.
- The noise level depends on the level of information required and on the impact that the disclosure of attributes would have on the privacy of individuals.
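A minimal sketch of this principle, assuming uniform integer noise of at most plus or minus 5 years (the ages and the noise distribution are illustrative choices, not a recommendation):

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

ages = [34, 41, 27, 58, 45]

# Perturb each age by a uniform integer offset in [-5, 5], as in the
# "plus or minus 5 years" example above. Individual values change,
# but the overall distribution is only slightly distorted.
noisy_ages = [age + random.randint(-5, 5) for age in ages]

print(noisy_ages)
```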
Common mistakes:
- Adding inconsistent noise: the technique fails if the noise is not semantically viable (that is, if it is disproportionate or does not respect the logic between the attributes of a record) or if the data set is too sparse.
- Assuming that adding noise is sufficient: noise addition is a complementary measure that makes it harder for an attacker to recover the data; it should not be assumed to be an anonymization solution sufficient in itself.
Usage failure:
The Netflix case:
In the Netflix case, the initial database was made public as "anonymized" in accordance with the company's internal privacy policy (all identifying user information had been removed except ratings and dates).
It was nevertheless possible to re-identify 68% of Netflix users by cross-referencing with an external database: users were uniquely identified in the data set using 8 ratings and their dates, with a 14-day margin of error, as selection criteria.
2- The permutation:
Principle: consists in mixing attribute values in a table in such a way that some of them are artificially linked to different data subjects. Permutation therefore alters the values within the data set by simply exchanging them from one record to another. Example: after anonymization by permutation, the age of patient A has been replaced by that of patient J.
Strengths:
- Useful when it is important to maintain the exact distribution of each attribute in the data set.
- Guarantees that the range and distribution of values will remain the same.
Weakness:
- Does not preserve the correlations between values and individuals, which makes advanced statistical analyses (regression, machine learning, etc.) impossible.
Common mistakes:
- Selecting the wrong attribute: exchanging attributes that are neither sensitive nor risky brings no significant gain in terms of personal data protection; if the sensitive attributes remain associated with their original records, an attacker can still extract them.
- Random permutation of correlated attributes: if two attributes are highly correlated, swapping them randomly will not offer solid guarantees.
Usage failure: permutation of correlated attributes
In the following example, we can see that, intuitively, an attacker will seek to re-link salaries to jobs according to correlations that seem logical (see arrow).
Thus, the random permutation of attributes does not offer guarantees of privacy when there are logical links between different attributes.

Table 1. Example of ineffective anonymization by permutation of correlated attributes
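The basic permutation mechanism can be sketched as follows (toy values; as the example above shows, a real deployment must also account for correlated attributes):

```python
import random

random.seed(1)  # reproducible illustration

patients = ["A", "B", "C", "D"]
ages = [30, 45, 52, 61]

# Permutation: shuffle the age column so that values are exchanged
# between records. The set of values (range and distribution) stays
# exactly the same, but ages are now attached to different patients.
shuffled_ages = ages[:]
random.shuffle(shuffled_ages)

permuted = dict(zip(patients, shuffled_ages))
print(permuted)
```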
3- Differential privacy:
Principle: differential privacy consists in producing anonymized views of a data set while keeping a copy of the original data.
The anonymized view is generated in response to a query made by a third party on the database, and noise is added to the result. For the mechanism to be considered "differentially private", the presence or absence of any particular individual must not appreciably change the result of the query.
Strength:
- Adaptability: unlike the practice of sharing a data set as a whole, the results of differential privacy queries can be delivered on a case-by-case basis, depending on the request and the authorized third party, which facilitates governance.
Weaknesses:
- Does not allow the data set to be shared in its initial structure, thus limiting the range of analyses that can be carried out.
- Verification must be ongoing (at least at each new query) to detect any possibility of identifying an individual in the query results.
- The data are not modified directly: the noise is added after the fact, relative to a query, so the original data still exist; as such, the results can also be considered personal data.
- To limit inference and correlation attacks, it is necessary to keep track of the queries submitted by an entity and to monitor the information obtained about the individuals concerned. This is a weakness of the method: "differentially private" databases should therefore not be exposed through open search engines, which allow no control over the identity of the requester or the nature of their queries.
Common mistakes:
- Not injecting enough noise: to prevent links from being established with background knowledge, noise must be added. The hardest part, from a data protection perspective, is generating the appropriate level of noise to add to the real answers, so as to protect the privacy of individuals without destroying the utility of the data.
- Not allocating a privacy budget: a record of the queries made must be kept, and a privacy budget allocated so that the amount of noise added increases when a query is repeated.
Usage failures:
- Independent treatment of each query: without keeping a history of queries and adapting the noise level, the results of repeating the same query, or of combining several, could lead to the disclosure of personal information. An attacker could in fact submit several queries that gradually isolate an individual and reveal one of their characteristics. Note also that differential privacy only answers one question at a time, so the original data must be kept for the entire defined period of use.
- Re-identification of individuals: differential privacy does not guarantee the non-disclosure of personal information. An attacker can in fact re-identify individuals and reveal their characteristics by using another data source or by inference. For example, in this article (source: https://arxiv.org/abs/1807.09173), researchers from the Georgia Institute of Technology (Atlanta) developed "membership inference attacks", an algorithm that re-identifies training data (hence sensitive data) from a differentially private model. The researchers conclude that further research is needed to find a differential privacy mechanism that is stable and robust against membership inference attacks. Differential privacy therefore does not appear to be a completely secure protection.
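As an illustration, the widely used Laplace mechanism adds noise scaled to the query's sensitivity divided by the privacy parameter epsilon (the query, the true count and the epsilon value below are assumptions for the sketch):

```python
import math
import random

def dp_count(true_count, epsilon):
    """Laplace mechanism for a counting query (sensitivity 1):
    return the true answer plus Laplace(0, 1/epsilon) noise,
    sampled via the inverse CDF."""
    u = random.random() - 0.5
    noise = -(1.0 / epsilon) * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))
    return true_count + noise

random.seed(42)
# A query such as "how many patients are over 60?" answered with
# epsilon = 1.0: each call returns the true count (here 17) plus
# fresh noise.
print(dp_count(17, epsilon=1.0))
```

A smaller epsilon means more noise and stronger privacy; repeated queries consume the privacy budget, which is exactly the bookkeeping discussed above.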
The generalization family:
1- Aggregation and k-anonymity:
Principle: generalizing attribute values to the point where all individuals in a group share the same value. These two techniques aim to prevent a data subject from being isolated, by grouping them with at least k other individuals. Example: so that at least 20 individuals share the same value, the age of all patients between 20 and 25 years old is set to 23 years.
Strength:
- Individualization: Once the same attributes are shared by k users, it should no longer be possible to isolate an individual within a group of k users.
Weaknesses:
- Inference: k-anonymity does not prevent all inference attacks. Indeed, if all the individuals in a group share a property, then knowing which group an individual belongs to makes it easy to obtain the value of that property.
- Loss of granularity: Data resulting from generalization processing necessarily lose fineness and sometimes coherence.
Common mistakes:
- Neglecting some quasi-identifiers: the choice of the parameter k is the key parameter of the k-anonymity technique. The higher the value of k, the stronger the privacy guarantees. A common error, however, is to increase this parameter without considering all the variables: sometimes a single overlooked variable is enough to re-identify a large number of individuals and render the generalization applied to the other quasi-identifiers useless.
- Low k value: if k is too small, the weight of an individual within a group is too large and inference attacks are more likely to succeed. For example, if k = 2, the probability that the two individuals share the same property is greater than when k > 10.
- Not grouping individuals of similar weight: the parameter k must be adapted when the values of a variable are unevenly distributed.
Usage failure:
The main problem associated with k-anonymity is that it doesn't prevent inference attacks. In the following example, if the attacker knows that an individual is in the data set and was born in 1964, they also know that this individual had a heart attack. Moreover, if we know that this data set was obtained from a French organization, we can deduce that each of the individuals resides in Paris (since the first three digits of the postal codes are 750*).

Table 2. An example of ill-designed k-anonymization
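A hypothetical helper illustrating the k-anonymity property on records shaped like Table 2 (the function name and values are invented for the sketch):

```python
from collections import Counter

def is_k_anonymous(records, quasi_identifiers, k):
    """Return True if every combination of quasi-identifier values
    is shared by at least k records."""
    groups = Counter(
        tuple(r[q] for q in quasi_identifiers) for r in records
    )
    return all(count >= k for count in groups.values())

# Invented records inspired by the example above: generalized birth
# year and postal code as quasi-identifiers, diagnosis as the
# sensitive attribute.
records = [
    {"birth_year": "196*", "zip": "750*", "diagnosis": "heart attack"},
    {"birth_year": "196*", "zip": "750*", "diagnosis": "heart attack"},
    {"birth_year": "197*", "zip": "750*", "diagnosis": "flu"},
    {"birth_year": "197*", "zip": "750*", "diagnosis": "asthma"},
]

print(is_k_anonymous(records, ["birth_year", "zip"], k=2))  # True
```

This data set is 2-anonymous, yet the first class still leaks the diagnosis by inference: both of its members had a heart attack, which is precisely the weakness described above.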
To address the shortcomings of k-anonymity, other aggregation techniques have been developed, including l-diversity and t-closeness. These two techniques refine k-anonymity by ensuring that each equivalence class contains at least l different values of the sensitive attribute (l-diversity) and that the distribution within each class resembles the initial distribution of the data (t-closeness).
It should be noted that, despite these improvements, the main weaknesses of k-anonymity presented above remain.
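The l-diversity refinement can be checked in the same spirit (a hypothetical helper with invented values):

```python
from collections import defaultdict

def is_l_diverse(records, quasi_identifiers, sensitive, l_min):
    """Return True if each equivalence class contains at least l_min
    distinct values of the sensitive attribute."""
    classes = defaultdict(set)
    for r in records:
        key = tuple(r[q] for q in quasi_identifiers)
        classes[key].add(r[sensitive])
    return all(len(values) >= l_min for values in classes.values())

# A class that is 2-anonymous but only 1-diverse: knowing the class
# is enough to learn the diagnosis.
records = [
    {"birth_year": "196*", "zip": "750*", "diagnosis": "heart attack"},
    {"birth_year": "196*", "zip": "750*", "diagnosis": "heart attack"},
]

print(is_l_diverse(records, ["birth_year", "zip"], "diagnosis", l_min=2))  # False
```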
Thus, these various generalization and randomization techniques each have security advantages, but they do not always fully meet the 3 criteria set out by the EDPB (formerly the G29), as shown in Table 3, "Strengths and weaknesses of the techniques considered", produced by the CNIL.

Table 3. Strengths and weaknesses of the techniques considered
Derived from more recent anonymization techniques, synthetic data now appear to offer better anonymization solutions.
Synthetic data case
Recent years of research have seen the emergence of solutions for generating synthetic records that retain a high degree of statistical relevance and facilitate the reproducibility of scientific results. They are based on building models that understand and reproduce the global structure of the original data. Two approaches stand out in particular: generative adversarial networks (GANs) and methods based on conditional distributions.
Strength:
- High level of guarantee in terms of maintaining the structure, fineness and statistical relevance of the data generated.
Weakness:
- Models can lead to the generation of synthetic data that are very similar, or even identical, to the original records. Faced with an attack linking such synthetic data to an individual, the only defense is to argue that the attacker is not in a position to prove the link. This situation can lead to a loss of trust among the people behind the data.
The avatar anonymization software, developed by Octopize, uses a unique, patient-centered conceptual approach, allowing the creation of anonymous, protected and relevant synthetic data while providing proof of their protection. Its compliance has been demonstrated by the CNIL on the 3 criteria of the EDPB. Click here to learn more about avatar data.
Rapid evolution of techniques
Finally, the CNIL (National Commission for Information Technology and Freedoms) recalls that, since anonymization and re-identification techniques evolve regularly, it is essential for any data controller concerned to carry out regular monitoring in order to preserve, over time, the anonymous nature of the data produced. This monitoring must take into account the technical resources available and the other data sources that could make it possible to lift the anonymity of the information.
The CNIL underlines that research into anonymization techniques is ongoing and shows that no technique is, in itself, infallible.
Sources:
https://www.cnil.fr/sites/default/files/atoms/files/wp216_fr.pdf
https://edpb.europa.eu/edpb_fr
Membership Inference Attacks link: https://arxiv.org/pdf/1807.09173.pdf
Netflix link: https://arxiv.org/PS_cache/cs/pdf/0610/0610105v2.pdf