Assessing the privacy of a data set

Regardless of the technique chosen to process personal data in order to protect it, there is no guarantee of privacy until privacy metrics are computed on the processed data. These metrics should cover different attack scenarios and must address the three criteria identified by the EDPB in the context of the GDPR: individualization, correlation and inference. In this article, we provide an overview of some of the privacy metrics we use at Octopize to ensure that our Avatar software produces synthetic data that is completely anonymous.


One of the essential points to address before delving into the question of the privacy of a dataset is the concept of pseudonymization versus that of anonymization. These terms are often used interchangeably, but are in fact very different in terms of protecting individuals.

  • Pseudonymization consists of replacing direct identifiers, such as first and last names, with new identifiers, using techniques such as hashing, tokenization or encryption. Pseudonymized data is still considered personal data and remains subject to the GDPR (General Data Protection Regulation).
  • Anonymization consists of applying techniques that make it impossible in practice to re-identify an individual in a data set. This processing is irreversible, which implies that anonymized data is no longer considered personal data and therefore falls outside the scope of the GDPR. There are a variety of anonymization methods; the choice depends on the degree of risk and the intended use of the data.

Note that pseudonymization is a necessary step before anonymization, as direct identifiers add no value to a data set. The short sketch below illustrates what pseudonymization typically looks like.
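The sketch is a minimal illustration of pseudonymization, not a recommendation of a particular scheme; the function, column names and salt are hypothetical:

```python
import hashlib
import pandas as pd

def pseudonymize(df: pd.DataFrame, identifier_columns: list[str], salt: str) -> pd.DataFrame:
    """Replace direct identifiers with salted SHA-256 hashes (illustration only)."""
    out = df.copy()
    for col in identifier_columns:
        out[col] = out[col].apply(
            lambda value: hashlib.sha256((salt + str(value)).encode()).hexdigest()
        )
    return out

# Hypothetical usage: pseudonymize(customers_df, ["first_name", "last_name"], salt="...")
```

Anyone holding the salt, or able to enumerate plausible identifier values, can rebuild the mapping, which is exactly why pseudonymized data remains personal data.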

To be considered anonymous, a dataset must meet the three criteria identified by the European Data Protection Board (EDPB, formerly known as the Article 29 Working Party, or G29). To measure compliance with these criteria, it is always necessary to compare the original data set to its processed version, where processing means any technique aimed at improving the privacy of the data set (adding noise, generative models, Avatar).

Privacy according to the EDPB

Before we dive into the specific measures and how they are measured, we need to clarify what we are really trying to prevent.

We are going to take the official EDPB criteria and add some examples to highlight the main differences between the three.

These criteria are as follows:

  • Individualization is the risk of identifying an individual in a data set.

Example: you work for an insurance company and have a data set on your customers and their vehicles. You simply remove the direct identifiers, namely their names. But because the combination of the remaining values (vehicle type, brand, vehicle age, color) is unique, you can still identify each of your customers, even without their names. A short check of this idea is sketched after this list.

  • Correlation is the ability to link individuals to an external data source that shares common characteristics.

Example: a recruitment agency's dataset lists its customers and their salaries, along with other information. In a separate, publicly accessible database (for example LinkedIn), you gather information such as job title, city, and company. With this information, you are in a position to link each individual from one data set to the other, allowing you to gain new information, such as salary.

  • Inference is the possibility of deducing, with significant probability, information about individuals from the anonymized data set.

Example: a pharmaceutical company has a data set on people who participated in a clinical trial. If you know that a particular individual is male, and that all of the men in the data set are overweight, you can infer that this specific individual is overweight without even having to single out his record.
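To make the individualization example concrete, the short check announced above counts how many records remain unique on a few quasi-identifiers once names are removed; the data and column names are invented:

```python
import pandas as pd

# Invented toy data standing in for the insurance example above.
vehicles = pd.DataFrame({
    "vehicle_type": ["SUV", "sedan", "SUV", "hatchback"],
    "brand": ["A", "B", "A", "C"],
    "vehicle_age": [3, 5, 7, 2],
    "color": ["red", "black", "red", "blue"],
})

quasi_identifiers = ["vehicle_type", "brand", "vehicle_age", "color"]
combination_counts = vehicles.value_counts(subset=quasi_identifiers)
n_unique = int((combination_counts == 1).sum())
print(f"{n_unique} of {len(vehicles)} customers are unique on these attributes")
```

Every record that is unique on attributes an attacker might plausibly know is a candidate for individualization, with or without a name.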

Assessing the risks of individualization

The first family of metrics that we will now present aims to assess the protection of a data set against individualization attacks. These attacks can take a variety of forms, requiring different complementary measures. Some individualization measures are independent of the model and can therefore be used on any pair of original and processed datasets. Other metrics require temporarily maintaining a link between the original and processed individuals.

Model-agnostic metrics

We now present two simple metrics that can be used on datasets processed by any technique. These metrics are particularly useful when it comes to comparing the results of different approaches.

  • Distance To Closest. To calculate the DTC, we measure the distance between each synthetic individual and its closest original. The median of these distances is retained in order to have a single representative value for the metric. The reasoning behind DTC is that if each synthetic individual is very close to an original, the data set could present a risk of individualization. However, a low DTC does not necessarily mean that there is a risk, which is why the Closest Distances Ratio is measured to complement it.
  • Closest Distances Ratio. To compute the CDR, the distance between each synthetic individual and its closest original is divided by the distance to its second closest original. In other words, we measure how much closer the nearest original is than the next one. If the ratio is high (close to 1), the two closest originals are at roughly the same distance and it is therefore impossible in practice to tell with certainty which one the synthetic individual derives from. From the ratios calculated for each processed individual, the median is kept to provide a single CDR value. There is a risk of individualization when both the DTC and the CDR are low. A minimal sketch of both computations is given just after this list.
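Here is that sketch, assuming the data has already been numerically encoded and scaled so that Euclidean distance is meaningful; it is an illustration, not Octopize's actual implementation:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def dtc_and_cdr(original: np.ndarray, synthetic: np.ndarray) -> tuple[float, float]:
    """Median Distance To Closest (DTC) and Closest Distances Ratio (CDR)."""
    # For each synthetic record, find its two closest original records.
    nn = NearestNeighbors(n_neighbors=2).fit(original)
    distances, _ = nn.kneighbors(synthetic)
    # DTC: median distance to the closest original.
    dtc = float(np.median(distances[:, 0]))
    # CDR: distance to the closest original divided by the distance to the second
    # closest; values near 1 mean the two candidates are indistinguishable.
    cdr = float(np.median(distances[:, 0] / (distances[:, 1] + 1e-12)))
    return dtc, cdr
```

For mixed numerical and categorical data, a distance suited to that mix (Gower, for instance) would be a more sensible choice than plain Euclidean distance.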

To go further: our metrics

A data set with a high DTC and a high CDR indicates that the processing applied to the data has changed the characteristics of individuals. However, even if the avatars are far removed from the originals, there is still a risk that the original individuals could be associated with their most similar synthetic counterpart.

At Octopize, our processing generates anonymous synthetic data. We have developed additional metrics, putting ourselves in a worst-case scenario where an attacker holds both the original and the anonymized data. Although unlikely in practice, this approach is recommended by the EDPB. The Hidden Rate and the Local Cloaking are metrics that measure the protection of the data against distance-based individualization attacks. Both require that the link between each original individual and its synthetic version be temporarily available.

To illustrate these metrics, let's look at a simplified example in which a cohort of animals (why not!) is anonymized (with our Avatar solution, for example).

With individual-centered anonymization solutions, a synthetic individual is generated from each original. The link between originals and synthetic individuals can be used to measure the level of protection against distance-based attacks. In our example, the ginger cat has been anonymized as a cheetah, while the synthetic individual created from the tiger is a black cat.

A distance-based attack assumes that re-identification can be attempted by associating each original with its most similar synthetic individual. In our example, a purely distance-based matching would associate the ginger cat with the black cat, the tiger with the cheetah, and so on.

  • Hidden Rate. The Hidden Rate is the probability that an attacker makes a mistake when connecting an individual to their most similar synthetic individual. This is where the temporarily preserved link between the original and its synthetic counterpart comes into play.

In this illustration, most distance-based matches are incorrect, so the Hidden Rate is high, illustrating good protection against distance-based individualization attacks.

  • Local Cloaking. Local Cloaking is the number of synthetic individuals that are more similar to an original individual than the synthetic individual generated from it. The higher this number for an individual, the better protected they are. The median of the local cloaking values across all individuals is used to evaluate a data set.

In this figure, we illustrate how local cloaking is calculated for a single original individual, in this case the ginger cat. Thanks to the link that we keep temporarily, we know that the synthetic individual actually generated from the ginger cat is the cheetah. Its local cloaking is the number of synthetic individuals that are closer to it than the cheetah. In this example, there is only one such synthetic individual, the black cat, which means that the local cloaking of the ginger cat is 1. The same calculation is carried out for all originals. A sketch of both metrics follows.
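The sketch below assumes numerically encoded data, a meaningful distance, and that synthetic[i] was generated from original[i] thanks to the temporarily retained link; the function name is ours, not the Avatar library's API:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def hidden_rate_and_local_cloaking(original: np.ndarray, synthetic: np.ndarray):
    """Assumes synthetic[i] was generated from original[i] (the retained link)."""
    n = len(original)
    # Rank all synthetic records by distance to each original record.
    nn = NearestNeighbors(n_neighbors=n).fit(synthetic)
    _, indices = nn.kneighbors(original)
    true_link = np.arange(n)
    # Hidden Rate: share of originals whose closest synthetic record is NOT
    # the one actually generated from them (the attacker's match fails).
    hidden_rate = float(np.mean(indices[:, 0] != true_link))
    # Local Cloaking: for each original, the number of synthetic records that
    # are closer than its own synthetic counterpart (its rank in the list).
    ranks = np.argmax(indices == true_link[:, None], axis=1)
    local_cloaking = float(np.median(ranks))
    return hidden_rate, local_cloaking
```

In the animal example, the ginger cat's synthetic counterpart is the cheetah, but one synthetic individual (the black cat) sits closer, so its rank, and therefore its local cloaking, is 1.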

The four metrics we've just looked at provide good coverage of protection against individualization attacks, but as we saw at the beginning of this article, there are other types of attacks that personal data should be protected against.

Correlation risk assessment

Metrics that address the correlation criterion respond to a more common and more likely attack scenario.

The attacker has a processed data set and an external identification database (for example, a register of voters) containing information common to the processed data (for example, age, gender, postal code). The more information the two databases have in common, the more effective the attack will be.

Correlation protection rate

The Correlation Protection Rate assesses the percentage of individuals who would not be successfully linked to their synthetic counterpart if the attacker used an external data source. The variables selected as common to both databases should be ones that are likely to be found in an external data source (for example, age should be included whereas concentration_insuline_D2 should not). To cover the worst-case scenario, we assume that the same individuals are present in both databases; in practice, some individuals in the anonymized database are not present in the external data source and vice versa. This metric also relies on the temporarily maintained link between each original and its synthetic counterpart, which is used to measure how many matches are incorrect. A sketch of this evaluation is given below.
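The sketch assumes pandas DataFrames in which row i of the synthetic data was generated from row i of the original data, and uses the original records as a stand-in for the external source (the worst case described above); the function and column names are illustrative:

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import NearestNeighbors

def correlation_protection_rate(
    original: pd.DataFrame, synthetic: pd.DataFrame, common_columns: list[str]
) -> float:
    """Share of individuals NOT correctly linked using only the shared columns."""
    n = len(original)
    # The attacker only sees the attributes shared with the external source.
    nn = NearestNeighbors(n_neighbors=1).fit(synthetic[common_columns].to_numpy())
    _, indices = nn.kneighbors(original[common_columns].to_numpy())
    # A linkage succeeds when the closest synthetic record is the true counterpart.
    failed_links = indices[:, 0] != np.arange(n)
    return float(failed_links.mean())

# Hypothetical usage with quasi-identifiers an attacker could plausibly know:
# rate = correlation_protection_rate(original_df, synthetic_df, ["age", "gender", "zip"])
```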

Inference-based risk assessment

Metrics that address the inference criterion respond to another type of attack, where the attacker seeks to infer additional information about an individual from the available anonymized data.

  • Inference metric. The inference metric quantifies the possibility of deducing, with significant probability, the original value of a target variable from the processed values of the other variables. It can be used with both numerical and categorical targets. When the target is numerical, we speak of a regression inference metric and protection is evaluated as the mean absolute difference between the value predicted by the attacker and the original value. Conversely, when the target is categorical, we speak of a classification inference metric and the level of protection is reflected by the accuracy of the prediction. A sketch of such an attack follows.
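One common way to simulate such an attack is sketched below: fit a model on the processed data and use it to predict each individual's original target value. This is an illustration under assumptions (model choice, encoding), not necessarily how the metric is implemented in the Avatar software:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.metrics import accuracy_score, mean_absolute_error

def inference_metric(processed: pd.DataFrame, original: pd.DataFrame, target: str) -> float:
    """Attacker trains on processed data, then predicts original target values.
    Assumes the predictor columns are already numerically encoded."""
    X_attack = processed.drop(columns=[target])
    y_attack = processed[target]
    X_victims = original.drop(columns=[target])
    y_true = original[target]
    if pd.api.types.is_numeric_dtype(y_true):
        # Regression inference: mean absolute gap between prediction and truth
        # (a larger gap means better protection).
        model = RandomForestRegressor(random_state=0).fit(X_attack, y_attack)
        return mean_absolute_error(y_true, model.predict(X_victims))
    # Classification inference: accuracy of the attacker's predictions
    # (lower accuracy means better protection).
    model = RandomForestClassifier(random_state=0).fit(X_attack, y_attack)
    return accuracy_score(y_true, model.predict(X_victims))
```

A large regression error or a low classification accuracy indicates that the processed data does not let an attacker recover the original target values.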

How does this work in practice?

Our anonymization software, Avatar, calculates all of the above metrics and more. Our mission is to generate anonymous data sets with a fully explainable model and concrete privacy metrics that quantify the degree of protection.

To do this, there is a lot to take into consideration, and producing an anonymous dataset should not be taken lightly. There are numerous pitfalls that can lead to accidental information leaks. That is why, in addition to the metrics and the associated privacy guarantees, we produce an anonymization report which clearly describes the various metrics, as well as the evaluation criteria they aim to measure, as described above. The report explains all the metrics in simple terms and presents statistics on the data, before and after anonymization.

In practice, anonymizing a data set is always a compromise between guaranteeing privacy and maintaining utility. A completely random data set is private, but useless.

We'll look at how to measure the utility of a data set, before and after anonymization, in a future article.

Interested in our solution? Contact us!


Editors: Tom Crasset & Olivier Regnier-Coudert

Sign up for our tech newsletter!