The risk of re-identification attempts by an attacker, malicious or not, varies with the context in which anonymized data is used. For example, anonymous data shared as part of an Open Data project is much more exposed than data shared internally with a trusted partner. To optimize the balance between privacy and the preservation of statistical properties, it is necessary to assess the privacy and utility of the generated data before it is made available. However, this assessment is a check carried out after the fact, once the data has been generated. In this article, we focus on the key parameter of the avatar method that influences the result of the privacy metrics upstream: the k parameter.
How it works
The parameter k is used in step b of the avatar method (see Figure 1).
After each individual has been projected into a multidimensional space (a step similar to PCA), the parameter k determines how many nearest neighbors are retrieved for each individual within this projection.
These neighbors define the local modeling space used to generate each avatar in step c. The parameter is called k in reference to the KNN (k-nearest neighbors) algorithm used to identify the neighbors on the basis of a Euclidean distance. It is a numerical value between 2 (the minimum number of neighbors required to create a modeling space) and the number of individuals present in the data set.
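The neighbor search described above can be sketched in a few lines of Python. This is only an illustration, not the implementation used by the avatar method: the function name `k_nearest_neighbors` and the toy coordinates are hypothetical, and we assume that an individual is not counted among its own neighbors, which caps k at n - 1 in this sketch.

```python
import math

def k_nearest_neighbors(points, index, k):
    """Return the indices of the k points closest to points[index] (Euclidean distance)."""
    n = len(points)
    # Assumption: the individual itself is not one of its own neighbors,
    # so the largest usable k in this sketch is n - 1.
    if not 2 <= k <= n - 1:
        raise ValueError(f"k must be between 2 and {n - 1}, got {k}")
    target = points[index]
    others = [i for i in range(n) if i != index]
    # list.sort() is stable, so equidistant points keep their original order.
    others.sort(key=lambda i: math.dist(target, points[i]))
    return others[:k]

# Toy 2D projection of five individuals (hypothetical coordinates).
projected = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.2), (1.0, 1.0), (5.0, 5.0)]
neighbors = k_nearest_neighbors(projected, index=0, k=2)
```

With k = 2, only the two closest points are kept for the first individual. Raising k progressively brings in the more distant points, which is what enlarges the local modeling space.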

The higher the value assigned to the parameter k, the larger the space used to model the avatar data (see Figure 2).
Figure 2: Impact of the choice of the k value on the modeling space (in green) of the avatar of the original individual represented in red.
Since an individual's avatar is generated in a pseudo-random manner within the space defined by its neighbors (see the article "Understanding the core of the avatar method"), the higher the k used, the more the avatar may differ from the original individual.
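One way to picture a pseudo-random draw "within the space defined by the neighbors" is a random convex combination of the neighbors' coordinates, which always falls inside the region they span. The sketch below rests on that assumption: the function `draw_avatar`, the uniform random weights, and the coordinates are all hypothetical, and this article does not describe the weighting actually used by the method.

```python
import random

def draw_avatar(neighbors, rng):
    """Draw one point as a random convex combination of the neighbor coordinates."""
    # Assumption: uniform random weights, normalized so that they sum to 1.
    # The tiny offset only guards against an all-zero draw.
    raw = [rng.random() + 1e-12 for _ in neighbors]
    total = sum(raw)
    weights = [w / total for w in raw]
    dims = len(neighbors[0])
    return tuple(
        sum(w * p[d] for w, p in zip(weights, neighbors))
        for d in range(dims)
    )

rng = random.Random(0)  # fixed seed so that the draw is reproducible
neighbors = [(0.1, 0.0), (0.0, 0.2), (1.0, 1.0)]  # hypothetical k = 3 neighbors
avatar = draw_avatar(neighbors, rng)
```

Because the weights are positive and sum to 1, each coordinate of the avatar stays between the smallest and largest coordinate of its neighbors. Adding more distant neighbors widens that range.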
Impact on the privacy/utility balance
Because increasing the parameter k enlarges the modeling space of an avatar, the higher the k value, the more an avatar can be modeled on individuals who differ from the original individual. Conversely, a low k value preserves similarities with the original individual.
Take the example of a cohort of 50 left-handed individuals and 50 right-handed individuals.
On this dimension, the data will therefore have two clearly defined clusters of 50 individuals each. With a k less than 50, the avatar of a right-handed person will necessarily be right-handed because the neighbors of the original individual are all right-handed. On the other hand, with a k greater than 50, there is a probability that a right-handed individual will become left-handed because some of the neighbors of this individual are left-handed.
There is therefore a positive correlation between the value of k used and the level of privacy of an avatar dataset.
However, privacy is not the only thing this parameter affects: k also influences utility, i.e. the preservation of the statistical properties of a data set.
We have just seen that increasing the parameter k produces avatar data that is more distant from the original individuals. This also means that individuals located on the periphery of a data set (outliers) are pulled more strongly toward its center, so less of the original variance of the data set is conserved.
Let's go back to the example of left- and right-handed people, this time with a cohort of 95 right-handers and 5 left-handers. With a k less than 5, a left-handed individual will necessarily produce a left-handed avatar. As k increases, however, the probability that a left-handed individual will generate a left-handed avatar decreases, which in turn decreases the variance of the data set.
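The loss of variance can be observed on synthetic data. The sketch below is a simplified one-dimensional stand-in, not the avatar method itself: it assumes each avatar is a random convex combination of the k nearest other values, and the function `avatars_1d`, the Gaussian sample, and the two k values are chosen purely for illustration.

```python
import random
import statistics

def avatars_1d(values, k, rng):
    """For each value, draw a random convex combination of its k nearest other values."""
    result = []
    for i, x in enumerate(values):
        others = [v for j, v in enumerate(values) if j != i]
        nearest = sorted(others, key=lambda v: abs(v - x))[:k]
        # Assumption: uniform random weights, normalized by their sum.
        raw = [rng.random() + 1e-12 for _ in nearest]
        total = sum(raw)
        result.append(sum(w * v for w, v in zip(raw, nearest)) / total)
    return result

rng = random.Random(42)  # fixed seed for reproducibility
original = [rng.gauss(0.0, 1.0) for _ in range(200)]  # synthetic data

var_original = statistics.pvariance(original)
var_small_k = statistics.pvariance(avatars_1d(original, k=2, rng=rng))
var_large_k = statistics.pvariance(avatars_1d(original, k=100, rng=rng))
print(var_original, var_small_k, var_large_k)
```

With a small k, each avatar stays close to its original value and the variance is largely conserved. With a large k, the values in the tails are averaged with many more central values, so the variance of the avatar data shrinks.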
In short, the more k increases, the more privacy increases; the more k decreases, the more utility is retained.
These two consequences are detailed more fully in our scientific article, in the section "Impact of Local Model Size on Avatar Generation", where we evaluate the influence of k on privacy metrics as well as on the preservation of the outcomes of the clinical trial and the observational study.
An interesting effect of the parameter k is the method's ability to protect unique individuals. The fact that an individual requires at least two neighbors to generate an avatar is a fundamental characteristic of the method: no individual with a unique characteristic can keep that characteristic once transformed into an avatar.
Take the example of a cohort consisting of 99 right-handers and 1 left-hander. The avatar of the left-handed individual will necessarily be right-handed because all the neighbors of this individual are right-handed. This characteristic makes it possible to protect individuals with unique attributes (outliers) in a systematic and agnostic manner.
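The three handedness cohorts used in this article can be checked with simple arithmetic. The sketch below counts how many of an individual's k nearest neighbors share its handedness. It assumes a single binary dimension, so same-handed individuals are always the closest, and it assumes the individual is not its own neighbor. The helper `neighbor_makeup` is hypothetical and written only for this illustration.

```python
def neighbor_makeup(n_same, n_other, k):
    """Split one individual's k nearest neighbors into (same-handed, opposite-handed)."""
    # On a single binary dimension, same-handed individuals are at distance 0
    # and are therefore selected first. The individual itself is excluded
    # (assumption), hence n_same - 1.
    same = min(k, n_same - 1)
    other = k - same
    if other > n_other:
        raise ValueError("k exceeds the number of available neighbors")
    return same, other

# label -> (size of the individual's own group, size of the other group)
cases = {
    "right-hander in the 50/50 cohort": (50, 50),
    "left-hander in the 95/5 cohort": (5, 95),
    "left-hander in the 99/1 cohort": (1, 99),
}
for label, (n_same, n_other) in cases.items():
    for k in (2, 4, 10, 60):
        same, other = neighbor_makeup(n_same, n_other, k)
        print(f"{label}, k={k}: {same} same-handed, {other} opposite-handed")
```

In the 99/1 cohort, the left-hander has no same-handed neighbor at any value of k, which is why its avatar cannot remain left-handed. In the other two cohorts, opposite-handed neighbors only appear once k exceeds the number of same-handed individuals available.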
Conclusion
The advantage of the parameter k is its ability to protect, in a systematic and agnostic manner, unique individuals.
It makes it possible to adjust the utility/privacy balance to the context in which anonymization is used, while maintaining the explainable and provable nature of the method.
The higher the k value, the more an avatar can be modeled on individuals who differ from the original individual; conversely, a low k value preserves similarities with the original individual.