Understanding the core of the avatar method

Following the publication of a scientific paper on our method, this article explains its core ideas. Note that this is a summary; if you want to go further, do not hesitate to consult our article in Nature Digital Medicine!

In this blog post, we give an accessible overview of our scientific article published in Nature Digital Medicine, which describes the principles of our avatar anonymization method.

The avatar method is a unique approach to generating synthetic data that preserves the structure and statistical relevance of the original data set while respecting the privacy of individuals. This technique uses a patient-centered approach: it creates a local simulation around each individual, so that each avatar is generated from its own neighborhood. Our method is designed to meet the three criteria set out by the European Data Protection Board (EDPB) to assess the robustness of an anonymization process: singling out, linkability and inference.

Compared to other techniques such as decision trees and GANs (Generative Adversarial Networks), the avatar software demonstrates similar utility in maintaining the structure and statistical relevance of the original data set. In addition, the avatar software includes privacy metrics that assess avatar data against the three criteria defined by the EDPB.

Explanation of how the method works

Our method takes original data as input and produces synthetic, anonymous data of the same size and nature: numerical data stays numerical, categorical data stays categorical, and so on. The core of the method is illustrated in the diagram below. We describe it in more detail in the following paragraphs.

a) Multidimensional projection

The original data is projected into an appropriate multi-dimensional space using dimensionality reduction techniques such as factor analysis of mixed data (FAMD), principal component analysis (PCA), or multiple correspondence analysis (MCA). The transformations used must be reversible, that is, there must be an inverse transformation that allows you to return to the original encoding. This step transforms individuals, who are initially described by multiple numerical and categorical characteristics, into structured numerical coordinates that make it easy to calculate distances between individuals. It also reduces the dimensionality of the data set in order to highlight the most relevant information.
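To make the idea of a reversible projection concrete, here is a minimal sketch using a plain PCA on numeric data only, written with NumPy. The function names (fit_pca, project, inverse_project) are hypothetical and are not the avatar software's API; mixed numerical and categorical data would need FAMD or MCA instead.

```python
import numpy as np

def fit_pca(X, n_components):
    """Fit a PCA on numeric data and return what is needed to project and invert."""
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    std[std == 0] = 1.0  # avoid dividing by zero on constant columns
    Z = (X - mean) / std
    # Rows of Vt are the principal axes, ordered by explained variance.
    _, _, Vt = np.linalg.svd(Z, full_matrices=False)
    return mean, std, Vt[:n_components]

def project(X, mean, std, components):
    """Map individuals to coordinates in the reduced space."""
    return ((X - mean) / std) @ components.T

def inverse_project(coords, mean, std, components):
    """Map coordinates back to the original numeric encoding."""
    return (coords @ components) * std + mean

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))  # toy data: 100 individuals, 4 numeric variables
mean, std, components = fit_pca(X, n_components=4)
coords = project(X, mean, std, components)
X_back = inverse_project(coords, mean, std, components)
```

With all components kept, the round trip recovers the input up to floating-point error. Keeping fewer components reduces dimensionality, and the inverse then returns an approximation of the original values.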

b) Calculation of k-neighbors

The distances between all the points in this space are then calculated in order to run a K-nearest neighbors (KNN) algorithm. This defines a local area around each coordinate (each coordinate being the projection of an individual from the original data), delimited by its closest neighbors.
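As an illustration, the neighbor search can be sketched with a brute-force distance matrix in NumPy. The function name k_nearest_neighbors is hypothetical, the Euclidean distance is an assumption, and so is the choice to exclude a point from its own neighbor list.

```python
import numpy as np

def k_nearest_neighbors(coords, k):
    """Return (indices, distances) of the k nearest neighbors of each point."""
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))  # pairwise Euclidean distances
    np.fill_diagonal(dist, np.inf)            # assumption: a point is not its own neighbor
    idx = np.argsort(dist, axis=1)[:, :k]     # k closest points, nearest first
    return idx, np.take_along_axis(dist, idx, axis=1)

coords = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [5.0, 5.0]])
idx, d = k_nearest_neighbors(coords, k=2)
```

The full distance matrix needs memory proportional to the square of the number of individuals, so a tree-based or approximate search would be a more realistic choice for large data sets.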

c) Random generation of local avatar data

For each of these local areas, a single simulation is drawn pseudo-randomly, creating a new coordinate within the area, which we call the avatar of the original coordinate. This simulation is influenced by the distance between the original point and each of its neighbors, by a random weight following an exponential distribution, and by a random contribution factor for each neighbor. Because the simulation is non-deterministic, the process is irreversible, which is a necessary condition for maintaining privacy.
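The following sketch shows one way the three ingredients named above could be combined into a weighted average of the neighbors. The exact formulas are assumptions made for illustration: the distance term is taken as an inverse distance, the exponential weight uses a rate of 1, and the contribution factor is 1/2, 1/4, ... assigned to the neighbors in a random order. The function name generate_avatars is hypothetical.

```python
import numpy as np

def generate_avatars(coords, k, rng):
    """Draw one avatar per point as a random weighted average of its k neighbors."""
    n = coords.shape[0]
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    np.fill_diagonal(dist, np.inf)  # assumption: a point is not its own neighbor
    idx = np.argsort(dist, axis=1)[:, :k]
    nn_dist = np.take_along_axis(dist, idx, axis=1)

    # Distance term (assumed form): closer neighbors weigh more.
    inv_dist = 1.0 / np.maximum(nn_dist, 1e-12)
    # Random weight following an exponential distribution (rate 1 assumed).
    rand_w = rng.exponential(scale=1.0, size=(n, k))
    # Random contribution factor (assumed form): 1/2, 1/4, ... in shuffled order.
    order = np.argsort(rng.random((n, k)), axis=1)
    contrib = 0.5 ** (order + 1)

    w = inv_dist * rand_w * contrib
    w /= w.sum(axis=1, keepdims=True)  # normalize so the weights sum to 1
    return (w[:, :, None] * coords[idx]).sum(axis=1)

coords = np.random.default_rng(0).normal(size=(50, 3))
a1 = generate_avatars(coords, k=5, rng=np.random.default_rng(1))
a2 = generate_avatars(coords, k=5, rng=np.random.default_rng(2))
```

Because the weights are non-negative and sum to 1, each avatar lies inside the area spanned by the neighbors of its original point, and two runs with different random draws produce different avatars.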

d) Inversion of the transformation to return to the original encoding

Once an avatar has been generated for each individual, the inverse transformation is applied to the avatar coordinates to return to the original encoding, maintaining the type of the original attributes (categorical, numerical, etc.). Although the original data cannot be retrieved from the avatar data, the structure of the data set is preserved.
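As a rough illustration of how attribute types might be restored after the inverse transformation, the sketch below rounds integer columns and picks the highest-scoring category in each one-hot group. This decoding scheme, the function name restore_types and the column names are all assumptions made for the example, not a description of the avatar software.

```python
import numpy as np

def restore_types(decoded, numeric_cols, integer_cols, categorical_groups):
    """Turn a float matrix (after inverse projection) back into typed columns."""
    out = {}
    for name, col in numeric_cols.items():
        out[name] = decoded[:, col]                       # continuous values stay as floats
    for name, col in integer_cols.items():
        out[name] = np.rint(decoded[:, col]).astype(int)  # round to the nearest integer
    for name, (cols, labels) in categorical_groups.items():
        winners = decoded[:, cols].argmax(axis=1)         # highest score wins
        out[name] = np.array(labels)[winners]
    return out

# Hypothetical decoded avatars: age, height, and two one-hot columns for "smoker".
decoded = np.array([[36.7, 1.82, 0.9, 0.1],
                    [51.2, 1.64, 0.3, 0.8]])
table = restore_types(
    decoded,
    numeric_cols={"height_m": 1},
    integer_cols={"age": 0},
    categorical_groups={"smoker": ([2, 3], ["no", "yes"])},
)
```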

e) Calculation of privacy protection parameters

The European Data Protection Board (EDPB) has defined three criteria that must be met for a dataset to be considered anonymous: singling out, linkability and inference. The avatar software includes metrics to measure the privacy of avatar data based on these three criteria.
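To give an intuition of what a distance-based privacy check can look like, here is a purely illustrative sketch. It computes the share of individuals whose closest avatar is not the avatar generated from them, which relates to how hard it is to link an avatar back to its source. The function name and the metric itself are assumptions for the example and should not be read as the metrics implemented in the avatar software.

```python
import numpy as np

def share_not_closest_to_own_avatar(original, avatars):
    """Fraction of individuals whose nearest avatar is NOT their own avatar.

    Row i of `avatars` is assumed to be the avatar generated from row i of `original`.
    """
    diff = original[:, None, :] - avatars[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    closest = dist.argmin(axis=1)
    return float((closest != np.arange(len(original))).mean())

original = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
no_protection = share_not_closest_to_own_avatar(original, original.copy())
swapped = share_not_closest_to_own_avatar(original, original[[1, 0, 2]])
```

A value of 0 means every individual is trivially matched to its own avatar; higher values mean a distance-based attacker would more often pick the wrong avatar.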

To learn more about our privacy metrics, see our dedicated article.

Conclusion

The patient-centered nature of the avatar method allows the calculation of privacy metrics that address the EDPB criteria while maintaining a high level of signal preservation. Its explainable approach allows data to be shared without compromising the privacy of individuals, making it valuable software for generating anonymous synthetic data for research purposes.

Do not hesitate to read our scientific article or watch the replay of our tech webinar for more information on the avatar method.

Editors: Gaël Russeil & Alban-Félix Barreteau
