The avatar method uses an individual-centered approach. Each original observation generates a local stochastic simulation leading to its avatar.
The objective of this method is to create an avatar (a new, anonymous individual) for each original individual, in order to protect that individual's personal, and potentially identifying, information.
Consider a sensitive data set of size $(n, p)$, where $n$ is the number of individuals and $p$ the number of variables.
Individuals are projected into a multidimensional numerical space. We can represent each individual $X_i$ by their coordinates:
$$\begin{aligned}
X_1 &(x_{11},\, x_{12},\, \ldots,\, x_{1p})\\
X_2 &(x_{21},\, x_{22},\, \ldots,\, x_{2p})\\
&\;\vdots\\
X_n &(x_{n1},\, x_{n2},\, \ldots,\, x_{np})
\end{aligned}$$
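As an illustration, such a data set can be represented as an $(n, p)$ NumPy array; the data here is randomly generated and the variable names are our own:

```python
import numpy as np

rng = np.random.default_rng(0)

n, p = 100, 4                 # n individuals, p variables
X = rng.normal(size=(n, p))   # X[i] holds the coordinates (x_i1, ..., x_ip)

print(X.shape)  # (100, 4)
```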
For each individual $i$, one can identify its $k$ nearest neighbors, ordered from nearest to farthest, $V_i = (V_{i,0}, \ldots, V_{i,k})$, where each $V_{i,j}$ is an individual of $X$. For each $i$, we introduce the function $\phi_i$ which associates with each element $j$ from $0$ to $k$ the index of $V_{i,j}$ in the matrix $X$, so that $X_{\phi_i(j)} = V_{i,j}$.
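Continuing from the array `X` above, here is a minimal sketch of the neighbor search and of the index function $\phi_i$, using scikit-learn's `NearestNeighbors` (an illustrative choice on our part, not necessarily the reference implementation):

```python
from sklearn.neighbors import NearestNeighbors

k = 5  # avatarization parameter: number of neighbors

# k + 1 neighbors because indexing runs from j = 0 (the point itself,
# which is its own nearest neighbor) to j = k.
nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
_, phi = nn.kneighbors(X)  # phi has shape (n, k + 1)

# phi[i, j] plays the role of phi_i(j): the index in X of V_{i,j},
# the j-th nearest neighbor of X[i], so X[phi[i, j]] is V_{i,j}.
```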
During avatarization, we then obtain for each $X_i$ a new point $X_i'$ such that:
$$X_i' = \sum_{j=0}^{k} w_{i,j}\, X_{\phi_i(j)}$$
Where:
- $\phi_i$ is the index function defined just above,
- $k$ is the parameter of the avatarization method, defining the number of neighbors to consider for each point,
- $X_{\phi_i(j)}$ is the $j$-th nearest neighbor of individual $X_i$,
- $w_{i,j}$ are the weights of the $k$ nearest neighbors of individual $X_i$,
- $X_i'$ is the coordinate vector of the avatar of individual $X_i$.
Each $w_{i,j}$ can be calculated as:
$$w_{i,j} = \frac{P_j}{\sum_{t=0}^{k} P_t}$$
Each $P_t$ represents the raw weight assigned to the $t$-th neighbor of the point being processed. Each neighbor contributes differently to the new value of the point, depending on its distance to that point and on other parameters: distant neighbors contribute less than nearby ones. For more information, see [1].
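Putting the formulas together, here is a minimal sketch of the weighted aggregation, continuing from `X`, `phi`, and `k` above. The choice of $P_t$ as a rank-based exponential decay is our illustrative assumption; the actual weighting scheme is described in [1]:

```python
# Raw contributions P_t, decreasing with neighbor rank (illustrative choice;
# the real method derives them from distances and other parameters, see [1]).
t = np.arange(k + 1)
P = np.exp(-t)    # P_0 > P_1 > ... > P_k
W = P / P.sum()   # w_{i,j} = P_j / sum_t P_t

# Avatar X'_i: weighted sum of the k + 1 nearest neighbors of X_i.
# X[phi] has shape (n, k + 1, p); broadcasting applies the weights per rank.
X_avatar = (W[None, :, None] * X[phi]).sum(axis=1)

print(X_avatar.shape)  # (100, 4)
```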

Removing outliers
Outliers are observations that differ significantly from the rest of the data. They can bias learning models or other statistical models and reduce their accuracy. At the same time, in order to protect the most atypical individuals, avatarization tends to pull points back toward the mass of individuals: this transformation can be seen as an operation that eliminates outliers. By eliminating outliers, avatarization could therefore improve model performance: cleaner, more consistent data could allow models to generalize better and perform better.
Demonstration
We model the data set as the sum of the underlying data $X$ and a noise term $N$ that follows a normal distribution.

We consider a data set in which random noise has been added to the original values; this noise may come from measurement errors, environmental variations, or collection anomalies. The actual data follow some underlying distribution, and when noise with a large standard deviation is added to it, the resulting data set is likely to contain outliers.
Thus, each point in space can be expressed in the form:
$$Y_i = X_i + N_i$$
with $N_i$ a random variable that follows a normal distribution, $N_i \sim \mathcal{N}(0, \sigma^2)$. The larger $\sigma^2$ is, the more outliers the data set presents.
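As a minimal, self-contained sketch, such a noisy data set can be simulated as follows (the clean signal, $n$, $p$, and $\sigma$ are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(1)

n, p = 1000, 4
sigma = 2.0                               # large sigma -> more outliers

X_clean = rng.normal(size=(n, p))         # underlying data X_i
N = rng.normal(0.0, sigma, size=(n, p))   # noise N_i ~ N(0, sigma^2)
Y = X_clean + N                           # observed points Y_i = X_i + N_i
```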
By applying avatarization, each $Y_i$ is transformed into a point $Z_i$ as follows:
$$\begin{aligned}
Z_i &= \sum_{j=0}^{k} w_{i,j}\, Y_{\phi_i(j)}\\
&= \sum_{j=0}^{k} w_{i,j}\, X_{\phi_i(j)} + \sum_{j=0}^{k} w_{i,j}\, n_{i,j}
\end{aligned}$$
where $n_{i,j}$ denotes the noise carried by $Y_{\phi_i(j)}$, the $j$-th nearest neighbor of $Y_i$. The first term, the weighted sum of the $X_{\phi_i(j)}$ with weights $w_{i,j}$, is the avatarized point of the original data, i.e. the result of the avatarization of the original points alone.
As avatarization is a transformation that homogenizes information, the remaining source of noise or outliers is the second term, the avatar noise $N' = \sum_{j=0}^{k} w_{i,j}\, n_{i,j}$.
Hence the need to study this term and its impact on the new avatar.
Analysis of the variance of the term $N'$:
$$\begin{aligned}
V\left(\sum_{j=0}^{k} w_{i,j}\, n_{i,j}\right) &= \sum_{j=0}^{k} V\left(w_{i,j}\, n_{i,j}\right) && \text{(the $n_{i,j}$ are independent random variables)}\\
&= \sigma^2 \sum_{j=0}^{k} w_{i,j}^2 && \text{(each $n_{i,j} \sim \mathcal{N}(0, \sigma^2)$)}
\end{aligned}$$
Now:
$$\begin{aligned}
\sum_{j=0}^{k} w_{i,j}^2 &= \sum_{j=0}^{k} \left(\frac{P_j}{\sum_{t=0}^{k} P_t}\right)^2\\
&= \frac{\sum_{j=0}^{k} P_j^2}{\left(\sum_{t=0}^{k} P_t\right)^2} < 1
\end{aligned}$$
Since all the $P_j$ are strictly positive, expanding the squared sum in the denominator produces positive cross terms in addition to the $P_j^2$, which gives the strict inequality. We then have:
$$
\sigma^2 \sum_{j=0}^{k} w_{i,j}^2 < \sigma^2
$$
Hence
$$
V\left(\sum_{j=0}^{k} w_{i,j}\, n_{i,j}\right) < V(N_i)
$$
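For instance, with uniform weights (all $P_t$ equal), the reduction is explicit:
$$w_{i,j} = \frac{1}{k+1} \quad\Longrightarrow\quad \sigma^2 \sum_{j=0}^{k} w_{i,j}^2 = \frac{\sigma^2}{k+1}$$
so the variance of the avatar noise is divided by $k+1$.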
Thus, with avatarization, the variance of the noise on the anonymized values is lower than the variance of the noise on the original values. The noise in the data set is reduced and, as a result, outliers are effectively attenuated.
By attenuating outliers, we therefore improve the accuracy of learning or statistical models: extreme values no longer bias the results, which ensures more reliable predictions and better generalization to new data. As a result, the models are more robust and efficient.
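As a quick numerical sanity check of the inequality above, we can reuse the noisy data set and the illustrative rank-based weighting scheme sketched earlier (again, our assumptions, not the reference implementation):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

k = 5
nn = NearestNeighbors(n_neighbors=k + 1).fit(Y)
_, phi = nn.kneighbors(Y)

t = np.arange(k + 1)
P = np.exp(-t)
W = P / P.sum()

Z = (W[None, :, None] * Y[phi]).sum(axis=1)            # avatars of noisy points
X_av = (W[None, :, None] * X_clean[phi]).sum(axis=1)   # avatars of clean points

# N' = Z - X_av = sum_j w_{i,j} n_{i,j}; its variance should be well below
# sigma^2 = 4.0, the variance of the original noise.
print(np.var(Z - X_av), sigma**2)  # e.g. ~1.9 < 4.0
```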
Conclusion
In this article, we showed how avatarization can serve as a pre-processing step that reduces the noise of the training data and thus improves the convergence of the model to be trained.
Note: noise has been defined as following a normal distribution in this article, which is a potentially simplistic model of reality.
Resources:
[1] https://www.nature.com/articles/s41746-023-00771-5
Written by: Karl Saliba, Julien Petot & Gaël Russeil