Machine learning has revolutionized the way computers perform tasks by allowing them to learn from data without being explicitly programmed. However, accessing data for machine learning is complex and time consuming, owing to privacy concerns and regulatory frameworks. In addition, a model trained on personal data always carries a risk of re-identifying the individuals behind that data if the model is made public. To overcome these obstacles, Octopize developed Avatar, a data anonymization software that allows health data to be used to train models without compromising patient privacy. By striking the right balance between privacy protection and data accessibility, researchers can harness health data to advance health research while maintaining patient trust.
Machine learning
Machine learning is a powerful approach that teaches computers to perform tasks by learning from data, without being explicitly programmed. It is based on the idea that machines can automatically improve their performance on a task over time by analyzing and recognizing patterns in large amounts of data. During training, algorithms learn from labelled data to identify patterns and make predictions or decisions. This technology has applications in many fields, including image and speech recognition, recommendation systems, healthcare, and finance. While the concept may seem complex, machine learning simplifies our tasks by allowing computers to learn and make intelligent decisions, which in turn advances technology and improves our daily lives.
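As a minimal illustration of this training process, the sketch below fits a classifier on a small labelled data set using scikit-learn. The synthetic data and the choice of logistic regression are illustrative assumptions only, not specific to any health application:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Generate a small labelled data set (features X, labels y).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Training: the algorithm learns patterns from the labelled examples.
model = LogisticRegression()
model.fit(X_train, y_train)

# The trained model can then make predictions on data it has never seen.
acc = model.score(X_test, y_test)
```

The key point is the split: the model's quality is judged on held-out data it did not learn from, which is how "improving performance on a task" is measured in practice.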
Difficulties
However, it is rarely easy to access the large amounts of data needed to train such models. Two main obstacles stand out: the difficulty of accessing data and the risk to the privacy of the individuals concerned.
Access that takes time
Accessing data for machine learning can be a long and complex process.
For so-called sensitive health data in particular, this means obtaining the necessary authorizations, ensuring compliance with privacy regulations (such as HIPAA or the GDPR), and establishing secure data-sharing agreements with the organizations and institutions concerned. Researchers and data scientists often have to go through comprehensive review processes and seek approval from ethics committees or institutional review boards before they can access and use health data. These committees assess the research goals, the potential risks to patient privacy, and the adequacy of the security measures before granting access. This review process can take a long time, sometimes months or more, depending on the complexity and sensitivity of the data.
In addition to privacy concerns and regulatory frameworks, data dispersion can further complicate data access and integration. Health data is generally scattered across multiple healthcare facilities, making it difficult to gather a comprehensive data set to train machine learning models. Data-sharing agreements need to be in place with each organization, and compatibility and interoperability issues can arise due to the different data formats, systems, and protocols used, which can further extend the time required before data access is obtained.
Overall, accessing health data for machine learning training faces significant barriers due to privacy concerns, regulatory frameworks, and the complexity of data integration. These challenges can lead to time-consuming processes and delays in obtaining access to health data.
Threat to patient privacy
Another major barrier relates to the privacy concerns associated with health data. Training a model on sensitive patient information carries the risk of re-identifying individuals: an attacker can, among other things, use the model's predictions to infer whether a given record was part of the training set. This is known as a membership inference attack [1]. Even if the data itself is not available, the model has encoded information about it, and without adequate protection against such attacks there remains an underlying risk of extracting information about the training data through the model.
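The intuition behind such an attack can be sketched as follows: an overfitted model tends to be more confident on the records it was trained on than on unseen records, and a naive attacker can exploit that gap. The model, threshold, and data below are illustrative assumptions, far simpler than the attack of [1]:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Train a model that overfits its training records.
X, y = make_classification(n_samples=400, n_features=20, random_state=1)
X_in, X_out, y_in, y_out = train_test_split(X, y, test_size=0.5, random_state=1)
model = RandomForestClassifier(n_estimators=100, random_state=1).fit(X_in, y_in)

def confidence(model, X, y):
    # Probability the model assigns to the true label of each record.
    proba = model.predict_proba(X)
    return proba[np.arange(len(y)), y]

conf_in = confidence(model, X_in, y_in)    # records seen during training
conf_out = confidence(model, X_out, y_out) # records never seen

# A naive attacker guesses "member of the training set" whenever the
# model's confidence exceeds a fixed threshold.
guess_in = conf_in > 0.9
guess_out = conf_out > 0.9
attack_accuracy = (guess_in.sum() + (~guess_out).sum()) / (len(conf_in) + len(conf_out))
```

The confidence gap between seen and unseen records is exactly the leakage the attack exploits: the model alone, without the data, reveals something about who was in the training set.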
Anonymization as a solution
To overcome these difficulties and speed up the process of accessing medical data for the creation of machine learning models, anonymization techniques can be used.
There is no reason to risk re-identifying patients if the same performance can be achieved with anonymized data as with the original data. [2]
When properly applied, anonymization can significantly mitigate the risk of re-identification and speed up data access, since many of the usual procedures no longer apply once the data can no longer be linked to real patients.
Anonymization is the process of altering or removing identifying information from a data set in order to ensure patient confidentiality. At Octopize, we have developed Avatar, a unique anonymization software that protects the privacy of individuals while preserving the quality and usefulness of the data.
We provide unique metrics to demonstrate compliance with European anonymization requirements (EDPS opinion of 10/2014), together with measures showing that most of the data's usefulness is retained after the transformation.
In addition, we have empirically validated that machine learning models trained on anonymized synthetic data have the same predictive power as models trained on the original data, especially in the health sector [2][3]. This is illustrated in the figure below [4].
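The comparison protocol behind such a validation can be sketched as follows: train one model on the original data and one on its anonymized counterpart, then score both on the same held-out test set. Since the avatar method itself is not reproduced here, a deliberately crude noise-addition step stands in for the anonymization; it is a placeholder only, not the actual technique:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Hypothetical stand-in for an anonymization step; the avatar method
# is far more sophisticated than adding Gaussian noise.
rng = np.random.default_rng(0)
X_anon = X_train + rng.normal(scale=0.1, size=X_train.shape)

# Same model, same test set, two training sets: original vs. anonymized.
acc_original = LogisticRegression().fit(X_train, y_train).score(X_test, y_test)
acc_anonymized = LogisticRegression().fit(X_anon, y_train).score(X_test, y_test)
```

The claim being tested is that `acc_anonymized` stays close to `acc_original`; in the published validations [2][3] this comparison is performed with real health data and the actual avatar transformation.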

Conclusion
In conclusion, accessing health data for machine learning poses significant problems due to privacy concerns and regulatory frameworks, and a trained model itself carries a risk of violating individuals' privacy. The use of anonymization techniques such as the avatar method ensures compliance with privacy regulations by design, because anonymized data is no longer linked to any individual; any subsequent attack would target the anonymized data rather than the original data. Overall, researchers can save valuable months by harnessing health data to advance healthcare research while maintaining privacy and ethical standards and preserving public trust.
References:
[1] Shokri et al., Membership Inference Attacks against Machine Learning Models. https://doi.org/10.48550/arXiv.1610.05820
[2] Guillaudeux et al., Patient-centric synthetic data generation, no reason to risk re-identification in biomedical data analysis. https://doi.org/10.1038/s41746-023-00771-5
[3] Bennis et al., Application of a novel Anonymization Method for Electrocardiogram Data. https://doi.org/10.1145/3485557.3485581
[4] Barreteau et al., Generation of anonymous signals from non-anonymous data using a local linear mixture model, GRETSI 2023 (available soon)