De-identification of health data
Our end-to-end, turnkey, fully customizable software for de-identifying health data tackles these challenges effectively with the power of natural language processing (NLP).
Introduction
As organizations produce and store ever more personal data, data privacy is becoming a growing priority. Data supports cutting-edge research, drives innovation and helps in developing solutions to real-world problems. This is particularly true in the health sector.
The right type, amount and quality of digitized data gives healthcare professionals critical information about patients, makes communication with patients more effective and efficient, speeds up diagnosis, and makes it possible to provide better and more efficient care. Medical researchers and healthcare organizations use the data to develop new drugs, treatments and vaccines, to identify disease risk factors, to prevent or manage epidemics, and to spread knowledge that improves public health and extends human longevity.
However, using these data can compromise the privacy of the people they belong to. In recent decades, laws such as HIPAA have evolved to protect people's privacy in the United States. Other jurisdictions have developed their own data privacy laws, including Canada, Australia and the EU (GDPR). Healthcare organizations, professionals and researchers in all of these jurisdictions must comply with these regulations, both to show that they take patient privacy seriously and to avoid the financial and legal consequences of non-compliance. To achieve compliance, they must remove patients' protected health information (PHI) from medical data. De-identification is also a requirement for organizations that want to train machine learning models to analyze or process patient-level data for research or other purposes.
What is de-identification?
De-identification is a technique for removing from a data set any data that could identify a person. It protects personal information that identifies an individual or a company by eliminating all personally identifying information, so that the data cannot be traced back to the person they belong to. Personal identifiers include:
- First and last name
- Geographic data such as address, city and postal code
- Dates directly related to an individual, such as date of birth, date of discharge, date of death, etc.
- Tax code
- Social insurance number
- Phone numbers
- Email addresses
- Medical record numbers
- Biometric identifiers, including fingerprints and voiceprints
- Recognizable photographs
- IP addresses used to access the system
- Bank account numbers (IBAN)
De-identification is sometimes used interchangeably with anonymization, although the two differ:
- De-identification involves removing explicit personal identifiers, for example by replacing real names with generic fictitious aliases, so that a patient cannot be unambiguously identified from the de-identified data alone.
- Anonymization focuses on ensuring that the data can no longer be traced back to the individual at all. A numeric code uniquely associated with a person is often cited as a classic example, but as long as an association table makes it possible to go from the code back to the patient, the process is reversible and is more precisely described as pseudonymization.
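The alias and numeric-code replacement described in the list above can be sketched roughly as follows. This is only an illustration under stated assumptions: the `Pseudonymizer` class, the `PATIENT-0001` code format and the sample records are hypothetical and do not describe any particular product.

```python
import itertools


class Pseudonymizer:
    """Hypothetical helper: maps each real identifier to a stable code."""

    def __init__(self):
        self._codes = {}                      # real value -> code
        self._counter = itertools.count(1)    # source of new numeric codes

    def code_for(self, value):
        # Reuse the existing code so the same patient always gets the same alias.
        if value not in self._codes:
            self._codes[value] = f"PATIENT-{next(self._counter):04d}"
        return self._codes[value]

    def association_table(self):
        # Code -> real value. Whoever holds this table can re-identify patients.
        return {code: value for value, code in self._codes.items()}


# Invented sample records, for illustration only.
records = [
    {"name": "Maria Rossi", "diagnosis": "hypertension"},
    {"name": "John Smith", "diagnosis": "asthma"},
    {"name": "Maria Rossi", "diagnosis": "type 2 diabetes"},
]

pseudonymizer = Pseudonymizer()
deidentified = [
    {"patient_code": pseudonymizer.code_for(r["name"]), "diagnosis": r["diagnosis"]}
    for r in records
]

for row in deidentified:
    print(row)
```

Because the association table allows a code to be mapped back to a patient, it would need to be stored separately from the de-identified data and protected accordingly.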
The importance of de-identification in the context of health care
In today's world, where privacy scandals are frequent, it is necessary to understand the importance of de-identifying health data. De-identification ensures that individuals' data are not disclosed to third parties or released inappropriately, which limits potential harm to privacy and the risk of violating the GDPR.
De-identification has become increasingly popular since the GDPR (General Data Protection Regulation) came into force. Even so, there have been many cases in which people's personal health information was compromised without their knowledge (a personal data breach) or without their consent, owing to inadequate security measures at health-related companies.
A traditional approach to de-identification of health data
Traditionally, healthcare organizations and researchers used manual methods to anonymize patient data and prepare them for further processing and analysis. This meant recruiting a team of people to go through each document page by page and line by line, looking for any personal identifier that could identify an individual, such as name, address or phone numbers, and then manually removing those identifiers to produce de-identified health information.
The limit of this approach is that it relies on the human eye and human attention to detail. Since human beings are fallible, a reviewer may miss one or more personal identifiers and wrongly approve a document as de-identified and suitable for further processing and analysis by machine learning models. This not only puts individuals' privacy at risk, but also increases the risk of non-compliance with the GDPR, which can create serious legal and financial problems for the organization concerned if the data protection authority imposes sanctions.
Another limit of manual anonymization of health data is its slowness. The health sector worldwide produces billions of clinical documents, and the number grows year after year. Each healthcare organization produces hundreds of thousands, if not millions, of clinical documents every year, and more are created every day. With the traditional approach it is impossible to review and anonymize all the data in these documents quickly enough for research, effectiveness studies, evaluations, policy making and other use cases.
Alternative approaches to de-identification
Most of the data collected in various contexts are stored in relational or non-relational databases, which usually hold different types of data, recorded as structured and unstructured data. Unstructured data are generally stored in their native format (for example PDF documents or images in DICOM format), while structured data are clearly defined, coded and searchable. Because the data types differ, the de-identification process must be applied differently to each one, and it can range from simple obfuscation or encryption to more complex processes such as hashing or masking. For text, de-identification takes the form of a named entity recognition (NER) task in NLP, and the approaches can be divided into the following three categories:
- Rule-based approach: uses rules, pattern matching and dictionaries to anonymize text documents. Although this approach requires a lot of domain expertise and can be difficult to maintain as the data drift, it is quite explainable.
- Model-based approach: uses machine learning models to anonymize text, addressing the lack of resilience of rule-based systems. This approach generalizes better, with higher accuracy and a better grasp of context.
- Hybrid approach: a pragmatic balance between the two previous approaches, and the recommended one. Recent developments in deep learning and NLP have allowed systems to obtain better results, particularly in named entity recognition.
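To make the three categories more concrete, here is a minimal sketch of a hybrid pipeline: a few regular-expression rules are combined with spans coming from a model. Everything in it is an assumption made for illustration. The patterns cover only a handful of formats, the sample note is invented, and `stub_model_spans` is a stand-in that looks up a fixed name where a trained NER model would normally be called.

```python
import re

# Illustrative rules only; real identifiers need far broader coverage.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "DATE": re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b"),
    "PHONE": re.compile(r"\+?\d[\d\s-]{7,}\d"),
}


def rule_spans(text):
    """Rule-based part: return (start, end, label) for every pattern match."""
    spans = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            spans.append((match.start(), match.end(), label))
    return spans


def stub_model_spans(text):
    """Stand-in for a trained NER model (hypothetical): finds one known name."""
    spans = []
    for name in ("Maria Rossi",):
        for match in re.finditer(re.escape(name), text):
            spans.append((match.start(), match.end(), "NAME"))
    return spans


def merge_spans(spans):
    """Sort by start (longest first on ties) and drop overlapping spans."""
    merged = []
    for start, end, label in sorted(spans, key=lambda s: (s[0], -(s[1] - s[0]))):
        if not merged or start >= merged[-1][1]:
            merged.append((start, end, label))
    return merged


def deidentify(text, model=stub_model_spans):
    """Hybrid part: union of rule and model spans, replaced by [LABEL] tags."""
    spans = merge_spans(rule_spans(text) + model(text))
    # Replace from right to left so earlier offsets stay valid.
    for start, end, label in reversed(spans):
        text = text[:start] + f"[{label}]" + text[end:]
    return text


note = ("Patient Maria Rossi, born 12/03/1980, phone +39 333 1234567, "
        "email maria.rossi@example.com.")
print(deidentify(note))
```

Passing a different callable as `model` is how a real NER model could be plugged in, as long as it returns the same `(start, end, label)` tuples. The rules stay easy to audit, while the model is expected to catch identifiers, such as names, that fixed patterns miss.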