De-identification of health data

Our end-to-end, turnkey, fully customizable software for de-identifying health data tackles these challenges effectively with the power of natural language processing (NLP).

Introduction

As organizations produce and store ever more personal data, data privacy is becoming a growing priority. Data support cutting-edge research, drive innovation and help develop solutions to real-world problems. This is particularly true in the health sector.

The right type, amount and quality of digitized data give healthcare professionals critical information about patients, enable more effective and efficient communication with them, speed up diagnosis and make it possible to provide better, more efficient care. Medical researchers and healthcare organizations use the data to develop new drugs, treatments and vaccines, to identify disease risk factors, to prevent or manage epidemics, and to spread knowledge that improves public health and extends human longevity.

However, using these data can compromise the privacy of the people they belong to. In recent decades, laws such as HIPAA have evolved to protect people's privacy in the United States. Other jurisdictions, including Canada, Australia and the EU (GDPR), have developed their own data privacy laws. Healthcare organizations, professionals and researchers in all of these jurisdictions must comply with these regulations to show that they take patient privacy seriously and to avoid the financial and legal consequences of non-compliance. To achieve compliance, they must remove patients' identifying information, known as protected health information (PHI), from medical data. De-identification is also a requirement for organizations that want to train machine learning models to analyze or process patient-level data for research or other purposes.

What is de-identification?

De-identification is a technique for removing from a data set any data that could identify a person. It protects the personal information of an individual or a company by eliminating all personally identifying information, so that the data cannot be traced back to the person they belong to. Personal identifiers include:

First name and surname
Geographic data such as address, city and postal code
Dates directly related to an individual, such as date of birth, date of discharge, date of death, etc.
Tax code
Social insurance number
Phone numbers
Email addresses
Medical record numbers
Biometric identifiers, including fingerprints and voiceprints
Recognizable photographs
IP addresses used to access the system
Bank account numbers (IBAN)

De-identification is sometimes used interchangeably with anonymization, although the two differ:

De-identification involves removing explicit personal identifiers, for example by replacing real names with generic fictitious aliases, so that the patient cannot be unambiguously identified from the de-identified data.
Anonymization focuses on ensuring that the data cannot be traced back to the individual at all. A numeric code uniquely associated with a person is a classic borderline case: whoever holds the association table can go from the code back to the patient, so the data are anonymous only for those who have no access to that table.
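
The numeric code and association table mentioned above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class name `PseudonymTable`, its methods and the `P-` code format are hypothetical and are not part of the software described on this page.

```python
import secrets


class PseudonymTable:
    """Hypothetical association table between patient identifiers and codes."""

    def __init__(self):
        self._forward = {}  # patient identifier -> code
        self._reverse = {}  # code -> patient identifier

    def code_for(self, patient_id: str) -> str:
        """Return a stable random code for the patient, creating it if needed."""
        if patient_id not in self._forward:
            code = "P-" + secrets.token_hex(4)
            while code in self._reverse:  # extremely unlikely collision
                code = "P-" + secrets.token_hex(4)
            self._forward[patient_id] = code
            self._reverse[code] = patient_id
        return self._forward[patient_id]

    def reidentify(self, code: str) -> str:
        """Only possible for whoever holds the table."""
        return self._reverse[code]


table = PseudonymTable()
code = table.code_for("patient-0001")  # made-up identifier
```

Anyone who receives only the codes cannot go back to the patient, while the holder of the table can. Destroying the table removes that path, although other fields in the record may still allow re-identification.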

The importance of de-identification in the context of health care

In today's world, with its many privacy scandals, it is essential to understand the importance of de-identifying health data. De-identification helps ensure that individuals' data are not disclosed to third parties or otherwise disclosed inappropriately, limiting potential harm to privacy and the risk of GDPR violations.

De-identification has become increasingly widespread since the GDPR (General Data Protection Regulation) came into force. Even so, there are many cases in which people's personal health information has been compromised without their knowledge (through a personal data breach) or without their consent, because health-related companies lacked adequate security measures.

A traditional approach to de-identification of health data

Traditionally, healthcare organizations and researchers used manual methods to anonymize patient data and prepare them for further processing and analysis. This meant recruiting a team of people who went through each document page by page and line by line, looked for any personal identifiers that could identify an individual, such as name, address or phone number, and finally removed them by hand to produce de-identified health information.

The limitation of this approach is that it relies on the human eye and human attention to detail. Since human beings are fallible, a reviewer may miss one or more personal identifiers and wrongly approve a document as de-identified and suitable for further processing and analysis with machine learning models. This not only puts individuals' privacy at risk but also increases the risk of non-compliance with the GDPR, which can create serious legal and financial problems for the organization if the data protection authority (the Garante) imposes sanctions.

Another limitation of manual anonymization of health data is its slowness. Worldwide, the health sector produces billions of clinical documents, and the number grows year after year. Every healthcare organization generates hundreds of thousands, if not millions, of clinical documents each year, and more are created every day. With the traditional approach it is impossible to review and anonymize quickly all the data in these documents for research, effectiveness studies, evaluations, policy-making and other use cases.

Alternative approaches to de-identification

Most data collected in various contexts are stored in relational or non-relational databases, which usually hold different types of data, both structured and unstructured. Unstructured data are generally stored in their native format (for example PDF documents or images in DICOM format), while structured data are clearly defined, coded and searchable. Because the data types differ, de-identification must be applied differently to each, ranging from simple obfuscation or encryption to more complex processes such as hashing or masking. For text, de-identification takes the form of named entity recognition (NER) in NLP, and approaches fall into the following three categories:

Rule-based approach: uses rules, pattern matching and dictionaries to anonymize text documents. Although this approach requires considerable domain expertise and can be difficult to maintain as the data drift, it is quite explainable.
Model-based approach: researchers use machine learning algorithms to address the lack of resilience of rule-based systems, applying ML models to anonymize text. This approach generalizes better, tends to achieve higher accuracy and captures context better.
Hybrid approach: a pragmatic balance between the two approaches, and the recommended one. Recent developments in deep learning and NLP have allowed systems to obtain better results, particularly in named entity recognition.
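
As a rough illustration of the rule-based category, the sketch below combines a few regular expressions with a small name dictionary. The patterns, the labels and the `NAME_DICTIONARY` entries are illustrative assumptions, deliberately simplistic, and are not the rules used by the software described here.

```python
import re

# Illustrative patterns only. Order matters: the IBAN rule runs before the
# phone rule so that the digits inside an IBAN are not mistaken for a phone.
RULES = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "IBAN": re.compile(r"\b[A-Z]{2}\d{2}[A-Z0-9]{11,30}\b"),
    "PHONE": re.compile(r"\+?\d[\d\s./-]{7,}\d"),
}

# Hypothetical dictionary of known names (for example, a staff or patient list).
NAME_DICTIONARY = {"Mario", "Rossi"}


def deidentify(text: str) -> str:
    """Replace every rule or dictionary match with a bracketed label."""
    for label, pattern in RULES.items():
        text = pattern.sub(f"[{label}]", text)
    for name in NAME_DICTIONARY:
        text = re.sub(rf"\b{re.escape(name)}\b", "[NAME]", text)
    return text


sample = (
    "Patient Mario Rossi, tel. +39 06 1234 5678, "
    "email mario.rossi@example.org, IBAN IT60X0542811101000000123456."
)
cleaned = deidentify(sample)
```

Every replacement can be traced to a specific rule, which is what makes this approach explainable. The same property makes it brittle: a name missing from the dictionary or a phone number in an unexpected format passes through untouched, which is the gap that model-based and hybrid approaches aim to close.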

Context analysis

Our team works closely with the healthcare organization to review the general requirements, configuration, infrastructure and data anonymization needs. A legal risk analysis is carried out to understand the requirements of applicable law (GDPR, privacy legislation, etc.). The analysis covers the types of information to anonymize (names, phone numbers, postal and email addresses, racial or ethnic origin, etc.) and determines how to remove the patient identifiers.
It also determines how de-identified data must be presented in the output: replaced with surrogates (for example, random names drawn from a directory), replaced with a placeholder, or simply deleted.
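
The three output options (random replacement from a directory, placeholder, deletion) could be captured as a per-label policy, as in the following sketch. The names `Strategy`, `SURROGATES` and `apply_policy`, and the `(start, end, label)` span format, are assumptions made for illustration and do not describe the actual configuration of the product.

```python
import random
from enum import Enum


class Strategy(Enum):
    SURROGATE = "surrogate"      # replace with a random entry from a directory
    PLACEHOLDER = "placeholder"  # replace with a tag such as <NAME>
    DELETE = "delete"            # drop the text entirely


# Hypothetical surrogate directory; a real one would be much larger.
SURROGATES = {"NAME": ["Alex Doe", "Sam Roe", "Chris Poe"]}


def render(label, strategy, rng):
    """Return the text that replaces one detected identifier."""
    if strategy is Strategy.SURROGATE:
        return rng.choice(SURROGATES[label])
    if strategy is Strategy.PLACEHOLDER:
        return f"<{label}>"
    return ""


def apply_policy(text, spans, policy, rng=None):
    """Apply the chosen strategy to each (start, end, label) span.

    Spans are assumed not to overlap.
    """
    rng = rng or random.Random()
    # Work right to left so earlier offsets stay valid after each substitution.
    for start, end, label in sorted(spans, reverse=True):
        text = text[:start] + render(label, policy[label], rng) + text[end:]
    return text


text = "Mario Rossi called 06 1234 5678."
spans = [(0, 11, "NAME"), (19, 31, "PHONE")]
policy = {"NAME": Strategy.PLACEHOLDER, "PHONE": Strategy.DELETE}
result = apply_policy(text, spans, policy)
```

Keeping the policy separate from detection means the same detected identifiers can be rendered differently for different recipients without re-running the analysis.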

Removal of identifiers

After the previous phase, which is carried out by people, the next phase uses NLP software. Here the code and pipelines are configured, depending on the context, to use NLP to remove the identifiers of patients and healthcare professionals determined in the previous step.

Measurement of results

This is once again a human step, in which the team tests and measures performance and accuracy on a sample of data. In addition, the agreed jobs are executed, sampling is set up, and the data-cleaning platform is installed in the customer's server farm.

Depending on the results of this step, the team modifies, if necessary, the anonymization pipeline created in the previous step. If accuracy and performance meet the required levels, the process moves on to the anonymization step.
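
One common way to measure accuracy on a sample is to compare the identifiers found by the pipeline with a manually annotated reference and compute precision, recall and F1. The sketch below assumes exact span matching and a hypothetical recall threshold; the page does not state which metrics or thresholds the team actually uses.

```python
def evaluate(gold: set, predicted: set) -> dict:
    """Compare predicted identifier spans with manually annotated ones.

    Each span is a (start, end, label) tuple; only exact matches count.
    """
    true_pos = len(gold & predicted)
    false_pos = len(predicted - gold)   # flagged but not an identifier
    false_neg = len(gold - predicted)   # identifier that was missed
    precision = true_pos / (true_pos + false_pos) if predicted else 0.0
    recall = true_pos / (true_pos + false_neg) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


def meets_required_level(metrics: dict, min_recall: float) -> bool:
    """Hypothetical gate: missed identifiers are the main privacy risk."""
    return metrics["recall"] >= min_recall


gold = {(0, 11, "NAME"), (19, 31, "PHONE"), (40, 50, "DATE")}
predicted = {(0, 11, "NAME"), (19, 31, "PHONE"), (60, 65, "NAME")}
metrics = evaluate(gold, predicted)
ready = meets_required_level(metrics, min_recall=0.95)
```

Recall is used for the gate in this sketch because a missed identifier leaks personal data, whereas a false positive only removes harmless text; a real acceptance criterion might combine several metrics.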

De-identification of data

In this phase, the team runs the NLP-based de-identification pipeline, which is sophisticated enough to handle complex unstructured texts and images, eliminating the need for manual processes that are time-consuming and subject to human error.

Our NLP system supports many types of data for de-identification, including:

Structured tables and data sets
Free-text documents
DICOM (Digital Imaging and Communications in Medicine) documents
Scanned PDFs
Medical imaging data
Pathology images, and more

At the end of this phase, the following transformations can be performed on patient identifiers:

Delete or replace text
Obfuscate names, places, organizations, etc.
Generalize disease codes, dates and addresses
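
Generalization replaces a precise value with a coarser one that is still useful for analysis. The helpers below are a minimal sketch under stated assumptions: the function names, the choice of keeping only the year, the three retained postal-code characters and the truncation of ICD-10 codes to their category are illustrative, not the rules applied by the product.

```python
from datetime import date


def generalize_date(value: date) -> str:
    """Keep only the year of a date such as a date of birth."""
    return str(value.year)


def generalize_postal_code(code: str, keep: int = 3) -> str:
    """Keep the first `keep` characters and mask the rest."""
    return code[:keep] + "*" * max(len(code) - keep, 0)


def generalize_icd10(code: str) -> str:
    """Reduce an ICD-10 code to its category (the part before the dot)."""
    return code.split(".")[0]


record = {
    "birth_date": generalize_date(date(1980, 1, 1)),
    "postal_code": generalize_postal_code("00184"),
    "diagnosis": generalize_icd10("E11.9"),
}
```

How coarse each field should be depends on the legal risk analysis: the fewer people share a given combination of generalized values, the easier re-identification becomes.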

Monitoring of de-identified data

The anonymization service does not stop at anonymizing the data. It also includes operational support and continuous performance measurement to maintain the quality, consistency and reliability of the de-identified results. In particular, we carry out the following activities:

Improve NLP models
Simplify incident response
Manage GDPR and CCPA removal requests
Perform audits of data and processes
