Services

De-identification of health data

Our end-to-end, turnkey, fully customizable health data de-identification software tackles these challenges effectively with the power of natural language processing (NLP).

As organizations produce and store ever more personal data, data privacy is becoming a growing priority. Data supports cutting-edge research, drives innovation, and helps develop solutions to real-world problems. This is particularly true in the health sector.

The right type, amount, and quality of digitized data gives healthcare professionals critical information about patients, enables more effective and efficient communication with them, accelerates diagnosis, and makes it possible to provide better, more efficient care. Medical researchers and healthcare organizations leverage data to develop new drugs, treatments, and vaccines, to identify risk factors for disease, to prevent or respond to epidemics, and to spread knowledge that improves public health and extends human longevity.

However, using this data can compromise the privacy of the people it belongs to. In recent decades, laws such as HIPAA have evolved to protect people's privacy in the United States. Other jurisdictions, including Canada, Australia, and the EU (GDPR), have developed their own data privacy laws. Healthcare organizations, professionals, and researchers in all of these jurisdictions must adhere to these regulations to show that they take patient privacy seriously and to avoid the financial and legal consequences of non-compliance. To achieve compliance, they must remove patients' identifying data, known as protected health information (PHI), from the records they use. De-identification is also a requirement for organizations that want to train machine learning models to analyze or process patient-level data for research or other purposes.

De-identification is a technique for removing from a data set any data that could identify a person. It protects information that identifies an individual or a company by eliminating all personally identifying information, so that the data cannot be traced back to the person it belongs to. Personal identifiers include:

  • Name and surname
  • Geographic data such as address, city, and postal code
  • Dates directly related to an individual, such as date of birth, date of discharge, or date of death
  • Tax code
  • Social insurance number
  • Phone numbers
  • Email addresses
  • Medical record numbers
  • Biometric identifiers, including fingerprints and voiceprints
  • Recognizable photographs
  • IP addresses used to access the system
  • Bank account numbers (IBAN)

De-identification is sometimes used interchangeably with anonymization, although there is a difference between them:

  • De-identification involves removing explicit personal identifiers, for example by replacing real names with generic fictitious aliases that differ from person to person, so that the patient cannot be unambiguously determined from the de-identified data.
  • Anonymization aims to ensure that the data can no longer be traced back to the individual at all. By contrast, replacing a patient with a numeric code that is uniquely associated with that person is better described as pseudonymization, because anyone who holds the association table can use the code to re-identify the patient.
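
The role of the association table can be illustrated with a minimal Python sketch. The record layout, the `pseudonymize` and `reidentify` helpers, and the four-digit code format are assumptions made for illustration only, not a description of any particular product.

```python
def pseudonymize(records, key="name"):
    """Replace the value under `key` with a numeric code.

    Returns the rewritten records and the association table
    (real value -> code), which must be stored separately and protected.
    """
    table = {}
    out = []
    for rec in records:
        # Reuse the code if this person was already seen, otherwise issue the next one.
        code = table.setdefault(rec[key], f"{len(table) + 1:04d}")
        out.append({**rec, key: code})
    return out, table


def reidentify(code, table):
    """Look a code up in the association table; returns None if it is not there."""
    reverse = {c: name for name, c in table.items()}
    return reverse.get(code)


records = [
    {"name": "Maria Rossi", "diagnosis": "J45"},
    {"name": "Luca Bianchi", "diagnosis": "E11"},
    {"name": "Maria Rossi", "diagnosis": "I10"},
]
coded, table = pseudonymize(records)
```

As long as `table` exists, `reidentify` can map a code back to a patient. Once the table is destroyed, the codes alone no longer lead back to a name, although other fields in the record may still allow re-identification.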

In a world of frequent privacy scandals, it is essential to understand the importance of de-identifying health data: it helps ensure that individuals' data is not disclosed to third parties or shared inappropriately, thereby limiting potential harm to privacy and the risk of GDPR violations.

De-identification has become increasingly common since the GDPR (General Data Protection Regulation) entered into force. Even so, there are many cases in which people's personal health information has been compromised without their knowledge (a personal data breach) or without their consent, because health-related companies lacked adequate security measures.

Traditionally, healthcare organizations and researchers used manual methods to anonymize patient data and prepare it for further processing and analysis. This meant recruiting a team of people to review each document page by page and line by line, look for any personal identifier that could identify an individual, such as a name, address, or phone number, and finally remove those identifiers by hand to produce de-identified health information.

The limitation of this approach is that it relies on the human eye and human attention to detail. Because humans are fallible, a reviewer may miss one or more personal identifiers and wrongly approve a document as de-identified and suitable for further processing and analysis with machine learning models. This not only puts individuals' privacy at risk but also increases the risk of non-compliance with the GDPR, which can cause serious legal and financial problems for the organization if the data protection authority (in Italy, the Garante) imposes sanctions.

Another limitation of manual anonymization of health data is that it is slow. Worldwide, the health sector produces billions of clinical documents, and the number grows year after year. Every healthcare organization produces hundreds of thousands, if not millions, of clinical documents each year, and more are created every day. With the traditional approach, it is impossible to review and anonymize all the data in these documents quickly enough for research, effectiveness studies, evaluations, policy-making, and other use cases.

Most data collected in these contexts is stored in relational and non-relational databases, which usually hold different types of data recorded as structured and unstructured data. Unstructured data is generally stored in its native format (for example, PDF documents or images in DICOM format), while structured data is clearly defined, coded, and searchable. Because the data types differ, de-identification must be applied differently to each one, and it can range from simple obfuscation or encryption to more complex processes such as hashing or masking. For text, de-identification takes the form of named entity recognition (NER) in NLP, and approaches fall into three categories:

  • Rules-based approach: uses rules, pattern matching, and dictionaries to anonymize text documents. Although this approach requires considerable domain expertise and can be hard to maintain as the data drifts, it is highly explainable.
  • Model-based approach: uses machine learning models to anonymize text, addressing the lack of resilience of rule-based systems. This approach generalizes better, tends to achieve higher accuracy, and captures context better.
  • Hybrid approach: a pragmatic balance between the two approaches, and the recommended one. Recent developments in deep learning and NLP have allowed systems to obtain better results, particularly in named entity recognition.
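
To make the hybrid idea concrete, here is a minimal Python sketch that combines regex rules with a model-style detector and merges their spans. Everything in it is an illustrative assumption: the two patterns in `RULES` are deliberately simple, and `model_spans` is only a stand-in for a trained NER model (it looks up a fixed list of names), so the sketch shows how the pieces fit together rather than how a production system detects entities.

```python
import re

# Hypothetical rule set: regexes for identifiers with a regular surface form.
RULES = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "PHONE": re.compile(r"\+?\d[\d\s-]{7,}\d"),
}


def rule_spans(text):
    """Rules-based detector: returns (start, end, label) for every regex match."""
    return [(m.start(), m.end(), label)
            for label, rx in RULES.items()
            for m in rx.finditer(text)]


def model_spans(text, known_names=("Maria Rossi",)):
    """Stand-in for an ML NER model; a real system would call a trained model here."""
    return [(m.start(), m.end(), "NAME")
            for name in known_names
            for m in re.finditer(re.escape(name), text)]


def merge(spans):
    """Keep non-overlapping spans, preferring the earliest and then the longest."""
    accepted = []
    for span in sorted(spans, key=lambda s: (s[0], -(s[1] - s[0]))):
        if not accepted or span[0] >= accepted[-1][1]:
            accepted.append(span)
    return accepted


def deidentify(text):
    """Hybrid pipeline: union of rule and model spans, replaced by placeholders."""
    spans = merge(rule_spans(text) + model_spans(text))
    # Replace from the end so earlier offsets stay valid.
    for start, end, label in reversed(spans):
        text = text[:start] + f"[{label}]" + text[end:]
    return text


sample = "Maria Rossi can be reached at maria.rossi@example.org or +39 081 555 0100."
result = deidentify(sample)
```

The rules catch identifiers that follow a fixed pattern, the model component is meant to catch context-dependent ones such as names, and `merge` resolves cases where both fire on the same text.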

Our team works closely with the healthcare organization to review the general requirements, configuration, infrastructure, and the data to be anonymized. A legal risk analysis is carried out to understand the requirements of applicable law (GDPR, privacy legislation, etc.). The analysis covers the types of information to anonymize, such as names, phone numbers, postal and email addresses, and racial or ethnic origin, and determines how patient identifiers are to be removed.
It also determines how de-identified data must appear in the output: replaced with surrogates (for example, random names drawn from a directory), replaced by a placeholder, or simply deleted.
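
The three output options can be sketched as a single function with a strategy switch. This is a hedged Python illustration: the `render` function, the strategy names, and the `SURROGATE_NAMES` directory are hypothetical, and the spans are assumed to have been detected already.

```python
import random

# Hypothetical directory of surrogate names used for replacement.
SURROGATE_NAMES = ["Alex Morgan", "Sam Taylor", "Jamie Lee"]


def render(text, spans, strategy="placeholder", rng=None):
    """Apply one output strategy to detected spans given as (start, end, label)."""
    rng = rng or random.Random()
    # Work from the end of the text so earlier offsets stay valid.
    for start, end, label in sorted(spans, reverse=True):
        if strategy == "placeholder":
            replacement = f"<{label}>"
        elif strategy == "surrogate":
            # Only names have a surrogate directory in this sketch.
            replacement = rng.choice(SURROGATE_NAMES) if label == "NAME" else f"<{label}>"
        elif strategy == "delete":
            replacement = ""
        else:
            raise ValueError(f"unknown strategy: {strategy}")
        text = text[:start] + replacement + text[end:]
    return text


note = "Patient Maria Rossi was discharged."
spans = [(8, 19, "NAME")]
with_placeholder = render(note, spans, "placeholder")
with_surrogate = render(note, spans, "surrogate", random.Random(0))
with_deletion = render(note, spans, "delete")
```

Surrogates keep the text readable for downstream analysis, placeholders make the redaction visible, and deletion removes the most information; which one fits depends on the outcome of the legal analysis.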

After the previous phase, which is carried out by people, the next phase relies on the NLP software. Here the code and pipelines are configured for the context, using NLP to remove the patient and healthcare professional identifiers determined in the previous step.

This is once again a human step, in which the team tests and measures performance and accuracy on a sample of data. In addition, the agreed jobs are run, sampling is set up, and the data-cleaning platform is installed in the customer's server farm.

Depending on the results of this step, the team modifies, if necessary, the anonymization pipeline created in the previous step. If accuracy and performance meet the required levels, the process moves on to the anonymization step.
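
One way to picture this quality gate is to compare the pipeline's output against a manually annotated sample and check the scores against agreed thresholds. The sketch below is an assumption-laden illustration: it uses exact-match spans, and the threshold values `min_recall` and `min_precision` are placeholders, not figures from any real engagement.

```python
def score(gold, predicted):
    """Precision and recall on exact-match spans given as (start, end, label)."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 1.0
    recall = true_positives / len(gold) if gold else 1.0
    return precision, recall


def ready_for_production(gold, predicted, min_recall=0.95, min_precision=0.90):
    """Gate: proceed only if both metrics meet the (hypothetical) required levels."""
    precision, recall = score(gold, predicted)
    return recall >= min_recall and precision >= min_precision


# Annotated sample (gold) versus pipeline output (predicted).
gold = {(0, 11, "NAME"), (30, 45, "EMAIL"), (50, 60, "PHONE"), (70, 80, "DATE")}
predicted = {(0, 11, "NAME"), (30, 45, "EMAIL"), (50, 60, "PHONE"), (90, 95, "NAME")}

precision, recall = score(gold, predicted)
proceed = ready_for_production(gold, predicted)
```

In de-identification, recall usually matters most, because a missed identifier is a privacy leak, whereas a false positive only removes some non-sensitive text.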

In this phase, the team runs the NLP-based de-identification pipeline, which is sophisticated enough to handle complex unstructured text and images, eliminating the need for manual processes that are time-consuming and prone to human error.

Our NLP system supports many types of data for de-identification, including:

  • Structured tables and data sets
  • Free-text documents
  • DICOM (Digital Imaging and Communications in Medicine) documents
  • Scanned PDFs
  • Medical imaging data
  • Pathology images, and more
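
For structured tables, de-identification is typically applied column by column rather than through entity recognition. The following Python sketch shows one possible treatment using only the standard library; the column names, the `deidentify_row` helper, the secret key, and the choice of a truncated HMAC-SHA256 digest are all assumptions for illustration.

```python
import hashlib
import hmac


def keyed_hash(value, key):
    """Keyed hash (HMAC-SHA256), truncated to 16 hex characters for readability."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()[:16]


def mask(value, visible=2, fill="*"):
    """Hide all but the last `visible` characters."""
    keep = value[-visible:] if visible else ""
    return fill * (len(value) - len(keep)) + keep


def deidentify_row(row, key):
    """Hash the record number, mask the phone number, and drop the name."""
    out = dict(row)
    out["patient_id"] = keyed_hash(row["patient_id"], key)
    out["phone"] = mask(row["phone"])
    out.pop("name", None)
    return out


secret_key = b"replace-with-a-managed-secret"  # hypothetical; keep real keys in a secrets store
row = {"patient_id": "MRN-001", "name": "Maria Rossi",
       "phone": "0815550100", "diagnosis": "J45"}
clean = deidentify_row(row, secret_key)
```

A keyed hash keeps the same patient linkable across tables without exposing the original identifier, whereas a plain unsalted hash of a short identifier could be reversed by brute force.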

At the end of this phase, the following transformations can be applied to patient identifiers:

  • Delete or replace text
  • Obfuscate names, places, organizations, etc.
  • Generalize disease codes, dates, and addresses
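
Generalization reduces the precision of a value instead of removing it. A minimal Python sketch follows, assuming dates are reduced to the year, postal codes keep only their leading digits, and disease codes are ICD-10 codes truncated to their three-character category; the function names and the exact granularity are illustrative choices, not requirements from any regulation.

```python
from datetime import date


def generalize_date(d):
    """Keep only the year of a date."""
    return str(d.year)


def generalize_postal_code(code, keep=3):
    """Keep the first `keep` characters and blank out the rest."""
    return code[:keep] + "X" * (len(code) - keep)


def generalize_icd10(code):
    """Truncate an ICD-10 code to its three-character category."""
    return code.split(".")[0][:3]


birth_year = generalize_date(date(1984, 3, 17))
area = generalize_postal_code("80121")
category = generalize_icd10("J45.901")
```

The coarser the value, the lower the re-identification risk, but also the lower the analytical value of the data, so the granularity is usually agreed during the requirements phase.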

The anonymization service does not stop at anonymizing the data. It also includes operational support and continuous performance measurement to maintain the quality, consistency, and reliability of the de-identified results. In particular, we carry out the following activities:

  • Improve the NLP models
  • Simplify incident response
  • Manage GDPR and CCPA deletion requests
  • Perform data and process audits

© 2023 Copyright TALENCE Srl - P. IVA 10316311215 - All rights reserved