
Levels of Data De-Identification Explained

What do regulators and lawyers mean when they talk about de-identification? Despite how often it's discussed, there isn't a universal meaning in the world of data privacy. The differences between varying degrees of de-identification are often subtle or unclear, leading to confusion for organizations learning to navigate privacy regulations such as the GDPR or HIPAA.

For instance, how does anonymous data differ from pseudonymous or de-identified information?

Data privacy lies on a spectrum, but it can still be broken down into levels so it’s easier to understand and (more importantly) implement.

There are 3 primary factors to consider when assessing the comparative strengths of each level:

  • Direct identifiers — data that identifies a person without additional information or by linking to information in the public domain (e.g., name, address, phone number, SSN).
  • Indirect identifiers — data that identifies an individual indirectly. In other words, it helps connect pieces of information until an individual can be singled out (e.g., DOB, gender).
  • Safeguards and controls — technical, organizational, and legal controls helping prevent employees, researchers, or other third parties from re-identifying individuals.

Basic Degrees of Identifiability

At this level, there are no real steps taken to de-identify the data, though at some stages there may be some basic safeguards in place. This information contains both direct and indirect identifiers, with no effort made to mask or eliminate either.

Explicitly Personal

This is simply the lack of any privacy or security measures — all data is exposed.

  • Examples: Name, address, phone number (e.g., John Smith, 123 Cherry Street, 555-555-5555), as well as an SSN or government-issued ID
  • Effectiveness: None. Direct and indirect identifiers are fully intact.

Potentially Identifiable

This type of data doesn’t include direct identifiers like names or addresses, but can still be used to identify an individual.

  • Examples: unique device ID (like a MAC address), license plate number, medical record number, cookie, IP address
  • Effectiveness: Minimal. Direct identifiers (like names or SSNs) are partially masked, though indirect identifiers (like DOB or gender) are still intact. There may be some limited safeguards in place.

Not Readily Identifiable

This is the same as “Potentially Identifiable,” except that data are also protected by safeguards and controls.

  • Examples: a hashed MAC address, combined with legal representations that the data won't be re-identified
  • Effectiveness: Minimal; a slight step up from "Potentially Identifiable," only because some controls or safeguards are in place. Direct identifiers are partially masked, though indirect identifiers remain intact.
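To illustrate, a hashed device identifier like the MAC address above might be produced with a keyed hash, where the secret key itself acts as one of the safeguards. This is a simplified Python sketch; the key name and its handling here are illustrative, not a recommended configuration:

```python
import hashlib
import hmac

# Hypothetical secret key; in practice it would live in a key-management
# system and serve as one of the "safeguards and controls" described above.
SECRET_KEY = b"replace-with-a-managed-secret"

def hash_mac(mac_address: str) -> str:
    """Return a keyed (HMAC-SHA256) hash of a MAC address.

    A keyed hash, unlike a plain one, can't be reversed by brute-forcing
    the (small) space of possible MAC addresses without the key.
    """
    return hmac.new(SECRET_KEY, mac_address.encode(), hashlib.sha256).hexdigest()

# The same input always maps to the same 64-character token,
# so records can still be linked without exposing the device ID.
token = hash_mac("00:1A:2B:3C:4D:5E")
```

Note that the hash alone is only "not readily" identifiable: anyone holding the key can recompute the mapping, which is why the accompanying controls matter.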

Pseudonymous Data

At this level, you have removed or transformed any direct identifiers, but indirect identifiers remain intact, risking individual privacy in the event of a data breach. The levels of Pseudonymous Data can be divided into “Key-Coded,” “Pseudonymous,” and “Protected Pseudonymous.”

Key-Coded

With Key-Coded data, you replace the identity of the individuals with a unique subject identification code (which isn’t derived from any info related to that individual), so there are no direct identifiers. This might include clinical or research datasets where only a curator retains the key. This also permits auditing by drug regulatory authorities when needed.

  • Examples: Jane Smith, diabetes, HgB 15.1 g/dl = Csrk123
  • Effectiveness: Moderate. Direct identifiers have been eliminated or transformed, but indirect identifiers remain intact. There are controls in place.
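A minimal sketch of key-coding in Python, under illustrative assumptions (the code format and field names are made up; in practice the key would be held only by the data curator):

```python
import secrets

def key_code(records, id_field="name"):
    """Replace each record's identity with a random subject code.

    Returns (coded_records, key): `key` maps code -> original identity
    and would be held only by the curator, e.g. for regulatory audits.
    """
    coded, key = [], {}
    for record in records:
        code = "S" + secrets.token_hex(4)  # not derived from the individual
        key[code] = record[id_field]
        coded.append({**{k: v for k, v in record.items() if k != id_field},
                      "subject_code": code})
    return coded, key

records = [{"name": "Jane Smith", "diagnosis": "diabetes", "hgb_g_dl": 15.1}]
coded, key = key_code(records)
```

The coded records carry no direct identifiers, yet the curator's key preserves the ability to re-link a subject when an audit requires it.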

Pseudonymous

Pseudonymous data refers to the practice where personally identifiable fields within a data record are replaced by one or more unique, artificial identifiers (pseudonyms). Data is considered pseudonymized when it doesn't contain explicit personal data, only unique references to it.

  • Examples: HIPAA Limited Datasets, “John Doe” becoming 5L7T LX619Z (a unique sequence not used anywhere else).
  • Effectiveness: Moderate. Direct identifiers have been eliminated or transformed, but indirect identifiers remain intact. There may be some limited controls in place.

Protected Pseudonymous

Protected Pseudonymous data is nearly the same as Pseudonymous, with the extra step that the data is also protected by safeguards and controls.

  • Examples: see “Pseudonymous”
  • Effectiveness: Moderate. Direct identifiers have been eliminated or transformed, but indirect identifiers remain intact. There are controls in place.

De-Identified

With De-Identified data, direct and any known indirect identifiers have been removed or manipulated in order to break the linkage to real-world identities. This includes everything from names and addresses to account numbers, health plan beneficiary numbers, and biometrics.

The levels of De-Identified data can be divided into “De-Identified” and “Protected De-Identified.” While de-identification can be sufficient for some use cases, bear in mind that de-identifying data doesn’t guarantee protection from re-identification.

De-Identified

For data de-identification, values in the dataset are suppressed, generalized, swapped, and so on.

  • Examples: GPA of 3.2 becoming 3.0-3.5, changing “gender: female” to “gender: male”
  • Effectiveness: Somewhat High. Both direct and indirect identifiers have been eliminated or transformed, but there are little to no safeguards or controls in place.
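Generalization like the GPA example above can be sketched in a few lines of Python. The half-point banding is just one illustrative choice, and the sketch assumes GPAs are reported to one decimal place:

```python
def generalize_gpa(gpa: float) -> str:
    """Generalize an exact GPA into a half-point band, e.g. 3.2 -> "3.0-3.5".

    Assumes GPAs are reported to one decimal place; the band width (0.5)
    is an illustrative choice, traded off against data utility.
    """
    low = (round(gpa * 10) // 5) * 5 / 10  # floor to the nearest 0.5
    return f"{low:.1f}-{low + 0.5:.1f}"
```

A wider band hides more, but also erodes the analytical value of the field; picking the band width is exactly the privacy/utility tradeoff de-identification has to manage.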

Protected De-identified

Protected De-Identified is the same as De-Identified, except that data is also protected by safeguards and controls, adding an extra layer of protection.

  • Examples: See “De-Identified”
  • Effectiveness: High. Both direct and indirect identifiers have been eliminated or transformed, and there are controls in place to protect it. However, all De-Identified data still runs some risk of being re-identified once accessed.

Anonymous Data

With Anonymous Data, direct and indirect identifiers have been removed or manipulated, together with technical and mathematical guarantees that prevent re-identification. This last step is key, because it addresses the remaining issues left open by most data privacy practices. The levels of Anonymous Data can be divided into "Anonymous" and "Aggregated Anonymous."

Anonymous

With Anonymous data, steps are taken to hide whether or not an individual is present in the dataset, such as by introducing statistical noise (e.g., randomly generated values).

  • Examples: differential privacy, such as with iOS keyboards: instead of registering what any individual is typing, Apple looks at what its users are typing collectively, in order to provide better suggestions.
  • Effectiveness: High. Both direct and indirect data have been eliminated or transformed, and safeguards and controls are no longer relevant due to the nature of the data. However, this method can damage conclusions for research (such as in clinical trials), as there is a tradeoff between privacy and accuracy when statistical noise is introduced to smaller data sets.
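One standard way to introduce such noise is the Laplace mechanism from differential privacy, which perturbs a released statistic by an amount calibrated to a privacy parameter. A simplified Python sketch, with illustrative (not production) parameter values:

```python
import math
import random

def laplace_count(true_count: float,
                  sensitivity: float = 1.0,
                  epsilon: float = 1.0) -> float:
    """Release a count with Laplace noise scaled to sensitivity/epsilon.

    This is the classic differential-privacy mechanism: smaller epsilon
    means more noise and stronger privacy. Values here are illustrative.
    """
    scale = sensitivity / epsilon
    u = random.random() - 0.5  # uniform in [-0.5, 0.5)
    noise = -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))
    return true_count + noise

noisy = laplace_count(120)  # near 120, but randomized on every release
```

The tradeoff mentioned above is visible directly in the parameters: lowering epsilon strengthens the privacy guarantee but widens the noise, which is why small datasets suffer the most accuracy loss.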

Aggregated Anonymous

Aggregated Anonymous data is, as the name suggests, anonymous data that has been aggregated at a very high level. Not only does this mask the presence or absence of any individual, but it does so at a much larger scale. This allows entities to share anonymous, de-identified data (permanently and irrevocably) while still meeting their data-sharing needs, with no risk of violating regulations like HIPAA or GDPR.

  • Examples: statistical data, census data, population data like “52.6% of Washington, DC residents are women,” or data provided by TripleBlind’s software solution.
  • Effectiveness: Very High. Both direct and indirect data have been eliminated or transformed, and safeguards and controls are no longer relevant due to the high degree of data aggregation.
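A minimal sketch of the aggregation idea, releasing only a population-level share rather than any individual rows (the field names and the 52.6% figure from the example above are illustrative):

```python
from collections import Counter

def aggregate_share(records, field, value):
    """Release only an aggregate percentage; no individual rows leave
    the function, which is what makes the output anonymous."""
    counts = Counter(r[field] for r in records)
    return round(100 * counts[value] / sum(counts.values()), 1)

# Hypothetical population mirroring the "52.6% of residents are women" example.
residents = [{"gender": "female"}] * 526 + [{"gender": "male"}] * 474
share = aggregate_share(residents, "gender", "female")
```

Because only the summary statistic is released, the presence or absence of any one record cannot be read off the output once the population is large enough.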

Though there is no binary categorization of “identifiable vs. non-identifiable,” you can think of de-identification in terms of the degree to which the data has been de-identified, and the level of safeguards or controls in place to protect access to it.

Stronger de-identification methods eliminate or transform not only direct identifiers, but also indirect identifiers, which could otherwise be used to identify an individual. The full scope of what qualifies as an "indirect identifier" is growing, however, and traditional de-identification methods aren't always sufficient. The most effective forms of de-identification allow data to be collaborated on while rendering re-identification impossible.

The TripleBlind Solution provides a more comprehensive approach to de-identification, addressing key limitations associated with de-identification and anonymization. TripleBlind’s privacy-enhancing technology offers the following benefits:

  • TripleBlind supports security and privacy for all parties. Our technology allows for multiple data partners to collaborate on sensitive data in one-way encrypted space, safeguarding data partners’ sensitive information and processing algorithms.
  • TripleBlind offers the ability to process complex data. While many privacy-enhancing techniques are focused on text, our technology allows for the processing of images, video, voice recordings, genomic, and unstructured data.
  • TripleBlind allows data gathering organizations to maintain possession of their data at all times. Our system uses one-way, one-time encryption. This ensures data can only be used by authorized parties. Our technology allows organizations to keep their data in-house, addressing data residency issues.

We’re excited to share healthcare data security best practices and solutions in our healthcare data whitepaper. If you’d like to learn more about how the TripleBlind Solution offers next-generation security and privacy, contact us today to schedule a demo.