The world is filled with valuable information, but organizations must properly handle data from collection to storage to usage. Failing to follow the necessary procedures can violate individual privacy and, in many cases, the law.
When people or organizations want to use or share sensitive information, they will turn to manual de-identification, redaction, and/or anonymization. These techniques are focused on preventing the release of personally identifiable information (PII), which is any information that can be used to identify a specific individual, such as names or Social Security numbers. These privacy-enhancing techniques are also often used to remove personal financial information, such as a credit card number or bank account number.
Organizations that handle PII are legally responsible for protecting it from unauthorized access and use, typically by way of manual data de-identification, redaction, and/or anonymization. These methods may sound similar, but each one is considered to be a distinct process.
Personally Identifiable Information
According to the Department of Homeland Security, PII is any kind of information that can be used to identify a specific individual. PII includes names, addresses, Social Security numbers, phone numbers, images, videos, biometrics, and in some cases IP addresses.
For legal purposes, this definition applies to both the public and private sectors.
What Is PII Redaction?
Anyone who has seen the publicly released versions of sensitive legal or government documents has seen redaction at work. In these documents, redacted information, such as names of informants or victims who wished to remain anonymous, is covered by black markings or boxes.
The purpose of redaction is to eliminate PII from specific documents, videos, or other media so that they can be shared or released to the public. This allows for the use or consumption of important information without revealing the identities of entities that wish to remain anonymous, possibly for safety, business, or legal reasons.
Redaction is more than going over a sheet of paper with a large black marker or placing large black boxes over the sensitive parts of video footage. It is critical for the editor to know which pieces of information must be redacted.
A proper redaction process requires a strong understanding of the applicable privacy laws and the subject matter of the record to be redacted. Every redaction situation is unique. Information that must be redacted in one situation could be perfectly acceptable for release in another situation.
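For text records, the mechanical part of redaction can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the pattern covers US Social Security numbers written as 123-45-6789, and `redact_ssns` is a hypothetical helper name. Deciding what must be redacted still requires the legal and subject-matter judgment described above.

```python
import re

# Assumption: SSNs appear in the common 123-45-6789 layout.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def redact_ssns(text: str, mask: str = "█") -> str:
    """Cover each matched SSN with mask characters of the same length."""
    return SSN_PATTERN.sub(lambda match: mask * len(match.group()), text)


print(redact_ssns("Applicant SSN: 123-45-6789 (verified)"))
```

Keeping the mask the same length as the original value mirrors the black boxes in a released document, but it also reveals how long the hidden value was. A fixed-width mask is a reasonable alternative when that matters.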
While redaction is extremely important in specific cases, it is also extremely time-consuming, and it reduces both the quality and the quantity of the data. Hence, redaction is not particularly useful for aggregating, storing, and processing large amounts of data.
What Is Manual De-Identification of PII?
De-identification is the process by which PII is removed from a personal data record, such as a hospital’s patient records. Before a dataset containing PII can be used, de-identification must be performed in order to remain compliant with privacy laws.
Automatic de-identification scripts are frequently used to remove PII from records. However, scripts may not catch every piece of PII, potentially skipping over unusual names, addresses, or any information that could be traced back to a specific person. Additionally, there may be sensitive information stored in an unexpected field, which may not be detected by the script. Therefore, a laborious manual de-identification step is often used to double-check that a dataset has been properly de-identified.
In a typical manual de-identification process, sensitive data records are scanned for PII. When a piece of PII is found, it is replaced with an appropriate de-identifying tag. For example, a single placeholder tag might be used to replace first and last names found within a record. For this process, it is best to divide the work among multiple team members, if possible, because it can take a fair amount of time, depending on the size of the dataset.
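The combination of an automatic pass followed by a manual check might look roughly like the sketch below. Everything named here is an assumption made for illustration: the `[NAME]` and `[PHONE]` tags, the `KNOWN_NAMES` list, and the simple capitalized-word-pair heuristic used to flag records for a human reviewer.

```python
import re

KNOWN_NAMES = ["Joe Smith", "Zachary Zimmer"]  # hypothetical list of known names
NAME_TAG = "[NAME]"    # hypothetical tag, not taken from any standard
PHONE_TAG = "[PHONE]"  # hypothetical tag
PHONE_PATTERN = re.compile(r"\b\d{3}-\d{3}-\d{4}\b")
# Crude heuristic: two capitalized words in a row may be a name the script missed.
POSSIBLE_NAME = re.compile(r"\b[A-Z][a-z]+ [A-Z][a-z]+\b")


def deidentify(record: str) -> str:
    """Automatic pass: replace known names and phone numbers with tags."""
    for name in KNOWN_NAMES:
        record = record.replace(name, NAME_TAG)
    return PHONE_PATTERN.sub(PHONE_TAG, record)


def needs_manual_review(record: str) -> bool:
    """Flag records that may still contain a name after the automatic pass."""
    return POSSIBLE_NAME.search(record) is not None


cleaned = deidentify("Joe Smith was referred by Ana Ruiz, call 716-555-0100")
print(cleaned, needs_manual_review(cleaned))
```

The heuristic will produce false positives (place names, for instance) and miss single-word or unusual names, which is exactly why the manual step remains necessary.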
It is important to note that de-identified records are not completely anonymous. Avoiding complete anonymity can be useful in some situations. For example, if records collected before a certain date must be deleted, this is possible with de-identified records that include a date of collection, but not with completely anonymous data that has no collection date.
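As a small illustration of the date-based deletion example, the sketch below filters de-identified records by their collection date. The record layout and the `purge_before` helper are assumptions made for this example.

```python
from datetime import date

# Made-up de-identified records: names are already tagged, dates are kept.
records = [
    {"record_id": 1, "name": "[NAME]", "collected": date(2019, 5, 1)},
    {"record_id": 2, "name": "[NAME]", "collected": date(2023, 2, 14)},
]


def purge_before(rows, cutoff):
    """Keep only the records collected on or after the cutoff date."""
    return [row for row in rows if row["collected"] >= cutoff]


remaining = purge_before(records, date(2020, 1, 1))
print([row["record_id"] for row in remaining])
```

If the `collected` field had been stripped during anonymization, there would be nothing to filter on.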
What Is PII Anonymization?
In an anonymization process, a dataset containing PII is used to create a second dataset with PII removed that is built to resist the re-identification of a single individual contributor.
This differs from de-identification in that de-identified records can still be used to identify individuals based on non-PII. For example, de-identified hospital records may still have information on when a patient was treated and the results of their diagnosis at the time. Using mathematical and cryptographic techniques, anonymization is meant to prevent this kind of simple re-identification.
There are a number of popular anonymization techniques and tools, including:
- Aggregation/K-Anonymity
- Differential Privacy
- Hash Functions
- Noise Addition
- Substitution/Permutation
- Tokenization
Aggregation/K-Anonymity
When an anonymized dataset is created, PII is generalized by representing it as a group or range. For example, an income of $42,500 would become a range of $20,000 – $50,000.
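A rough sketch of generalization is shown below. The band boundaries in `BANDS` are assumptions chosen so that the income example lands in the same range. The `is_k_anonymous` check reflects the usual meaning of k-anonymity: every combination of generalized values must be shared by at least k records.

```python
from collections import Counter

# Assumed income bands; real bands would be chosen to suit the dataset.
BANDS = [(0, 20_000), (20_000, 50_000), (50_000, 100_000)]


def generalize_income(income: int) -> str:
    """Replace an exact income with the band that contains it."""
    if income < 0:
        raise ValueError("income must be non-negative")
    for low, high in BANDS:
        if low <= income < high:
            return f"${low:,} – ${high:,}"
    return f"${BANDS[-1][1]:,}+"


def is_k_anonymous(generalized_rows, k: int) -> bool:
    """True if every generalized value appears in at least k rows."""
    counts = Counter(generalized_rows)
    return all(count >= k for count in counts.values())


rows = [generalize_income(x) for x in (42_500, 30_000, 75_000, 60_000)]
print(rows, is_k_anonymous(rows, 2))
```

Wider bands make it easier to reach a given k but leave less detail for analysis.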
Differential Privacy
Statistical noise is added to the dataset in a way that increases privacy but at the cost of utility. For example, a small number of individual contributions to a dataset may be randomly swapped with generated noise. When the right balance is struck, the dataset has acceptable levels of both privacy and utility.
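One classic mechanism in this family is randomized response, sketched below. It is offered only as an illustration and is not necessarily the mechanism any particular product uses. Each yes/no answer is reported truthfully with probability `p_truth` and otherwise replaced with a coin flip. The function names and the choice of `p_truth` are assumptions.

```python
import math
import random


def randomized_response(truth: bool, p_truth: float, rng: random.Random) -> bool:
    """Report the true answer with probability p_truth, else a fair coin flip."""
    if rng.random() < p_truth:
        return truth
    return rng.random() < 0.5


def estimate_true_rate(reports, p_truth: float) -> float:
    """Undo the expected effect of the noise on the overall 'yes' rate."""
    observed = sum(reports) / len(reports)
    # observed ≈ p_truth * true_rate + (1 - p_truth) * 0.5
    return (observed - (1 - p_truth) * 0.5) / p_truth


def epsilon(p_truth: float) -> float:
    """Privacy parameter of this scheme; smaller means stronger privacy."""
    return math.log((1 + p_truth) / (1 - p_truth))


rng = random.Random(0)
truths = [i % 10 < 3 for i in range(50_000)]  # made-up data: 30% true answers
reports = [randomized_response(t, 0.75, rng) for t in truths]
print(round(estimate_true_rate(reports, 0.75), 3), round(epsilon(0.75), 3))
```

Lowering `p_truth` strengthens privacy for each individual but makes the aggregate estimate noisier, which is the privacy and utility trade-off described above.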
Hash Functions
PII is replaced with fixed-size artificial codes that appear nonsensical. For example, a value of “Joe Smith” is passed through a non-reversible algorithm that generates a string such as “Hnd10NIDw9iiF.” Read more about these techniques in our blog post, “How TripleBlind Compares to Tokenization, Masking, and Hashing.”
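A minimal sketch of hashing PII with Python's standard library follows. It uses a keyed hash (HMAC-SHA256) rather than a bare hash, because names and similar values are easy to guess and an unkeyed hash can be reversed by hashing a list of candidates. The `hash_pii` helper and the example key are assumptions made for illustration.

```python
import hashlib
import hmac


def hash_pii(value: str, secret_key: bytes) -> str:
    """Return a fixed-size, non-reversible code for a PII value."""
    digest = hmac.new(secret_key, value.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()


key = b"example-key-keep-this-secret"  # placeholder; store real keys securely
print(hash_pii("Joe Smith", key))
```

The output is always 64 hexadecimal characters regardless of input length, and the same input with the same key always yields the same code, so hashed columns can still be used to match records.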
Noise Addition
PII is deliberately expressed imprecisely. This involves adding a small amount to, or subtracting a small amount from, the original value.
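As a simple sketch, noise addition for a numeric field can be written as below. The uniform distribution and the `max_offset` parameter are assumptions; other distributions are equally possible.

```python
import random


def add_noise(value: float, max_offset: float, rng: random.Random) -> float:
    """Shift a value by a random amount between -max_offset and +max_offset."""
    return value + rng.uniform(-max_offset, max_offset)


rng = random.Random(1)
noisy_income = add_noise(42_500, 500, rng)
print(round(noisy_income, 2))
```

Because the offsets are centered on zero, averages over many records stay close to the true averages even though each individual value is slightly wrong.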
Substitution/Permutation
PII is shuffled within a table or substituted with random values. For example, the name “Joe Smith” on one record might be swapped with the name “Zachary Zimmer” from another record.
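The shuffling variant can be sketched as follows. The table layout and the `shuffle_column` helper are assumptions made for illustration.

```python
import random


def shuffle_column(rows, column, rng: random.Random):
    """Return copies of the rows with one column's values permuted."""
    values = [row[column] for row in rows]
    rng.shuffle(values)
    return [{**row, column: value} for row, value in zip(rows, values)]


# Made-up records for the example.
patients = [
    {"name": "Joe Smith", "age": 41},
    {"name": "Zachary Zimmer", "age": 29},
    {"name": "Ana Ruiz", "age": 63},
]
shuffled = shuffle_column(patients, "name", random.Random(0))
print(shuffled)
```

The set of names in the table is unchanged, so column-level statistics survive, but a name no longer reliably sits next to its own record. Note that a random permutation can leave some values in place by chance.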
Tokenization
PII is replaced with a non-sensitive token that links back to the original data. For example, the zip code “14201” might be replaced with a token of algorithmically generated characters that can later be used to recover the original value.
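A minimal in-memory sketch of tokenization is shown below. The `TokenVault` class is hypothetical; a real deployment would keep the mapping in a secured, access-controlled store rather than a Python dictionary.

```python
import secrets


class TokenVault:
    """Maps PII values to random tokens and back (illustrative only)."""

    def __init__(self):
        self._value_by_token = {}
        self._token_by_value = {}

    def tokenize(self, value: str) -> str:
        """Return the existing token for a value, or issue a new random one."""
        if value in self._token_by_value:
            return self._token_by_value[value]
        token = secrets.token_hex(8)
        while token in self._value_by_token:  # retry on the unlikely collision
            token = secrets.token_hex(8)
        self._value_by_token[token] = value
        self._token_by_value[value] = token
        return token

    def detokenize(self, token: str) -> str:
        """Recover the original value; only the vault holder can do this."""
        return self._value_by_token[token]


vault = TokenVault()
token = vault.tokenize("14201")
print(token, vault.detokenize(token))
```

Unlike a hash, the token is random and carries no information about the original value. The link exists only inside the vault, so protecting the vault is what protects the data.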
After a dataset has been anonymized, it is difficult for anyone handling it to identify an individual contributor based on the contents of individual records. Anonymization allows a data holder to share a complete dataset with a significantly minimized risk of privacy violations.
However, anonymization is not a foolproof solution. For example, it is possible to de-anonymize data by identifying correlations across multiple datasets. It is also possible to reverse engineer the anonymization technique, unlocking the data that was processed by it.
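The correlation risk can be illustrated with a toy linkage sketch. All records below are made up, and the choice of zip code, birth year, and sex as the linking fields is an assumption made for this example.

```python
# A released dataset with names removed, and a separate public dataset with names.
released = [
    {"zip": "14201", "birth_year": 1980, "sex": "M", "diagnosis": "asthma"},
    {"zip": "14202", "birth_year": 1975, "sex": "F", "diagnosis": "diabetes"},
]
public = [
    {"name": "Joe Smith", "zip": "14201", "birth_year": 1980, "sex": "M"},
    {"name": "Zachary Zimmer", "zip": "14203", "birth_year": 1990, "sex": "M"},
]
LINK_FIELDS = ("zip", "birth_year", "sex")


def link(released_rows, public_rows):
    """Match released rows to names when the linking fields are unique."""
    names_by_key = {}
    for row in public_rows:
        key = tuple(row[field] for field in LINK_FIELDS)
        names_by_key.setdefault(key, []).append(row["name"])
    matches = {}
    for row in released_rows:
        key = tuple(row[field] for field in LINK_FIELDS)
        names = names_by_key.get(key, [])
        if len(names) == 1:  # exactly one candidate means re-identification
            matches[names[0]] = row["diagnosis"]
    return matches


print(link(released, public))
```

Neither dataset reveals a named diagnosis on its own. The combination does, which is why generalizing such linking fields matters as much as removing direct identifiers.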
A Paradigm Shift in De-Identification
While de-identification and anonymization techniques provide some privacy and security, additional measures can make for a more comprehensive approach.
TripleBlind can address many of the limitations associated with de-identification and anonymization. PII safeguarding technology from TripleBlind offers the following benefits:
- TripleBlind supports security and privacy for all parties. Our technology allows multiple data partners to collaborate on sensitive data in a one-way encrypted space, safeguarding data partners’ sensitive information and processing algorithms.
- TripleBlind offers the ability to process complex data. While many privacy-enhancing techniques are focused on text, our technology allows for the processing of images, video, voice recordings, genomic data, and unstructured data.
- TripleBlind allows data gathering organizations to maintain possession of their data at all times. Our system uses one-way, one-time encryption. This ensures data can only be used by authorized parties. Our technology allows organizations to keep their data in-house, addressing data residency issues.
TripleBlind’s solution addresses a wide range of use cases, enabling secure collaboration on sensitive data. If you would like to learn more about how Blind Compute offers next-generation security and privacy, contact us today to schedule a demo.
TripleBlind is built on novel, patented breakthroughs in mathematics and cryptography, unlike other approaches built on top of open source technology. The technology keeps both data and algorithms in use private and fully computable.