The increasing adoption of machine learning and artificial intelligence, both of which are fueled by data, has driven a massive need for access to more and better data, including everything from consumer data to medical test results. While improved data collection and use can be incredibly valuable (consider how healthcare data can drive lifesaving innovations), privacy concerns must be addressed proactively to avoid potential risks and liabilities.
The leaking of private data like medical outcomes and Social Security numbers can be devastating both for individuals and the companies charged with keeping this data safe.
A simple approach for protecting private data is known as anonymization. It involves removing fields that contain identifying information from personal records, such as stripping the name from a medical record. Unfortunately, this approach is rarely sufficient to ensure the privacy of individuals, because the remaining information may still be unique enough to identify someone. For instance, a supposedly anonymous patient can often be identified from a nameless medical record that lists a rare diagnosis alongside an approximate location.
Differential Privacy
The term “differential privacy” describes a mathematically rigorous approach to maintaining individual privacy. It has been designed to ensure that individuals who allow their information to be included in a larger dataset will experience a negligible impact on their lives.
So, what determines whether a system provides differential privacy, a concept that grew out of algorithmic research on digital privacy and was formalized in 2006? Consider this scenario: we have two datasets with thousands of personal records that must be analyzed, but one of them is missing the record of one John Doe. Differential privacy means an analysis of either dataset will produce essentially the same insights regardless of whether the John Doe record is included. Thus, it cannot reasonably be determined whether John Doe’s record exists in either dataset, which helps maintain his personal privacy.
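This guarantee has a precise mathematical form. A randomized analysis mechanism M is said to be ε-differentially private if, for any two datasets D and D′ that differ in a single record (with or without John Doe, say) and any set of possible outputs S:

```latex
\Pr[\mathcal{M}(D) \in S] \;\le\; e^{\varepsilon} \cdot \Pr[\mathcal{M}(D') \in S]
```

The parameter ε, discussed in the next section, bounds how much the presence or absence of any one record can shift the probability of any outcome; the smaller ε is, the harder the two datasets are to tell apart.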
In practice, differential privacy techniques add randomized “noise” to protect individual data, reducing the risk of privacy violations and allowing organizations to leverage large sets of sensitive data for commercial or research purposes. For example, differential privacy is used to power Apple keyboard suggestions: Instead of focusing on what any individual is typing, Apple collects what its users are typing collectively, in order to provide better suggestions.
Differential privacy also allows for the automation of cloud-based, private data processing, enabling differentially private data sharing among organizations located around the world.
How It Works
Differential privacy techniques aim to ensure the privacy of individual contributors by strategically adding noise to the dataset or to the output of an analysis. This statistical noise helps to ensure privacy but does not prevent data consumers from conducting useful analyses on the private data.
In one approach to differential privacy, stochastically generated data points (in other words, random variables) are injected throughout a dataset to serve as the “noise.” The amount of noise introduced has a direct relationship to the level of privacy and is governed by the privacy parameter ε, called the privacy loss or privacy budget. The inclusion of noise also reduces the accuracy of an analysis.
Thus, the smaller the value of ε, the more noise is added and the less accurate the results become; in exchange, the results are less likely to be distinguishable based on any single record. Conversely, larger values of ε correspond to less noise and more accurate results, but results that are more easily distinguished based on individual records.
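To make this concrete, here is a minimal sketch of the Laplace mechanism, one standard way of calibrating noise to ε. The counting query and parameter values are hypothetical; the point is that the noise scale grows as ε shrinks.

```python
import numpy as np

def laplace_mechanism(true_value: float, sensitivity: float, epsilon: float) -> float:
    """Return a noisy, differentially private estimate of a query result.

    The noise is drawn from a Laplace distribution with scale sensitivity/epsilon,
    so a smaller epsilon (more privacy) means larger noise (less accuracy).
    """
    return true_value + np.random.laplace(loc=0.0, scale=sensitivity / epsilon)

# Hypothetical counting query: how many patients have a given diagnosis?
# Adding or removing one record changes a count by at most 1, so sensitivity = 1.
true_count = 1234
for eps in (0.1, 1.0, 10.0):
    noisy = laplace_mechanism(true_count, sensitivity=1.0, epsilon=eps)
    print(f"epsilon={eps}: noisy count = {noisy:.1f}")
```

Running this repeatedly shows results at ε = 0.1 swinging by tens of counts, while at ε = 10 they stay within a fraction of a count of the true value.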
Given the privacy-accuracy tradeoff, determining the value of ε for a particular application can be challenging. Setting the appropriate level of privacy loss should depend on the nature of the analysis and the sensitivity of the data.
This tradeoff is a significant consideration for anyone looking to implement differential privacy in a data workflow. Because more privacy always means less utility, it also makes it difficult to form standards around appropriate values of ε.
Benefits of Differential Privacy
Differential privacy offers a number of valuable benefits, making it a useful method for analyzing sensitive data while maintaining some level of privacy. Some benefits include:
- It avoids some of the issues related to de-identification. Under differential privacy, all information is treated as identifying information. This can help to ease the challenge of determining the identifying nature of each data element for a set of records.
- It quantifies privacy loss. A key feature of differential privacy is the ability to quantify privacy loss, which allows for comparisons among multiple analysis techniques. And because privacy loss is quantified, it can be controlled, giving analysts a handle on the tradeoff between privacy and accuracy.
- It is compositional. When two differentially private analyses are conducted on a dataset, the cumulative privacy loss is calculated by simply adding the privacy losses of each analysis. This means privacy loss can be tracked and controlled over the course of multiple computations, so complex differentially private algorithms can be built from smaller building blocks, as the sketch after this list illustrates.
- It can offer a high degree of privacy. Because its guarantee holds no matter what supplemental information an attacker may possess, differential privacy is not susceptible to the linking attacks that are effective against de-identified data. It is also well suited to limiting privacy loss for groups of records, such as those of families.
- It is immune to post-processing. A dataset that has been made differentially private remains differentially private through subsequent operations. Without knowledge of the original private data, no one working with a differentially private dataset can make it less private.
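To make the composition property concrete, here is a minimal sketch reusing the hypothetical laplace_mechanism helper from earlier: two differentially private counting queries are run over the same records, and their privacy losses simply add.

```python
import numpy as np

def laplace_mechanism(true_value, sensitivity, epsilon):
    return true_value + np.random.laplace(scale=sensitivity / epsilon)

# Two hypothetical counting queries over the same set of patient records.
eps_1, eps_2 = 0.5, 0.5
noisy_diagnosis_count = laplace_mechanism(1234, sensitivity=1.0, epsilon=eps_1)
noisy_over_65_count = laplace_mechanism(2410, sensitivity=1.0, epsilon=eps_2)

# By basic (sequential) composition, the cumulative privacy loss is the sum:
# releasing both results together is (eps_1 + eps_2)-differentially private.
total_privacy_loss = eps_1 + eps_2  # 1.0
print(noisy_diagnosis_count, noisy_over_65_count, total_privacy_loss)
```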
Drawbacks of Differential Privacy
As noted above, differential privacy involves a tradeoff between accuracy and ensuring privacy. This tradeoff is often unacceptable, especially when working in highly regulated spaces like healthcare. The more privacy added, the less usable the data becomes, which can hinder physicians’ and researchers’ abilities to diagnose patients or discover new insights. The approach leaves too much to the judgment of the human using the technology and does not eliminate the heavy reliance on trust.
Furthermore, differential privacy cannot be applied to all types of data. Real-world data often comes in complex and unstructured forms, and differentially private data sharing is useless for genomic, video, audio, and other such data types.
While differential privacy can be effective for very large datasets, it becomes less and less effective as datasets become smaller.
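The reason is visible in the earlier hypothetical sketch: the noise scale depends only on the query’s sensitivity and ε, not on the size of the dataset, so the same absolute noise is proportionally far larger for small counts.

```python
import numpy as np

scale = 1.0 / 0.1  # sensitivity 1, epsilon 0.1 -> Laplace noise with scale 10

for true_count in (1_000_000, 1_000, 10):
    noise = np.random.laplace(scale=scale)
    print(f"true count {true_count}: relative error ~ {abs(noise) / true_count:.1%}")
```

A typical draw at this scale perturbs a count of a million by a negligible fraction of a percent, but can overwhelm a count of ten entirely.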
Differential privacy can be computationally demanding. The amount of computational resources needed in some cases can make it an impractical approach.
A Better Approach
TripleBlind’s encrypted-in-use approach avoids the issues associated with differential privacy.
The TripleBlind Solution does not remove or replace data; it maintains full data fidelity, enabling precise computational outcomes. Its key features include:
- A superior ability to process sensitive data. In addition to processing text, images, video, and voice recordings, our technology enables the processing of complex genomic and unstructured data.
- Supporting security and privacy for all parties. Our solution allows data partners to compute on sensitive data in a one-way encrypted space, safeguarding both the data and the processing algorithms.
- The ability to let data-gathering organizations keep their data. Our solution uses one-way, one-time encryption keys, with a new key generated for each access, ensuring data can only be used for authorized purposes. Because companies keep their data in-house, behind their own firewalls, it also addresses data residency issues.
TripleBlind’s software-only API addresses a wide range of use cases, allowing organizations to unlock more value from sensitive data while ensuring privacy and maintaining compliance. If you want to learn more about how TripleBlind’s solution offers a number of advantages over differential privacy, contact us today to schedule a demo.