Data Problems: Data Prep, Data Bias, and Safe Distribution
Data is being produced at an unprecedented scale, and while companies can clearly benefit from this, it is important for anyone looking to analyze data to be aware of the problems that must be addressed. Companies that work with machine learning and analytics are particularly challenged by a host of data issues — including data access, data prep requirements, data bias and compliance.
A 2019 report from Dun & Bradstreet found three main problems related to using data: processing data, ensuring accuracy, and protecting privacy. The report included a survey of more than 500 American and British business leaders. One-fifth of respondents said they had lost a client due to data problems, and almost one-quarter said their financial forecasts had been wrong. Clearly, there is room for improvement in data processing, accuracy, and privacy. Below are a few ways in which companies can address these problems.
Data Prep
In modern business, data used for analytics is often third-party data, collected from outside sources. The result of collection efforts is often data in a variety of formats. Third-party data may be incomplete or contain errors. Sometimes sections of data collected from a third party do not pertain to the analysis being performed.
A strong data prep workflow can address many of the issues that are inherent to the collection of data for analysis. This process should be designed to produce data that is clean, consistent, complete for purpose, current, and accompanied by essential context.
The first step in any data prep system is data profiling, which involves a preliminary examination and summary of the data that highlights any quality issues. Specific data profiling tools such as Quadient DataCleaner and DataMatch Enterprise have been designed to assess datasets for cleanliness, completeness, and consistency. When done properly, data profiling sets the stage for future data prep steps by highlighting any potential issues within a dataset.
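To make the idea of profiling concrete, here is a minimal sketch in plain Python. It is not how the tools named above work internally; the profile function, the sample records, and the column names are all hypothetical and exist only for illustration. The sketch counts missing values, distinct values, and observed value types per column, which is often enough to surface incomplete or inconsistent fields.

```python
from collections import Counter


def profile(rows):
    """Summarize each column: missing count, distinct count, observed types."""
    columns = sorted({key for row in rows for key in row})
    report = {}
    for col in columns:
        values = [row.get(col) for row in rows]
        # Treat None and empty strings as missing (an assumption for this sketch).
        present = [v for v in values if v not in (None, "")]
        report[col] = {
            "missing": len(values) - len(present),
            "distinct": len(set(present)),
            "types": dict(Counter(type(v).__name__ for v in present)),
        }
    return report


# Hypothetical third-party records with typical quality problems:
# a blank country, a revenue stored as text, a missing revenue,
# and a repeated customer_id.
records = [
    {"customer_id": 1, "country": "US", "revenue": 1200.0},
    {"customer_id": 2, "country": "", "revenue": "950"},
    {"customer_id": 3, "country": "UK", "revenue": None},
    {"customer_id": 3, "country": "UK", "revenue": 400.0},
]

report = profile(records)
for column, stats in report.items():
    print(column, stats)
```

In this sample, the report would flag one missing country, one missing revenue, a revenue column that mixes text and numeric values, and fewer distinct customer IDs than rows. Each of these is a prompt for a follow-up question rather than an automatic fix.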
After data profiling has been performed, data issues must be addressed based on the expected use case and business-related factors, such as agreements with data providers and associated costs. Even after steps are taken to address issues, data profiling should be considered an ongoing endeavor that can identify any emerging problems within the dataset.
After initial quality-related issues have been identified and addressed, the next step in the data prep process is enrichment. This step typically involves determining business metrics and key performance indicators. It also involves filtering data according to relevant business factors and augmenting data using additional sources.
Because the enrichment process can involve many different — and often subjective — factors, it can be a complicated process. The process can be simplified by clearly identifying business needs and the main purpose of analysis. These foundational principles should be used to identify metrics for the enrichment and potential additional sources of data. Finally, data filters, business factors, and calculations must be identified and then used to produce enriched data.
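The three enrichment activities described above (filtering, calculating metrics, and augmenting from an additional source) can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the order records, the revenue threshold, the margin metric, and the region_owners lookup are hypothetical, and a real pipeline would derive them from the business needs identified first.

```python
# Hypothetical cleaned order records.
orders = [
    {"order_id": 1, "region": "EU", "revenue": 120.0, "cost": 80.0},
    {"order_id": 2, "region": "US", "revenue": 200.0, "cost": 150.0},
    {"order_id": 3, "region": "EU", "revenue": 50.0, "cost": 45.0},
]

# Hypothetical additional data source, keyed by region.
region_owners = {"EU": "Team A", "US": "Team B"}


def enrich(rows, extra, min_revenue):
    """Filter rows, add a calculated metric, and augment from another source."""
    enriched = []
    for row in rows:
        # Filter: keep only rows relevant to the business question.
        if row["revenue"] < min_revenue:
            continue
        new_row = dict(row)  # copy so the source data is left untouched
        # Calculation: derive a metric (margin), guarding against zero revenue.
        if row["revenue"]:
            new_row["margin"] = (row["revenue"] - row["cost"]) / row["revenue"]
        else:
            new_row["margin"] = None
        # Augmentation: attach context from the additional source.
        new_row["owner"] = extra.get(row["region"], "unknown")
        enriched.append(new_row)
    return enriched


enriched = enrich(orders, region_owners, min_revenue=100.0)
for row in enriched:
    print(row)
```

Keeping the filter threshold and the lookup table as explicit arguments makes the subjective choices visible, so they can be reviewed against the stated purpose of the analysis.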
Data Bias
Because computers lack human emotions, we tend to think of them as free from human bias. But ironically, computer technology and artificial intelligence can be more biased than humans. This is because computers, their inputs, and their applications are created by humans and are capable of acting on any inherent biases with far more speed and power than humans can.
Data analysis is based on the collection of data, and data is often collected by human-built systems. If the data collected is not representative, such as patient data drawn only from wealthier or demographically homogeneous groups, it will be biased. Because bias is a subconscious part of our own thought processes, it’s common for these biases to slip into data collection processes. Furthermore, identifying something as “biased” is often a subjective exercise in and of itself.
Data analysis techniques can also introduce bias in ways that negatively impact results. For instance, an analysis could be set up with confirmation bias, which is the idea that a preconceived notion can affect how data is considered or results are interpreted. Selection bias is another type of predisposition that involves selecting data sources that don’t represent the target group.
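One rough way to look for selection bias is to compare how groups are represented in a sample against reference shares for the target group. The sketch below assumes such reference shares are available and trustworthy, which is itself a judgment call. The group labels, the shares, and the 10-percentage-point threshold are hypothetical values chosen for illustration.

```python
from collections import Counter


def representation_gaps(sample_groups, population_shares):
    """Return observed share minus expected share for each group."""
    if not sample_groups:
        raise ValueError("sample_groups must not be empty")
    counts = Counter(sample_groups)
    total = len(sample_groups)
    return {
        group: counts.get(group, 0) / total - expected
        for group, expected in population_shares.items()
    }


# Hypothetical sample: income band recorded for 100 patients.
sample = ["high"] * 70 + ["middle"] * 25 + ["low"] * 5

# Hypothetical reference shares for the target population.
population = {"high": 0.2, "middle": 0.3, "low": 0.5}

gaps = representation_gaps(sample, population)

# Flag groups whose share differs by more than 10 percentage points
# (an arbitrary threshold for this sketch).
flagged = [group for group, gap in gaps.items() if abs(gap) > 0.10]
print(gaps)
print(flagged)
```

A check like this only detects imbalance on the attributes that were measured. It does not show that a dataset is unbiased, and it says nothing about confirmation bias in how results are interpreted.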
Another type of bias is the poor interpretation of outliers. An example of this is often seen when discussing average salaries for a position, where one or two people earning high salaries can significantly skew the average pay rate. Data bias can never be completely eliminated, but steps can be taken to identify and address it. This starts with the recognition that bias exists, both in the data and in the analysis.
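The salary example can be illustrated with a short calculation. The figures below are invented for the sketch. Comparing the mean with the median shows how a single high earner pulls the average away from what most people in the position are paid.

```python
from statistics import mean, median

# Hypothetical salaries for one position; the last value is an outlier.
salaries = [52_000, 55_000, 58_000, 60_000, 61_000, 64_000, 450_000]

average_salary = mean(salaries)
median_salary = median(salaries)

print(f"mean:   {average_salary:,.0f}")
print(f"median: {median_salary:,.0f}")

# The mean exceeds every salary except the outlier itself.
print(average_salary > max(salaries[:-1]))
```

Here the mean is roughly 114,000 while the median is 60,000. Reporting both, or at least checking how far apart they are, is a simple safeguard against misreading outliers.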
One step toward addressing bias in artificial intelligence systems is to ensure there is adequate context being provided for analysis. Another key step is to ensure AI models are trained with business impact in mind, as opposed to purely focusing on accuracy. This prevents models from reaching conclusions that are accurate, but unrealistic from a business perspective.
Data Distribution
Teams and organizations that work with data are operating in an increasingly distributed environment. Data is often shared among different groups, and this presents its own set of challenges. Furthermore, data analysts often need access to datasets that are restricted by regulations and bureaucracy.
One of the biggest challenges related to the safe distribution of data is remaining compliant with various government regulations. In the United States, for instance, the Health Insurance Portability and Accountability Act places strict limits on how personal healthcare information can be shared. In Europe, the General Data Protection Regulation (GDPR) places strict limits on how any type of personal information can be shared.
Remaining compliant with these laws and ones like them starts by being informed and knowledgeable. Ignorance is not an excuse for being found non-compliant, and the potential penalties can be devastating to both organizations and individuals. Legal firms specializing in compliance should be consulted regularly in any data distribution situation.
Another major issue related to data distribution is accessibility. Data is often locked away in silos and this compartmentalization can seriously inhibit the ability to collect and analyze data. Internal silos can be addressed by creating a unified data system. External silos can be broken down through the use of secure, compliant data distribution tools.
Simplify Data Collaboration and Analysis with TripleBlind
The TripleBlind Solution radically improves the practical use of privacy-preserving technology by adding true scalability and faster processing, with support for all data and algorithm types.
While our solution can be applied across a wide range of use cases, it has been particularly valuable in the heavily regulated financial and healthcare industries. If your company is in the market for an innovative solution that compares favorably with other privacy-preserving technologies, such as homomorphic encryption, synthetic data, and tokenization, contact us today to learn more.