Data-Centric AI Data Quality Framework — Making Data Quality Systematic
For years, improving artificial intelligence algorithms was seen as the way forward when it came to optimizing AI performance, but this legacy approach is quickly becoming outdated.
According to prominent AI expert Andrew Ng, trying to improve an AI system by focusing on the model results in marginal gains, if any. This is partly due to increased commodification and availability of well-performing AI models. The way forward, Ng argues, is a data-centric approach to AI, but this approach isn’t without its challenges.
One of the biggest challenges facing a data-centric approach to AI is developing a solid framework for data quality. Machine learning is often assumed to need large datasets containing millions of records, but many available training datasets include only a few thousand examples or fewer. Indeed, many real-world or potential applications of AI are unlikely ever to involve training datasets with millions of examples.
However, it is difficult to say how large an AI training dataset needs to be. Ng argues that small and mid-sized datasets are sufficient to train a good AI system, but that we need to transition from “big data” to “good data.”
When it isn’t possible to amass a sizable dataset, Ng says, AI developers should focus on collecting data that is defined consistently, covers critical cases, includes timely feedback, and is representative. “Good data” also depends on factors that are domain- or application-specific, such as healthcare data that adheres to privacy regulations.
Essentially, the shift from big data to good data will require systematic approaches to data quality.
The Key to a Data-Centric AI Data Quality Framework
If data-centrism in AI development requires high-quality data, then developers must ensure that data points are labeled consistently. If different labelers use different labeling conventions, the algorithm’s ability to learn suffers, and with it the quality of the model’s output.
A systematic approach to labeling recently championed by Ng starts with two subject matter experts independently labeling a dataset. After the dataset has been labeled, the consistency between labelers is quantitatively measured to discover where they agree or disagree. In instances where labelers disagree, the labeling instructions are revised until the labelers perform consistently with each other.
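The measurement step can be sketched in a few lines of Python. This is a minimal illustration, not Ng's own procedure: it assumes Cohen's kappa as the agreement metric (one common choice; the approach described above does not name a specific metric), and the function names and example labels are hypothetical.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Agreement between two labelers, corrected for chance agreement.

    Assumption: Cohen's kappa is used here only as one common metric.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("Both labelers must label the same non-empty dataset")
    n = len(labels_a)
    # Fraction of items on which the two labelers gave the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance, from each labeler's label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both labelers used a single identical label throughout
    return (observed - expected) / (1 - expected)


def disagreements(labels_a, labels_b):
    """Indices of items the labelers disagree on, for instruction review."""
    return [i for i, (a, b) in enumerate(zip(labels_a, labels_b)) if a != b]


# Hypothetical labels from two subject matter experts on six items.
expert_a = ["defect", "ok", "ok", "defect", "ok", "ok"]
expert_b = ["defect", "ok", "defect", "defect", "ok", "ok"]

kappa = cohens_kappa(expert_a, expert_b)
to_review = disagreements(expert_a, expert_b)
```

Items listed in `to_review` are the ones whose labeling instructions would be revisited; the score is then recomputed after each revision to check whether consistency has improved.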
A Data-Centric AI Data Quality Toolkit
Labeling training data for real-world use cases requires subject matter experts, but these experts are often busy with their primary responsibilities and have little time to dedicate to manually labeling data points. This is particularly true for subject matter experts working in time-intensive fields like healthcare. Additionally, data and objectives can change after an AI system is deployed. For example, a new study on MRI could prompt a reassessment of an AI system for medical imaging. When something like this happens, the training data may have to be relabeled.
For massive datasets, the need to manually label and relabel can be a non-starter for most organizations. One automated approach to data quality, being developed by Snorkel AI, is called programmatic labeling. In a simple example involving a text-based AI system, a subject matter expert would write out a few key phrases, which are then turned into labeling functions that label data points iteratively. The company says it is currently developing its programmatic platform for “rapid, data-centric, iterative AI development. In other words, it revolves around modifying, labeling, and managing your data.”
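The key-phrase idea can be illustrated with a short Python sketch. It is not Snorkel's API: the helper names, phrases, and labels are hypothetical, and the labeling functions are combined by a simple majority vote, which is a simplifying assumption rather than how any particular platform resolves conflicts.

```python
from collections import Counter

ABSTAIN = None  # a labeling function returns this when it has no opinion


def make_keyword_lf(phrase, label):
    """Build a labeling function that votes `label` when `phrase` appears."""
    def lf(text):
        return label if phrase in text.lower() else ABSTAIN
    return lf


def majority_label(text, lfs):
    """Combine labeling-function votes; abstain on no votes or a tie."""
    votes = Counter(v for v in (lf(text) for lf in lfs) if v is not ABSTAIN)
    if not votes:
        return ABSTAIN
    top = votes.most_common(2)
    if len(top) == 2 and top[0][1] == top[1][1]:
        return ABSTAIN  # conflicting evidence: leave for a human to review
    return top[0][0]


# Hypothetical key phrases an expert might supply for support tickets.
lfs = [
    make_keyword_lf("refund", "billing"),
    make_keyword_lf("charged twice", "billing"),
    make_keyword_lf("password", "account"),
]

tickets = [
    "I was charged twice and want a refund",
    "I forgot my password",
    "Where is your office?",
]
labels = [majority_label(t, lfs) for t in tickets]
```

Tickets that come back as `ABSTAIN` are the ones the current phrases do not cover. Adding or refining a phrase and rerunning the functions relabels the whole dataset without anyone touching individual records by hand.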
The goal of the Snorkel platform and others like it is to let AI developers scale up with unlabeled data as rapidly as they could with labeled data. While skeptics may be concerned about issues like labeling errors and bias, Snorkel says its approach to data-centric AI data quality and reliability has been empirically proven to save person-months, at or above quality parity, in more than 50 peer-reviewed publications.
Supporting Data-Centric AI Data Quality and Safety with TripleBlind
One of the biggest obstacles to data-centric AI development is access to data. TripleBlind’s privacy-enhancing technology helps developers break down obstructive data silos for increased access to valuable datasets. With the TripleBlind Solution, developers can access sensitive data, including healthcare and financial data, while keeping individual privacy sacrosanct and remaining compliant with data privacy regulations like GDPR and HIPAA.
If your company is moving toward a data-centric approach to AI, you should find out how TripleBlind can facilitate the switch.