How to Overcome Bias in Big Data

There’s been an explosion of interest in Big Data for digital health platforms in recent years, particularly as so much more data gets digitized.  

This transition means more information is available to data scientists to find novel insights, drive clinical decision-making, and support or improve how we deliver healthcare.

However, big data bias is woven into this process from start to finish.

Suraj Kapa, SVP of healthcare and government at TripleBlind, recently hosted a webinar with a team of expert panelists, with the goal of providing an understanding of bias in big healthcare data. This includes learning what bias in data can look like, understanding the related issues (especially in regard to digital health platforms), and examining how digital health tools can be scalably deployed to society in general.

This last part is becoming an increasingly important issue, especially as the resources and technology in the industry have started to mature enough to allow for this kind of deployment on a larger scale.

Keep reading to learn a few of the highlights, or watch the webinar for the full discussion.


Are you overestimating your ability to detect big data biases?

Most people, data scientists included, have limited knowledge of just how easily datasets can become biased, or where that bias can come from.

When considering this, some key questions need to be addressed: where's the data coming from? How is it being used? How is it being integrated? All of this factors into how bias gets introduced into these processes, which can actually widen the disparities of care that already exist.

Big data bias can hit every part of the data scientist's world. It starts with data-gathering bias, which is easy enough to think about: is the population fully represented in the training dataset, or in the validation dataset used to approve the algorithm (by the FDA or another regulatory authority)?

This might be solved by having diverse enough datasets, but there are also data analysis biases and data application biases, both of which are critical to address.

Here are a few considerations we need to take into account:


The best analysis is limited to a fortunate few players

Data has helped concentrate power within the digital economy. Take, for instance, the institutions (especially in healthcare) with mature data platforms, where the data is truly integrated and silos have been broken down. These represent a small fraction of the much larger ecosystem where the data actually exists. This risks concentrating attention on the few institutions whose data platforms have matured enough to translate their data into the digital economy.


Spurious correlations are becoming a bigger problem

The data economy is changing our approach to accountability, from one based on direct causation to one based on correlation. As scientists, we don't just want correlation: we want causation, so we can do further studies and analyses, tweaking or removing variables to see if the results shift one way or the other.

When drawing a conclusion, we want to understand that this is a causal factor as opposed to a correlative factor.

But when we assume causation from correlation, all we have is a massive dataset with limited explainability. So if there's bias in how we obtain the result, that bias gets magnified in the algorithm, potentially leading to worse care outcomes, especially in specific populations where there's less data or that aren't as well represented.

It’s important to remember that we sometimes assume if we just have enough data, we’ll be free of bias. But the reality is, data systems often mirror the existing social world.

Take, for example, gender bias in one of Amazon’s algorithms. 

Amazon built an AI algorithm to screen resumes, because they wanted to quickly identify a good resume versus a bad one. But the algorithm tended to overemphasize masculine words and downgraded resumes associated with female applicants. This led Amazon to shut down the algorithm in 2017.

This is a prime example of the type of problematic correlation that can affect our algorithms if we’re not careful. Remember, humans determine which data is captured. We create algorithms and assemble datasets, and since humans are biased, this bias is naturally built into the process.

These problems are subtle, and potentially very hard to measure or account for when you’re trying to avoid bias. Beyond gender, there are other clear factors such as demographic bias, ethnicity bias, and socioeconomic bias. 

Imagine what this might look like in healthcare. If you go to Iceland and build a novel genetic disease algorithm, could you then apply it to patients in Nigeria, India, or China? You'd first need to figure out whether an additional step of validation is required to ensure the algorithm is truly applicable. Otherwise, you risk creating worse health outcomes in patients by over- or under-representing disease risk.
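This population-shift problem can be simulated in a few lines. The sketch below (all numbers are hypothetical) "learns" a single diagnostic biomarker threshold from pooled data dominated by population A, then measures accuracy separately for A and for an underrepresented population B whose biomarker levels sit higher:

```python
import random

random.seed(0)

# Illustrative sketch with made-up numbers: population A dominates the
# pooled training data, while healthy and sick biomarker levels both sit
# higher in the underrepresented population B.
def make_patients(group, n, healthy_mean, sick_mean):
    patients = []
    for _ in range(n):
        sick = random.random() < 0.5
        mean = sick_mean if sick else healthy_mean
        patients.append((group, random.gauss(mean, 1.0), sick))
    return patients

data = (make_patients("A", 950, healthy_mean=0.0, sick_mean=3.0)
        + make_patients("B", 50, healthy_mean=2.0, sick_mean=5.0))  # under-represented

def accuracy(threshold, patients):
    return sum((score > threshold) == sick
               for _, score, sick in patients) / len(patients)

# Brute-force the single best threshold on the pooled data; it lands
# near the midpoint of population A's distributions because A dominates.
best = max((t / 10 for t in range(-20, 70)), key=lambda t: accuracy(t, data))

acc_a = accuracy(best, [p for p in data if p[0] == "A"])
acc_b = accuracy(best, [p for p in data if p[0] == "B"])
print(f"threshold={best:.1f}  accuracy A={acc_a:.2f}  accuracy B={acc_b:.2f}")
```

The pooled threshold classifies most healthy B patients as sick, so B's accuracy lags well behind A's: exactly the kind of gap an extra validation step on the new population would catch.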


The application of biased algorithms

We need to consider the application of these algorithms once they're approved, and once they're delivered through digital platforms.

These sources of bias exist throughout the data life cycle, even from the funding that led to the creation of the data assets that you’re using. 

Say you created algorithms based on every clinical trial participant that's ever been in a clinical trial globally, because of course that data's very robust, right? It's not perfectly clean, but it's cleaner than real-world data. It seems reasonably representative of the population in general, and you can get enough patients to have ample data.

The problem is, there are very specific types of people who tend to engage in clinical trials, and they're not reflective of the broader population.

So how can we solve for this? This is what we get into with the panel.

Solving for data bias issues: panelist discussion

In the webinar, we brought on expert panelists to discuss these issues. (We've paraphrased and truncated these quotes for clarity and brevity.) Watch the webinar for the complete explanations from each panelist.


Insights from Aashima Gupta (Director of Global Healthcare Solutions at Google Cloud) 

On the cloud industry’s role in understanding how to scale appropriately to mitigate the bias issues.

There are three key things to emphasize. The first is having a common set of principles, making sure products and partnerships go through that AI review consistently, in a repeatable fashion. 

Across Google, we work with AI using a common set of principles. There are eight of them and they govern all of our work, including in healthcare, but they are common across all domains. Much of our product development work goes through that repeatable process of applying the AI principles in the context of the product, and the product in the context of the customer work.

The second is around explainability, and sharing that as a community resource.

Consider the nutritional content in our food or medication. Regardless of the use case, we rely on information to make responsible decisions. But what about AI? Despite its ability and potential to transform so much of the way we work and live, machine-learning models are often distributed without a clear understanding of how they function.

So what we have built in that context is a framework called a model card or data card, and we share a common vision with the industry. Model cards are not a Google product; they're a framework we've shared with the industry to define the explainability of a model.

Say, for example, we have a model that performs consistently across a diverse range of people. That model, in the framework we've built, has come from years of research within Google, and the card helps define the conditions in which the model breaks, showing which data elements have gone in to affect it. It ties back to the nutrition label: what is the protein, the sugar, the carbs? Same thing here: what is the diversity of the data?
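To make the "nutrition label" analogy concrete, here is a minimal sketch of the kind of information a model card captures. All field names, model names, and numbers below are illustrative inventions, not the actual Model Card schema:

```python
# Hypothetical model card: every name and number here is illustrative,
# not Google's Model Card schema or a real model.
model_card = {
    "model_details": {
        "name": "afib-risk-classifier",  # hypothetical model
        "version": "0.3.1",
        "intended_use": "Flag elevated atrial-fibrillation risk from ECG features",
    },
    "training_data": {
        "source": "single-center retrospective cohort",    # where the data came from
        "n_patients": 12_000,
        "demographics": {"female": 0.41, "male": 0.59},    # diversity of the data
    },
    "performance": {
        # Report metrics per subgroup, not just in aggregate.
        "auroc_overall": 0.88,
        "auroc_by_group": {"female": 0.84, "male": 0.90},
    },
    "limitations": [
        # The conditions in which the model breaks.
        "Not validated on pediatric patients",
        "Performance degrades on ECGs sampled below 250 Hz",
    ],
}

# A simple consumer-side check: flag deployment if any subgroup lags
# the overall metric by more than a chosen tolerance.
gap = max(model_card["performance"]["auroc_overall"] - v
          for v in model_card["performance"]["auroc_by_group"].values())
assert gap <= 0.05, "subgroup performance gap exceeds tolerance"
print(f"max subgroup gap: {gap:.2f}")
```

The point is that a downstream consumer can read the card mechanically, just as they'd scan a nutrition label, before deciding whether the model fits their population.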

And the third emphasis is about machine learning operations. 

When building the model, you need to think about AI operationalization. There's a significant bottleneck in operationalizing machine-learning platforms effectively: only one in two organizations has moved beyond pilots and proofs-of-concept at scale. So how do you make an adjustment when a new set of data comes in? How does that change your machine learning data pipeline?

This problem is much bigger than Google alone, so this is where partnership with academia, with customers, and with the ecosystem at large will be important.


Insights from Brian Anderson (Chief Digital Health Physician for MITRE)

On the intersection of industry, academia, and government in establishing standards.

A lot of my work centers around guardrails. We recently started a coalition called the Coalition for Health AI, or CHAI, which brings together key stakeholders (academia, industry, and the public sector) around a common mission: to set standards or guardrails, and to promote trustworthiness and transparency in models and their applicability.

This is to further develop algorithms and models that are useful for all of us, and that are trained on data that is inclusive of all of us.

Part of building any kind of standard involves a coalition of the willing and a critical mass of implementers. And so, in CHAI, what we are attempting to do is pull together industry stakeholders (like Google and Microsoft) and take those bodies of work that organizations are developing and publishing across the industry.

From there, we then look to develop a kind of agreeable framework that we can all as a coalition say, “yes, this is what we are going to move forward with to address data bias, or to address human application bias.” Or “this is how we are going to approach testability or promote transparency in algorithm development.” 

And then to have a real technical framework that is implementable in industry.

And it's that critical part of actually implementing these technical frameworks that then provides that iterative feedback into a standards process, and builds the kind of coalition and adoption curve that you like to see.

These things tend to start with academia and private sector industry coming together. And the government sees its promise, and gets behind it. From there, they may be able to offer some input and some advice on some of the equities and concerns from a government standpoint.


Insights from Daniel Kraft (Faculty Chair for Medicine at Singularity University)

The VC industry leader perspective on digital health, and a framework for how people are looking to address bias issues.

Compared to late-stage organizations where the focus is more on guardrails and bias mitigation, the earlier stage private sector organizations have more of a “build fast, grow fast” mentality. 

Often these are well-meaning guardrails from the past that haven't operated so well. For example, HIPAA is a bit antiquated; it's still analog-era regulation in a digitally connected age. We've seen patients die because we're waiting for the HIPAA sign-off to get their EKG sent over from another hospital. So I think there's probably some middle ground.

We still need to enable startups, hopefully in good context and good faith, to collect, leverage, and learn from data, but also to take on the responsibility of doing that in ways where they can optimally share it. And there's a role to educate startups, academics, and even clinicians today about where the biases may occur.

I think none of us, as clinicians, want more data. We want actionable insights, and how those get presented could start to inform us. With the patient in front of us, there's an opportunity to have them opt in, in appropriate ways, to share data if they're a member of, say, an underrepresented set, all the way to flagging where there might be bias, like a little "check engine" light.

Maybe you're on the path of managing a patient like you would the average European, when they actually have a lot of other elements in play, from socioeconomic factors to genetic determinants. So I think the design piece is key, as well as how to engage people in sharing and opting in.

We have these long legal forms about sharing data, and there are all these new ways to manage it, with blockchain and beyond. How do you explain that in smart ways to the folks you're asking to share and opt in to contribute some of their data and knowledge?

I think perfection is also the enemy of the good here. We need to be building stacks and new approaches, and allowing folks who are often underrepresented to become data donors, so we can have less biased starting points.

Final thoughts about data bias

When we're trying to abide by privacy and varying regulatory standards, we're often forced (for good reason) to strip extensive context from the data, so the algorithms are built from data with that context removed. Ideally, we'd like to maintain that context in those data sources.

We need to ensure secure, privacy-preserving, yet scalable approaches so we can collaborate on data broadly, to ultimately mitigate bias that emanates from limited data diversity. 

While some elements of data diversity are measurable (like what proportion of your population is African American versus Hispanic American versus Indian or other populations), there are some immeasurable aspects that you might need larger datasets to ultimately account for. In other words, we don't know what we don't know.

We need to have a means to verify the level of reliability. This means improving methods to understand how well an individual is truly represented within a dataset. How do we create systems to understand the representation of a given new individual coming in, and to whom this algorithm is now being applied?
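One way to make such a system concrete is an out-of-distribution check. The sketch below (the toy feature values and the cutoff are hypothetical) scores how well a new individual is covered by the training data using the average distance to their k nearest neighbors, and flags individuals the algorithm has effectively never seen:

```python
import math

# Illustrative sketch: estimate how well a new individual is represented
# in the training data via average distance to their k nearest neighbors.
# Smaller scores mean the individual looks like the training population.
def knn_representation_score(train, x, k=5):
    dists = sorted(math.dist(x, row) for row in train)
    return sum(dists[:k]) / k

# Toy training set clustered near the origin; units are arbitrary.
train = [(i * 0.1, j * 0.1) for i in range(-5, 6) for j in range(-5, 6)]

well_covered = knn_representation_score(train, (0.1, 0.2))
poorly_covered = knn_representation_score(train, (4.0, 4.0))

# Flag individuals the training data doesn't cover.
THRESHOLD = 1.0  # hypothetical cutoff, tuned per application
for name, score in [("in-distribution", well_covered),
                    ("out-of-distribution", poorly_covered)]:
    flag = "apply with caution" if score > THRESHOLD else "ok"
    print(f"{name}: score={score:.2f} -> {flag}")
```

In practice this kind of check would run over the model's real feature space, but the design choice is the same: measure representation per individual at application time, rather than assuming the aggregate dataset speaks for everyone.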

We can’t sit here and think that industry by itself, academia by itself, or government by itself are going to be able to — as a silo — figure out how to solve all of this. The creation of appropriate guardrails will likely require cross-sector consortiums.

Watch the full webinar here.