ClosedLoop was recently recognized as the winner of the AI for Health Outcomes Challenge by the Centers for Medicare and Medicaid Services (CMS). The criteria for judgment were the overall accuracy of the model, the transparency of the decision criteria, and a demonstration that the models are not biased. To address the last point, we conducted an extensive review of the literature on the topic and found the framework proposed by Paulus and Kent to be a comprehensive way of understanding it. In this post, we’ll discuss a few important concepts introduced in the paper and how they inform an approach to ensuring your models adhere to high ethical standards.
The first concept worth understanding is decision polarity. When an individual being judged by an algorithm would prefer a “positive” label over the “correct” label, we refer to the decision as polar. Most canonical examples, such as parole, bank loan, and job application decisions, are polar decisions. In the context of healthcare AI, most population health models fit the bill as well, since they tend to focus on identifying small subpopulations to receive finite care management resources. In contrast, non-polar decisions occur when the person applying the model and the person to whom the model is applied have a mutually aligned interest in the correct decision being made. An example of a non-polar decision is using an algorithm to determine the correct medication dosage for a given patient: both the physician and the patient want a dose that is effective and avoids the negative effects of overmedication.
Now that we’ve set the stage, we can discuss the difference between model bias and model unfairness. Model unfairness is typically discussed in the context of polar decisions and aims to quantify how well the distribution of benefit adheres to one of many measures of algorithmic fairness. It is worth noting that, outside of the case of a “perfect” model, it is not possible to satisfy multiple definitions of fairness at the same time. We have previously discussed our metric for fairness, Group Benefit Equality, and why we think it is an ideal measure of fairness for healthcare AI models.
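To make the general idea concrete, here is a minimal sketch of a group-benefit-style check in Python. It is not ClosedLoop’s exact Group Benefit Equality implementation, and the column names (`group`, `y_true`, `y_pred`) are hypothetical: for each subgroup, it compares the rate at which the model flags members for intervention against the rate at which the outcome actually occurs.

```python
import pandas as pd

def group_benefit_ratios(df, group_col="group", y_true_col="y_true", y_pred_col="y_pred"):
    """For each subgroup, compare the fraction flagged by the model to the
    fraction that actually experienced the outcome. A ratio near 1.0 means
    the group receives the benefit (e.g., care management outreach) roughly
    in proportion to its underlying need."""
    ratios = {}
    for name, grp in df.groupby(group_col):
        flagged_rate = grp[y_pred_col].mean()   # fraction predicted positive
        outcome_rate = grp[y_true_col].mean()   # fraction with the true outcome
        ratios[name] = flagged_rate / outcome_rate if outcome_rate > 0 else float("nan")
    return pd.Series(ratios, name="benefit_ratio")

# Toy example: binary flags after thresholding model scores.
df = pd.DataFrame({
    "group":  ["A", "A", "A", "B", "B", "B"],
    "y_true": [1, 0, 1, 1, 1, 0],
    "y_pred": [1, 0, 0, 1, 0, 0],
})
print(group_benefit_ratios(df))
```

Groups whose ratio falls well below that of other groups are receiving less of the benefit than their observed need would suggest.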
Fairness must be measured; however, fairness measures offer little insight into how to better serve communities that are not well served by the model. In the context of AI, we refer to bias as issues in model design, data, and sampling that result in measurably different performance of the model for different subgroups. Bias should be thought of as a more holistic lens for understanding how your model is shaped by the dynamics affecting each subpopulation.
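As a sketch of what “measurably different performance for different subgroups” can look like in practice, the following audit reports discrimination and calibration separately per subgroup. It assumes scikit-learn, probability-scale scores, and hypothetical column names (`group`, `y_true`, `score`).

```python
import pandas as pd
from sklearn.metrics import roc_auc_score, brier_score_loss

def performance_by_subgroup(df, group_col="group", y_true_col="y_true", score_col="score"):
    """Report AUC (discrimination) and Brier score (calibration) for each
    subgroup, so performance gaps are visible rather than averaged away in a
    single population-level metric. `score` is assumed to be a predicted
    probability in [0, 1]."""
    rows = []
    for name, grp in df.groupby(group_col):
        if grp[y_true_col].nunique() < 2:
            continue  # AUC is undefined when a subgroup contains only one class
        rows.append({
            "group": name,
            "n": len(grp),
            "auc": roc_auc_score(grp[y_true_col], grp[score_col]),
            "brier": brier_score_loss(grp[y_true_col], grp[score_col]),
        })
    return pd.DataFrame(rows)
```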
The starting point for examining algorithmic bias must always be an examination of the data itself. If the dataset is collected under unequal conditions, it is very unlikely that the resulting model will be able to overcome this problem. The social sciences have a concept called “WEIRD”: most study data comes from people with western, educated, industrialized, rich, and democratic backgrounds. The same issue affects healthcare data, where access to care skews heavily along racial and socioeconomic lines. Most studies also skew toward median ages, as clinical trials have a vested interest in excluding individuals at the far ends of the age distribution.
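One simple starting point is to compare how each subgroup is represented in the training data against a reference population, such as the health plan’s full membership or census figures. A minimal sketch, using hypothetical data and column names:

```python
import pandas as pd

# Hypothetical reference shares, e.g., from the full member population or census data.
reference_shares = pd.Series({"Group A": 0.60, "Group B": 0.25, "Group C": 0.15})

def representation_gap(train, reference_shares, group_col="group"):
    """Compare each subgroup's share of the training data to its share of the
    reference population. Large negative gaps flag under-representation that
    the model is unlikely to overcome on its own."""
    train_shares = train[group_col].value_counts(normalize=True)
    return pd.DataFrame({
        "train_share": train_shares,
        "reference_share": reference_shares,
        "gap": train_shares - reference_shares,
    })
```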
Next is the concept of label choice bias. Individuals constructing models often predict proxy outcomes, particularly when building population health models. The most common articulation of this phenomenon is using cost to predict how sick an individual is. Obermeyer et al. examined the validity of this practice and demonstrated that, by using cost as a proxy for health, the average black individual had to carry a significantly higher disease burden than their white counterparts to qualify for care management. To measure this, the team leveraged the idea of subgroup validity, examining calibration plots between the predicted outcomes and alternative proxies that also aim to quantify a patient’s health. A similar effect can occur if cost is used as a feature.
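A hedged sketch of that kind of subgroup-validity check is below. The column names are hypothetical: `predicted_cost_risk` stands in for the proxy-label model’s score, `n_chronic_conditions` for an alternative measure of health, and `race` for the subgroup of interest.

```python
import pandas as pd

def subgroup_calibration(df, score_col="predicted_cost_risk",
                         proxy_col="n_chronic_conditions",
                         group_col="race", n_bins=10):
    """Bin patients by predicted risk, then compare the mean of an alternative
    health proxy (e.g., count of active chronic conditions) across groups
    within each bin. If one group carries a higher disease burden at the same
    predicted risk, the chosen label is disadvantaging that group."""
    df = df.copy()
    df["risk_bin"] = pd.qcut(df[score_col], q=n_bins, labels=False, duplicates="drop")
    return (df.groupby(["risk_bin", group_col])[proxy_col]
              .mean()
              .unstack(group_col))
```

Plotting the resulting table, one line per group, reproduces the style of calibration comparison described above.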
There is also room for an extended discussion of race being used directly as a feature in algorithms. Eneanya, Yang, & Reese’s paper traces the dubious origins of race correction in eGFR calculations, which carries consequences for treatment options, inclusion in clinical trials, and selection for transplants. They cite a framework for the inclusion of race in medical modeling that states its inclusion is only justified when the use confers substantial benefit, the benefit cannot be achieved through other feasible approaches, patients who reject race categorization are accommodated fairly, and the use of race is transparent. In the context of population health models, the underlying assumption is typically that racism affects patients’ health, so including race is more justifiable than in cases where a purely biological mechanism is claimed. Even so, it is usually better to include social determinants of health (SDOH) measures, which capture the lower-level factors involved in the causal mechanisms.
Having said all that, it is reasonable to contrast the two concepts. Given that bias is a more comprehensive look at the performance and construction of the model, it would be natural to assume it is the “better” approach. However, it is possible to create a model that is “unbiased” but still unfair. It is also possible to create a “fair” model that carries underlying bias; this is typically done by correcting scores after the model results are calculated, as in the sketch below. Even after expending the effort to ensure a model is unbiased, corrections for fairness are typically still merited.
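For instance, one common post-hoc correction is to choose group-specific thresholds so that each group is selected in proportion to its observed need. This is a minimal, hypothetical sketch of that idea, not a recommendation to skip the upstream bias work it can mask.

```python
import pandas as pd

def group_specific_thresholds(df, score_col="score", y_true_col="y_true",
                              group_col="group"):
    """Post-hoc correction: pick a per-group score threshold so that each
    group is flagged at (approximately) the same rate as its observed outcome
    rate. This can make an otherwise biased model look "fair" on a selection
    metric without addressing the underlying data or label issues."""
    thresholds = {}
    for name, grp in df.groupby(group_col):
        target_rate = grp[y_true_col].mean()               # observed need in this group
        # Flag roughly the top `target_rate` fraction of this group's scores.
        thresholds[name] = grp[score_col].quantile(1 - target_rate)
    return thresholds
```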
Interested in reading more about bias and fairness in healthcare data science? Check out these blog posts: