Building Fairness into Machine Learning

Stephen DeAngelis

September 25, 2018

Numerous articles have discussed how bias can creep into machine learning (ML) solutions. Bias is never a good thing, especially when it affects people’s lives. The good news, according to Tony Baer (@TonyBaer), head of Big Data research at Ovum, is that fairness and bias are now on the radar of “sophisticated” machine learning companies.[1] He defines sophisticated companies as those having at least five years of machine learning experience. He cites the results of an O’Reilly Media survey that found a “surprisingly high proportion of respondents (almost half) among the more experienced ‘sophisticated’ group reporting that their organizations were at least aware and are starting to vet their models and data samples for bias, fairness, and privacy. Of course with the onset of GDPR, their organizations are probably not giving them a choice. And also, this group, being the most elite, should have more awareness.” The open question, he adds, “is whether such awareness will spread as adoption spreads to the more general enterprise population.”

 

Why Fairness Matters

 

Yaroslav Kuflinski, a sales executive at Itransition, reports, “A county justice system in Florida deployed an ML algorithm to predict the chances of a repeat offense, and thus important to use this prediction among other criteria to determine the eligibility of inmates for parole. Here is a machine learning algorithm upon which the lives and hopes of a significant number of desperate incarcerated people depend. But sadly, the inaccuracy of this system has repeatedly granted parole to repeat offenders, and it has even been determined by an independent analyst group that racism has actually been built into the predictive algorithm!”[2] Whenever people’s lives can be seriously affected, fairness matters. Four distinguished scholars, Ahmed Abbasi, Jingjing Li (@Vivianchat177), Gari Clifford (@GariClifford), and Herman Taylor, observe, “Machine learning is increasingly being used to predict individuals’ attitudes, behaviors, and preferences across an array of applications,” ranging “from personalized marketing to precision medicine.” They continue, “Unsurprisingly, given the speed of change and ever-increasing complexity, there have been several recent high-profile examples of ‘machine learning gone wrong’.”[3] They provide a few prominent examples of such episodes:

“A chatbot trained using Twitter was shut down after only a single day because of its obscene and inflammatory tweets. Machine learning models used in a popular search engine struggle to differentiate human images from those of gorillas, and show female searchers ads for lower paying jobs relative to male users. More recently, a study compared the commonly used crime risk analysis tool COMPAS against recidivism predictions from 400 untrained workers recruited via Amazon Mechanical Turk. The results suggest that COMPAS has learned implicit racial biases, causing it to be less accurate than the novice human predictors.”

Although some people want to lay all of the blame on machine learning, Abbasi and his colleagues note, “When models don’t perform as intended, people and process are normally to blame.” They add that bias “can manifest itself in many forms across various stages of the machine learning process, including data collection, data preparation, modeling, evaluation, and deployment.” They list four types of bias commonly associated with machine learning:

 

  • Sampling bias may produce models trained on data that is not fully representative of future cases.
  • Performance bias can exaggerate perceptions of predictive power, generalizability, and performance homogeneity across data segments.
  • Confirmation bias can cause information to be sought, interpreted, emphasized, and remembered in a way that confirms preconceptions.
  • Anchoring bias may lead to over-reliance on the first piece of information examined.
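
Performance bias, in particular, is easy to miss when only a single headline number is reported. The following is a minimal sketch, not the authors' method, of how one might compare overall accuracy with accuracy per data segment. The function name `accuracy_by_segment` and the toy data are hypothetical and chosen purely for illustration.

```python
from collections import defaultdict


def accuracy_by_segment(y_true, y_pred, segments):
    """Return (overall accuracy, {segment: accuracy}) for parallel lists."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for truth, pred, seg in zip(y_true, y_pred, segments):
        totals[seg] += 1
        hits[seg] += int(truth == pred)
    overall = sum(hits.values()) / sum(totals.values())
    per_segment = {seg: hits[seg] / totals[seg] for seg in totals}
    return overall, per_segment


# Hypothetical toy data: segment "a" dominates, segment "b" is a small minority.
y_true = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0, 0, 1]
segments = ["a"] * 8 + ["b"] * 2

overall, per_segment = accuracy_by_segment(y_true, y_pred, segments)
print(overall)       # 0.8
print(per_segment)   # {'a': 1.0, 'b': 0.0}
```

In this toy case the overall figure of 80 percent hides the fact that every prediction for the smaller segment is wrong, which is the kind of uneven performance the second bullet warns about.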

 

Michal Gabrielczyk, a Senior Technology Strategy Consultant at Cambridge Consultants, observes that there is an important, ongoing debate about the unintended consequences of machine learning that simply behaves unexpectedly and causes damage or loss.[4] To avoid these consequences, he suggests, “ML development needs to abide by some principles which mitigate against its risks.”

 

How to Build Fairness into Machine Learning

 

Gabrielczyk notes, “It is not clear who will ultimately impose rules if any are imposed at all.” Nevertheless, he suggests several principles, and Abbasi and his colleagues suggest several specific rules, that developers can use to build fairness into their machine learning systems. Gabrielczyk’s principles are:

 

  • Responsibility: “There needs to be a specific person responsible for the effects of an autonomous system’s behavior. This is not just for legal redress but also for providing feedback, monitoring outcomes and implementing changes.”
  • Explainability: “It needs to be possible to explain to people impacted (often laypeople) why the behavior is what it is.”
  • Accuracy: “Sources of error need to be identified, monitored, evaluated and if appropriate mitigated against or removed.”
  • Transparency: “It needs to be possible to test, review (publicly or privately) criticize and challenge the outcomes produced by an autonomous system. The results of audits and evaluation should be available publicly and explained.”
  • Fairness: “The way in which data is used should be reasonable and respect privacy. This will help remove biases and prevent other problematic behaviors from becoming embedded.”

 

The five specific rules suggested by Abbasi, Li, Clifford, and Taylor are:

 

1. Pair data scientists with a social scientist. “Data scientists and social scientists speak somewhat different languages. To a data scientist, ‘bias’ has a particular technical meaning — it refers to the level of segmentation in a classification model. Similarly, the term ‘discriminatory potential’ refers to the extent to which a model can accurately differentiate classes of data (e.g., patients at high versus low risk of cardiovascular disease). In data science, greater ‘discriminatory potential’ is a primary goal. By contrast, when social scientists talk about bias or discrimination, they’re more likely to be referring to questions of equity. Social scientists are generally better equipped to provide a humanistic perspective on fairness and bias.”

 

2. Annotate with caution. “Unstructured data, such as text and images, often is generated by human annotators who provide structured category labels that are then used to train machine learning models. For instance, annotators can label images containing people, or mark which texts contain positive versus negative sentiments. Human annotation services have become a major business model, with numerous platforms emerging at the intersection of crowd-sourcing and the gig economy. Although the quality of annotation is adequate for many tasks, human annotation is inherently prone to a plethora of culturally ingrained biases.”

 

3. Combine traditional machine learning metrics with fairness measures. “The performance of machine learning classification models is typically measured using a small set of well-established metrics that focus on overall performance, class-level performance, and all-around model generalizability. However, these can be augmented with fairness measures designed to quantify machine learning bias. Such key performance indicators are essential for garnering situational awareness — as the saying goes, ‘if it cannot be measured, it cannot be improved’.”

 

4. When sampling, balance representativeness with critical mass constraints. “For data sampling, the age-old mantra has been to ensure that samples are statistically representative of the future cases that a given model is likely to encounter. This is generally a good practice. The one issue with representativeness is that it undervalues minority cases — those that are statistically less common. While at the surface this seems intuitive and acceptable — there are always going to be more- and less-common cases — issues arise when certain demographic groups are statistical minorities in your dataset. Essentially, machine learning models are incentivized to learn patterns that apply to large groups, in order to become more accurate, meaning that if a particular group isn’t well represented in your data, the model will not prioritize learning about it.”

 

5. When building a model, keep de-biasing in mind. “Even with the aforementioned steps, de-biasing during the model building and training phase is often necessary. Several tactics have been proposed. One approach is to completely strip the training data of any demographic cues, explicit and implicit. … Another approach is to build fairness measures into the model’s training objectives, for instance, by ‘boosting’ the importance of certain minority or edge cases.”
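
Rules 3 through 5 can be illustrated with a small sketch. The authors do not name specific measures or algorithms, so everything below is an assumption made for illustration: the selection-rate gap (often called the demographic parity difference) stands in for a "fairness measure," a top-up step stands in for a "critical mass" constraint during sampling, and inverse-frequency example weights stand in for "boosting" minority cases. All function names and data are hypothetical.

```python
import random
from collections import Counter, defaultdict


def selection_rate_gap(y_pred, groups):
    """Rule 3: largest difference in positive-prediction rate between groups."""
    positives = defaultdict(int)
    totals = Counter(groups)
    for pred, group in zip(y_pred, groups):
        positives[group] += int(pred == 1)
    rates = {g: positives[g] / totals[g] for g in totals}
    return max(rates.values()) - min(rates.values()), rates


def sample_with_floor(rows, group_of, n, min_per_group, seed=0):
    """Rule 4: draw n rows at random, then top up any group that fell below
    min_per_group, as far as the remaining data allows."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    sample, rest = shuffled[:n], shuffled[n:]
    counts = Counter(group_of(r) for r in sample)
    for row in rest:
        g = group_of(row)
        if counts[g] < min_per_group:
            sample.append(row)
            counts[g] += 1
    return sample


def inverse_frequency_weights(groups):
    """Rule 5: weight each example by total / (n_groups * group_count), so
    every group carries the same total weight during training."""
    counts = Counter(groups)
    total, k = len(groups), len(counts)
    return [total / (k * counts[g]) for g in groups]


# Hypothetical toy data: group "a" has 8 rows, group "b" has 2.
groups = ["a"] * 8 + ["b"] * 2
y_pred = [1, 1, 1, 1, 1, 1, 0, 0, 1, 0]

gap, rates = selection_rate_gap(y_pred, groups)
print(rates, gap)                 # {'a': 0.75, 'b': 0.5} 0.25

rows = list(zip(range(10), groups))
sample = sample_with_floor(rows, lambda r: r[1], n=4, min_per_group=2)
print(Counter(g for _, g in sample))

weights = inverse_frequency_weights(groups)
print(weights[0], weights[-1])    # 0.625 2.5
```

The weights could be passed to any training routine that accepts per-example weights. Whether a selection-rate gap is the right fairness measure for a given application is a judgment call, which is one reason the authors recommend pairing data scientists with social scientists.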

 

Even with the best intentions, bias can creep into machine learning results. Developers and users must remain constantly vigilant to minimize its deleterious effects.

 

Footnotes
[1] Tony Baer, “Taking the pulse of machine learning adoption,” ZDNet, 8 August 2018.
[2] Yaroslav Kuflinski, “The Highs and Lows of AI and Machine Learning,” Read IT Quik, 1 August 2018.
[3] Ahmed Abbasi, Jingjing Li, Gari Clifford, and Herman Taylor, “Make ‘Fairness by Design’ Part of Machine Learning,” Harvard Business Review, 1 August 2018.
[4] Michal Gabrielczyk, “How to develop machine learning responsibly,” Jaxenter, 11 July 2018.