Bias
Bias is complicated and nuanced. Research has identified a wide range of sociotechnical harms from bias across LLMs. We separate the issues associated with bias from other LLM issues due to the complex and charged nature of the problem.
Feb 2025: An Adviser to Elon Musk’s xAI Has a Way to Make AI More Like Donald Trump (paywall) The research discussed in this article talks about both measuring and manipulating “entrenched preferences and values expressed by artificial intelligence models— including their political views.”
Sep 2024: Linguistic Bias in ChatGPT: Language Models Reinforce Dialect Discrimination The study finds that models default to “standard” varieties of English; it also finds that model responses to non-“standard” varieties consistently exhibit a range of issues: stereotyping (19% worse than for “standard” varieties), demeaning content (25% worse), lack of comprehension (9% worse), and condescending responses (15% worse).
Aug 2024: Gender, Race, and Intersectional Bias in Resume Screening via Language Model Retrieval The study finds that White-associated names are favored in 85.1% of cases and female-associated names in only 11.1% of cases, with a minority of cases showing no statistically significant differences. Further analyses show that Black males are disadvantaged in up to 100% of cases.
Aug 2024: CALM: A Multi-task Benchmark for Comprehensive Assessment of Language Model Bias Introduces Comprehensive Assessment of Language Models (CALM) for robust measurement of social biases
Aug 2024: Decoding Biases: Automated Methods and LLM Judges for Gender Bias Detection in Language Models Validates that the LLM-as-a-Judge metric aligns with human judgement on bias in response generation.
Jul 2024: Bias and Fairness in Large Language Models: A Survey A comprehensive survey of bias evaluation and mitigation techniques for LLMs.
May 2024: Akal Badi ya Bias: An Exploratory Study of Gender Bias in Hindi Language Technology Presents the first comprehensive study delving into the nuanced landscape of gender bias in Hindi, the third most spoken language globally.
May 2024: GPT is Not an Annotator: The Necessity of Human Annotation in Fairness Benchmark Construction Concludes that GPT3.5-Turbo is not an appropriate substitute for human annotation in sensitive tasks related to social biases, and that its use actually negates many of the benefits of community-sourcing bias benchmarks.
Feb 2024: The Political Preferences of LLMs An analysis surrounding the political preferences embedded in LLMs. Additionally, the work demonstrates that LLMs can be steered towards specific locations in the political spectrum through Supervised Fine-Tuning (SFT) with only modest amounts of politically aligned data, suggesting SFT's potential to embed political orientation in LLMs.
Dec 2023: “Fifty Shades of Bias”: Normative Ratings of Gender Bias in GPT Generated English Text A dataset of GPT-generated English text with normative ratings of gender bias.
Jan 2023: The political ideology of conversational AI: Converging evidence on ChatGPT’s pro-environmental, left-libertarian orientation Prompting ChatGPT with 630 political statements and the nation-agnostic political compass test in three pre-registered experiments, we uncover ChatGPT’s pro-environmental, left-libertarian ideology.
Mar 2022: BBQ: A Hand-Built Bias Benchmark for Question Answering A dataset of question sets that highlight attested social biases against people belonging to protected classes along nine social dimensions. Finds that models often rely on stereotypes when the context is under-informative, meaning the model’s outputs consistently reproduce harmful biases in this setting.
Jan 2022: A Survey on Bias and Fairness in Machine Learning Investigates different real-world applications that have shown biases in various ways, and lists different sources of biases that can affect AI applications. Creates a taxonomy for fairness definitions that machine learning researchers have defined in order to avoid the existing bias in AI systems. Examines different domains and subdomains in AI showing what researchers have observed with regard to unfair outcomes in the state-of-the-art methods and ways they have tried to address them.
Aug 2021: REDDITBIAS: A Real-World Resource for Bias Evaluation and Debiasing of Conversational Language Models The first conversational data set grounded in the actual human conversations from Reddit, allowing for bias measurement and mitigation across four important bias dimensions: gender, race, religion, and queerness. An evaluation framework which simultaneously 1) measures bias, and 2) evaluates model capability after model debiasing. Results indicate that DialoGPT is biased with respect to religious groups and that some debiasing techniques can remove this bias.
Jan 2021: BOLD: Dataset and Metrics for Measuring Biases in Open-Ended Language Generation Introduces the Bias in Open-Ended Language Generation Dataset (BOLD), a large-scale dataset of 23,679 English text prompts for bias benchmarking across five dimensions: profession, gender, race, religion, and political ideology. An examination of text generated from three popular language models reveals that the majority of these models exhibit a larger social bias than human-written Wikipedia text across all domains.
Nov 2019: The Woman Worked as a Babysitter: On Biases in Language Generation Introduces the notion of the “regard” towards a demographic, use the varying levels of regard towards different demographics as a defining metric for bias in NLG, and analyze the extent to which sentiment scores are a relevant proxy metric for regard. Builds an automatic regard classifier through transfer learning to analyze biases in unseen text.
Aug 2018: Examining Gender and Race Bias in Two Hundred Sentiment Analysis Systems Presents the Equity Evaluation Corpus (EEC), 8,640 English sentences carefully chosen to tease out biases towards certain races and genders. The dataset is used to examine 219 automatic sentiment analysis systems. Several of the systems show statistically significant bias; that is, they consistently provide slightly higher sentiment intensity predictions for one race or one gender.