Towards Trustworthy AI: Investigating Bias and Confidence Alignment in Large Language Models
Abstract
As LLMs are increasingly integrated into critical fields such as healthcare, the judiciary, and education, thoroughly evaluating their trustworthiness is becoming ever more essential. This thesis presents a unified examination of two critical aspects of trustworthiness in LLMs: their self-evaluation of confidence and the subtler biases that shape their outputs. In the first part, we introduce the concept of Confidence-Probability Alignment to scrutinize how an LLM's internal confidence, indicated by token probabilities, aligns with the confidence it expresses when queried about its certainty. This analysis draws on diverse datasets and prompting techniques designed to encourage model introspection, such as structured evaluation scales and the inclusion of answer options. Notably, OpenAI's GPT-4 emerges as a leading example, demonstrating strong confidence-probability alignment and marking a step towards understanding and improving LLM reliability. The second part addresses the nuanced biases LLMs exhibit towards specific social narratives and identities, introducing the Representative Bias Score (RBS) and the Affinity Bias Score (ABS) to quantify them. Our exploration of representative and affinity biases through the Creativity-Oriented Generation Suite (CoGS) reveals a pronounced preference in LLMs for outputs reflecting the experiences of predominantly white, straight, and male identities. This trend not only mirrors but potentially exacerbates societal biases, highlighting a complex interaction between human and machine perceptions of bias.
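To make the alignment notion concrete, the sketch below illustrates one plausible way to quantify confidence-probability alignment: correlating a model's internal probability for its chosen answer with the confidence it verbalizes when asked. The use of Spearman's rank correlation and the paired values shown are illustrative assumptions for this sketch, not details taken from the thesis.

# Illustrative sketch (an assumption, not the thesis' implementation):
# quantify confidence-probability alignment as a rank correlation between
# a model's internal probability for its chosen answer and the confidence
# it states when asked. The pairs below are placeholders standing in for
# measurements collected over a QA dataset.
from scipy.stats import spearmanr

# Each pair: (token probability of the chosen answer, verbalized confidence
# rescaled to 0-1, e.g. from a "rate your confidence from 0 to 100" prompt).
records = [
    (0.92, 0.90),
    (0.55, 0.70),
    (0.31, 0.40),
    (0.87, 0.95),
    (0.12, 0.20),
]

internal_probs = [p for p, _ in records]
stated_confidences = [c for _, c in records]

# Higher rank correlation means the model's expressed certainty tracks its
# internal token probabilities more closely.
rho, p_value = spearmanr(internal_probs, stated_confidences)
print(f"Confidence-probability alignment (Spearman rho): {rho:.3f}, p = {p_value:.3f}")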
License: Creative Commons Attribution-ShareAlike 4.0 International