Can we stop AI from inheriting our biases? | Julia Mann | TEDxRWTHAachen
Julia Mann, who works in AI, explains that AI bias stems from imperfect, often uncurated human-generated data and from the biases of AI's human developers. She argues that while eliminating bias is difficult because AI seeks patterns (and therefore disregards outliers), a conscious collective effort to contribute diverse, unfiltered real-world data can move training datasets toward a uniformly distributed state.
## Speakers & Context
- Julia Mann — individual working in the field of AI; not an expert in AI ethics.
- Context is an educational talk aiming to explain AI bias and methods to reduce it.
## Theses & Positions
- Bias is an inherent human trait, manifesting as a tendency or preference for something or someone.
- AI bias exists not because mathematics is biased, but because AI relies on biased data.
- The development of AI bias is rooted in three areas: how data is collected, historical/societal changes, and human behavior/culture.
- The most severe consequences of AI bias appear in real-world impact, such as wrongful arrests and incarcerations.
- Reducing bias requires moving from data sets with "extreme mathematically seen tails" to "uniformly distributed" and diverse data sets.
## Concepts & Definitions
- **Bias (General):** Having a tendency or a preference for something or someone.
- **Conscious Bias:** Awareness of one's own preferences or biases.
- **Subconscious Bias:** Biases exhibited without conscious awareness.
- **Availability Bias:** The tendency for previous thoughts or experiences to unduly influence current judgments (demonstrated via Google search results).
- **Uncurated data:** Data from the publicly available internet that is raw and unfiltered.
- **Uniformly distributed data set:** A data set that lacks bias and is highly diverse.
- **Outliers:** Data points that do not occur frequently enough and are often disregarded by AI algorithms.
## Mechanisms & Processes
- **Language and Gender:** Languages like German are grammatically gendered (*Ingenieur* for a male engineer, *Ingenieurin* for a female one), while English is a natural-gender language whose nouns carry no grammatical gender.
- **AI Training:** Large language models (LLMs) are trained on massive, publicly available datasets.
- **Data Collection Bias:** Bias can enter datasets based on who collects the data, where it's collected from, and the collection method itself.
- **Algorithm Function:** AI algorithms seek patterns by identifying the most frequent and significant occurrences, thereby disregarding outliers.
- **Bias Mitigation Goal:** The process of moving a dataset from containing biased extreme tails to a uniformly distributed state to reduce AI bias.
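The pattern-seeking mechanism above can be sketched in a few lines. This is a minimal illustration (not the talk's actual algorithm, and the labels are hypothetical): a "model" reduced to predicting the most frequent occurrence will reproduce whatever dominates its data and silently drop the outliers, whereas a uniformly distributed dataset gives it no majority to latch onto.

```python
from collections import Counter

def most_frequent_label(dataset):
    """A pattern-seeker reduced to its essence: it returns whatever
    occurs most often, so infrequent cases (outliers) never surface
    in its output."""
    return Counter(dataset).most_common(1)[0][0]

# Biased dataset with an extreme "tail": the minority group is discarded.
skewed = ["male engineer"] * 90 + ["female engineer"] * 10
print(most_frequent_label(skewed))   # -> 'male engineer'

# Uniformly distributed dataset: no group dominates the pattern.
uniform = ["male engineer"] * 50 + ["female engineer"] * 50
print(Counter(uniform))
```

The point is not the toy code but the shape of the failure: as long as one group dominates the frequency counts, the learned "pattern" and the bias are the same thing.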
## Timeline & Sequence
- **Pre-AI:** Bias demonstration using *ingenieur/ingenieurin* comparison and the bride image experiment.
- **AI Application:** Repeating the image prompt experiment using a search engine on a phone.
- **Data Scale:** The current version of the LAION dataset contains roughly **6 billion data points**.
- **Bias Manifestation in Justice:** Since **2018**, at least **eight people** have been wrongfully arrested on the basis of an AI-powered algorithm.
- **Regulation Status:** Regulation is emerging, notably the **AI Act** of the **European Union**.
## Named Entities
- **Julia Mann** — speaker and AI researcher.
- **German and French** — examples of gendered languages.
- **Midjourney, DALL-E** — generative AI models used for image generation.
- **LAION dataset** — the large public dataset used to train these generative models.
- **European Union** — issuing regulations concerning AI.
## Numbers & Data
- **6 billion data points:** The current size of the LAION dataset.
- **2018:** Starting year for the documented false convictions.
- **Eight people:** The documented number of people falsely convicted using AI since 2018.
- **Two plus two equals four:** Example of a factual, unbiased mathematical truth.
## Examples & Cases
- **Engineer Visualization:** Difference between imagining an engineer in English vs. German/French.
- **Bride Visualization:** Difference in the initial, unconscious mental image of a bride across different prompts.
- **Search Engine Results:** Initial images seen when searching "bride" online, mirroring previous personal biases.
- **Misuse in Criminal Justice:** Cases where people of color were falsely identified as criminal suspects and jailed for multiple days.
- **Data Curation Failure:** Deleting pornographic images from a dataset left the remaining images skewed toward male subjects, introducing a male-dominance bias.
## Tools, Tech & Products
- **Search Engine (on phone):** Tool used by the audience to test image bias.
- **Generative AI Models:** Tools like Midjourney and DALL-E that create images based on their training datasets.
- **AI algorithms:** The underlying mathematical systems that find patterns in data.
## References Cited
- **AI Act (European Union):** Legislation regulating the development and application of AI.
## Trade-offs & Alternatives
- **Data Cleaning Trade-off:** Removing unwanted content (pornographic images) introduced a new bias of male dominance in the remaining data.
- **Best Case State:** Achieving a "uniformly distributed" data set, moving away from biased tails.
## Counterarguments & Caveats
- **Mathematical Determinism:** The argument that maths itself is factual and non-biased ("Two plus two is four").
- **Inherent Difficulty:** The speaker notes that fixing bias is "actually not that easy because of its nature."
- **Data Source Limitation:** The data is drawn from the "publicly available internet," which is inherently raw and unfiltered.
## Methodology
- **Comparative Visualization Experiment:** Comparing mental images prompted by language structure (English vs. German).
- **AI Performance Testing:** Using personal phones and search engines to demonstrate immediate search bias.
- **Algorithmic Analysis:** Identifying how pattern-seeking behavior causes outliers to be discarded.
- **Data Correction Theory:** The theoretical goal of transforming a biased data distribution into a uniform one.
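The data-correction idea above can be made concrete with a naive rebalancing sketch (an assumption on my part, not a method from the talk; the group labels are hypothetical): duplicate under-represented items until every group appears equally often, pushing a skewed distribution toward a uniform one.

```python
import random
from collections import Counter

def rebalance(dataset, seed=0):
    """Naive correction sketch: oversample under-represented groups
    until all groups appear equally often, i.e. transform a biased
    distribution with extreme tails into a uniform one."""
    rng = random.Random(seed)
    counts = Counter(dataset)
    target = max(counts.values())
    balanced = list(dataset)
    for label, n in counts.items():
        balanced += [label] * (target - n)  # duplicate the tail items
    rng.shuffle(balanced)
    return balanced

skewed = ["group_a"] * 90 + ["group_b"] * 10
print(Counter(rebalance(skewed)))  # both groups now appear 90 times
```

Naive oversampling only papers over the skew by repeating the same tail examples; the speaker's actual recommendation is stronger: collect genuinely diverse, unfiltered real-world data so the uniform distribution arises from authentic variety rather than duplication.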
## Conclusions & Recommendations
- The ultimate responsibility for bias reduction lies with the collective human contribution of diverse, authentic data.
- When contributing data that may end up training AI models, people should ensure it is "real data without any filter with red spots in your face, if that’s how it is, with no makeup, if that’s how it is."
- The goal is to show AI "how authentic and diverse we humans are."
## Implications & Consequences
- If AI is used for high-stakes decisions (like criminal justice), its known biases pose severe risks, as seen with false arrests.
- AI cannot inherently correct for the societal biases embedded in its massive training data derived from the public internet.
## Verbatim Moments
- *"I am biased, but I’m fine with it."*
- *"English is a natural gender language, which means the nouns don’t have a gender."*
- *"We all are biased somehow."*
- *"availability bias, which means that my previous thoughts, they influence my thoughts."*
- *"Bias can arise from the way we collect the data. So who we collect it from, where we collect it from, who collects the data?"*
- *"We don't want to have to have that in it."*
- *"it's actually not that easy to fix the bias in AI. A data set that has bias, has some extreme mathematically seen tails, we call it, and a data set that does not have bias is called uniformly distributed."*
- *"But together we can show AI, how authentic and diverse we humans are."*