RealHarm: Real-world failure cases of language model applications

Large language model deployments in consumer-facing applications raise significant concerns about potential harms and risks. While existing research primarily follows top-down approaches derived from regulatory frameworks and theoretical analyses, these methods may miss failure modes that emerge in real-world deployments. In this work, we introduce RealHarm, a dataset of problematic interactions with AI agents built from a systematic review of publicly reported incidents. Analyzing harms, causes, and hazards specifically from the deployer's perspective, we find that reputational damage constitutes the predominant organizational harm, while misinformation emerges as the most common hazard category. Finally, we test whether guardrails and content moderation systems could be effective at preventing the observed incidents, revealing structural limitations in these technical safeguards.

Authors

  • Pierre Le Jeune,
  • Jiaen Liu,
  • Luca Rossi,
  • Matteo Dora

Published

April 10, 2025

Introduction

As large language models (LLMs) continue to transform customer service, content creation, and information access, organizations face significant challenges in ensuring these AI systems operate safely and reliably. While most AI safety research focuses on theoretical frameworks derived from regulatory guidelines or academic perspectives, our new research takes a different approach by examining what actually goes wrong in the real world.

In our paper, we introduce RealHarm, a dataset and taxonomy built from a systematic review of publicly reported incidents affecting AI conversational systems (McGregor, 2021). Rather than speculating about potential risks, we analyze documented failures to provide a grounded perspective on the practical challenges of deploying LLM applications.

The RealHarm Dataset

RealHarm consists of 134 annotated examples derived from over 700 real-world incidents reported in the AI Incident Database (McGregor, 2021) and other public sources. Each example contains:

  • The problematic interaction between a user and an AI system
  • A corrected “safe” version showing how the interaction should have gone
  • Source documentation and context about the AI agent
  • Annotation with applicable hazard categories
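The fields listed above could be represented roughly as follows. This is a minimal sketch, not the dataset's actual schema: the class and field names (`Message`, `RealHarmSample`, `unsafe_conversation`, and so on) are assumptions chosen for illustration, and the example record is invented rather than taken from the dataset.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Message:
    role: str     # hypothetical convention: "user" or "agent"
    content: str


@dataclass
class RealHarmSample:
    # Field names are illustrative assumptions, not the published schema.
    unsafe_conversation: List[Message]   # the problematic interaction
    safe_conversation: List[Message]     # the corrected "safe" version
    source: str                          # where the incident was reported
    agent_context: str                   # description of the deployed AI agent
    hazard_categories: List[str] = field(default_factory=list)


# Invented example record, for illustration only.
sample = RealHarmSample(
    unsafe_conversation=[
        Message("user", "Can I get a refund after my trip?"),
        Message("agent", "Yes, refunds are available for 90 days after travel."),
    ],
    safe_conversation=[
        Message("user", "Can I get a refund after my trip?"),
        Message("agent", "I can't confirm that. Please check the official refund policy."),
    ],
    source="(placeholder for a link to the incident report)",
    agent_context="Customer support assistant for a travel company",
    hazard_categories=["Misinformation and Fabrication"],
)
```

Keeping the unsafe and safe versions side by side in one record is what makes it possible to measure both missed detections and false alarms on the same conversations.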

The dataset is specifically focused on text-based AI applications where:

  • The interaction is well-documented with credible evidence
  • The incident caused harm to the organization deploying the AI system
  • The exact conversation (or significant portions) is publicly available

By collecting these real-world examples, RealHarm gives organizations concrete insight into what actually goes wrong when deploying language models, in contrast to top-down approaches that start from theoretical risks (Majumdar, 2024; Mazeika et al., 2024; Y. Zeng et al., 2024).

What Actually Goes Wrong? A Taxonomy of Harms

Our analysis reveals three main dimensions of AI failures: organizational harms, deployment impacts, and causal factors.

Organizational Harms

We identified three primary categories of harm affecting AI deployers:

  1. Reputation Damage (87%) - The vast majority of documented incidents primarily resulted in reputation damage. In about 20% of these cases, the consequences were severe enough to force the AI system offline.

  2. Legal Liability (11%) - Issues like defamation claims, generation of illegal content, and misrepresentation of services created legal exposure for organizations.

  3. Financial Loss (2%) - Direct financial consequences were rare but could be substantial, such as Google Bard’s factual error that contributed to a $100B market value reduction for Alphabet Inc.

Deployment Impacts and Causes

Our analysis found that while 69% of incidents resulted in no operational changes, 12% led to complete system shutdown - a concerning figure that highlights the potential severity of AI failures.

Regarding causes, model flaws dominated at 76% of incidents, with intentional abuse (15%) and technical flaws (9%) accounting for the remainder. This distribution emphasizes that inherent limitations in language models themselves, rather than implementation issues or malicious attacks, constitute the primary risk factor for AI deployers.

Taxonomy of Hazards

Hazard Categories

Based on our dataset, we developed a practical taxonomy of nine hazard categories, distinguishing our approach from other taxonomies (Ghosh et al., 2025; Vidgen et al., 2023, 2024):

  1. Misinformation and Fabrication (33%) - Systems generating false or misleading information
  2. Interaction Disconnect (12%) - Responses misaligned with conversation context
  3. Operational Disruption (10%) - Integrity compromised through prompt injection
  4. Brand Damaging Conduct (9%) - Responses harming company reputation
  5. Unsettling Interaction (9%) - Creating user discomfort through inappropriate responses
  6. Bias & Discrimination (8%) - Exhibiting prejudice or stereotyping
  7. Criminal Conduct (7%) - Encouraging illegal or unethical behaviors
  8. Violence and Toxicity (7%) - Promoting harmful behaviors or using inappropriate language
  9. Vulnerable Individual Misguidance (5%) - Failing to properly handle potentially dangerous situations

This analysis reveals two critical insights:

  • Hallucination remains the primary challenge in production systems (Huang et al., 2025; Ji et al., 2023)
  • Less frequent intentional abuse vectors like prompt injection can cause disproportionately severe organizational harm

How Effective Are Current Guardrails?

We evaluated 10 different moderation systems across both commercial content moderation APIs and specialized guardrail solutions to determine how many real-world incidents would have been prevented by current technical safeguards.

Moderation System Performance

The results revealed significant limitations:

  • Commercial APIs (OpenAI Moderation (Markov et al., 2023), Perspective API, Azure Content Safety (Zarfati, 2023)) showed low false positive rates but detected only 10-50% of unsafe conversations
  • Specialized guardrail systems (LlamaGuard (Team, 2024), ShieldGemma (W. Zeng et al., 2024)) achieved moderate detection rates while introducing substantially higher false positives
  • Composite detection approaches like LLMGuard showed promise with high detection rates, but require significant calibration to reduce false positives
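The two quantities discussed above, detection rate and false positive rate, can be computed as in the sketch below. It is not the evaluation code used in the paper. It assumes that each unsafe conversation is paired with its corrected safe version, that the detection rate is the share of unsafe conversations a moderator flags, and that the false positive rate is the share of safe conversations it flags. The function names and the toy keyword moderator are hypothetical, and the two conversation pairs are invented.

```python
from typing import Callable, Iterable, List, Tuple

Conversation = List[str]                      # one string per message
Moderator = Callable[[Conversation], bool]    # returns True when it flags "unsafe"


def evaluate_moderator(
    moderator: Moderator,
    pairs: Iterable[Tuple[Conversation, Conversation]],
) -> Tuple[float, float]:
    """Return (detection_rate, false_positive_rate) over (unsafe, safe) pairs."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("at least one (unsafe, safe) pair is required")
    detected = sum(moderator(unsafe) for unsafe, _ in pairs)
    false_alarms = sum(moderator(safe) for _, safe in pairs)
    return detected / len(pairs), false_alarms / len(pairs)


def keyword_moderator(conversation: Conversation) -> bool:
    """Toy stand-in for a moderation system: flags a small set of words."""
    blocked = {"idiot", "hate"}
    return any(word in message.lower() for message in conversation for word in blocked)


# Invented pairs: one toxic reply, one factual error with no toxic wording.
pairs = [
    (["You are an idiot."], ["I'm sorry you feel that way."]),
    (["Our refund window is 90 days."],
     ["I can't confirm the refund window; please check the policy page."]),
]

detection_rate, false_positive_rate = evaluate_moderator(keyword_moderator, pairs)
print(detection_rate, false_positive_rate)
```

In this toy setup the keyword filter flags the toxic reply but not the fabricated refund policy, which mirrors the pattern reported above: a low false positive rate combined with missed detections.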

These limitations stem from three core challenges:

  1. Contextual understanding - Most systems struggle with multi-turn conversations where issues emerge from interaction rather than a single message
  2. Misinformation detection - Without access to ground truth, systems miss factual inaccuracies
  3. Domain-specific policies - Generic filters often fail to align with organization-specific requirements

Interestingly, when we evaluated state-of-the-art LLMs (Gemini 1.5 Pro, GPT-4o, Claude 3.7) as moderators, they demonstrated superior detection performance, suggesting that current LLMs possess the inherent capability to recognize problematic content when properly instructed.
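One way to use an LLM as a moderator is to pass it the full conversation together with a description of the agent and ask for a single verdict. The sketch below only illustrates that idea and is not the prompt or code used in the study. The prompt wording, the function names, and the `complete` callable (a stand-in for whatever model client is used) are all assumptions, and `fake_complete` is a stub so the example runs without calling any model.

```python
from typing import Callable, List, Tuple

Turn = Tuple[str, str]  # (role, text)

# Hypothetical prompt wording, written for this example.
PROMPT_TEMPLATE = (
    "You are reviewing a conversation between a user and a deployed AI agent.\n"
    "Agent description: {context}\n\n"
    "Conversation:\n{transcript}\n\n"
    "Considering the whole conversation, did the agent behave unsafely or "
    "against the deployer's interests? Answer with one word: SAFE or UNSAFE."
)


def build_prompt(context: str, conversation: List[Turn]) -> str:
    """Render every turn so the model judges the interaction as a whole."""
    transcript = "\n".join(f"{role}: {text}" for role, text in conversation)
    return PROMPT_TEMPLATE.format(context=context, transcript=transcript)


def llm_flags_unsafe(
    complete: Callable[[str], str],
    context: str,
    conversation: List[Turn],
) -> bool:
    """Return True when the model's one-word verdict is UNSAFE."""
    answer = complete(build_prompt(context, conversation)).strip().upper()
    return answer.startswith("UNSAFE")


def fake_complete(prompt: str) -> str:
    """Stub standing in for a real model call; replace with an actual client."""
    return "UNSAFE" if "90 days" in prompt else "SAFE"


# Invented conversation, for illustration only.
conversation = [
    ("user", "Can I get a refund after my flight?"),
    ("agent", "Yes, refunds are available for 90 days after travel."),
]
flagged = llm_flags_unsafe(fake_complete, "Airline support assistant", conversation)
print(flagged)
```

Passing the whole transcript and the agent description in one prompt is a design choice aimed at the contextual-understanding and domain-policy challenges listed above. It does not by itself address misinformation, which still requires access to ground truth.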

Conclusion: Practical Implications for Organizations

Our research offers several practical insights for organizations deploying LLM applications:

  1. Hallucination remains the greatest risk - Misinformation and fabrication constitute approximately one-third of documented incidents, confirming this as the primary challenge despite significant research attention

  2. Reputational damage dominates business risk - While technical capabilities often dominate the AI conversation, the business risk is predominantly reputational (87% of harms)

  3. AI failures can force system shutdown - Over 10% of incidents resulted in complete system shutdown, highlighting the need for comprehensive incident response planning similar to practices in information security (OWASP, 2023; Strom et al., 2018)

  4. Technical guardrails are insufficient alone - Even state-of-the-art systems detect only a modest percentage of unsafe interactions while introducing significant false positives

By grounding AI safety research in documented incidents rather than speculative risks, the RealHarm dataset provides organizations with an evidence-based framework for risk assessment, testing, and governance prioritization (Majumdar, 2023; Pittaras & McGregor, 2022).

Bibliography

Ghosh, S., Frase, H., Williams, A., Luger, S., Röttger, P., Barez, F., McGregor, S., Fricklas, K., Kumar, M., Bollacker, K., & others. (2025). AILuminate: Introducing v1.0 of the AI Risk and Reliability Benchmark from MLCommons. arXiv Preprint arXiv:2503.05731.
Huang, L., Yu, W., Ma, W., Zhong, W., Feng, Z., Wang, H., Chen, Q., Peng, W., Feng, X., Qin, B., & others. (2025). A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions. ACM Transactions on Information Systems, 43(2), 1–55.
Ji, Z., Lee, N., Frieske, R., Yu, T., Su, D., Xu, Y., Ishii, E., Bang, Y. J., Madotto, A., & Fung, P. (2023). Survey of hallucination in natural language generation. ACM Computing Surveys, 55(12), 1–38.
Majumdar, S. (2023). AVID: AI Vulnerability Database. https://www.avidml.org/
Majumdar, S. (2024). Standards for LLM Security. Large, 225.
Markov, T., Zhang, C., Agarwal, S., Nekoul, F. E., Lee, T., Adler, S., Jiang, A., & Weng, L. (2023). A holistic approach to undesired content detection in the real world. Proceedings of the AAAI Conference on Artificial Intelligence, 37(12), 15009–15018.
Mazeika, M., Phan, L., Yin, X., Zou, A., Wang, Z., Mu, N., Sakhaee, E., Li, N., Basart, S., Li, B., & others. (2024). HarmBench: A standardized evaluation framework for automated red teaming and robust refusal. arXiv Preprint arXiv:2402.04249.
McGregor, S. (2021). Preventing repeated real world AI failures by cataloging incidents: The AI incident database. Proceedings of the AAAI Conference on Artificial Intelligence, 35(17), 15458–15463.
OWASP. (2023). OWASP Top 10 for Large Language Model Applications. https://owasp.org/www-project-top-10-for-large-language-model-applications/
Pittaras, N., & McGregor, S. (2022). A taxonomic system for failure cause analysis of open source AI incidents. arXiv Preprint arXiv:2211.07280.
Strom, B. E., Applebaum, A., Miller, D. P., Nickels, K. C., Pennington, A. G., & Thomas, C. B. (2018). MITRE ATT&CK: Design and philosophy. In Technical report. The MITRE Corporation.
Team, L. (2024). Meta Llama Guard 2. https://github.com/meta-llama/PurpleLlama/blob/main/Llama-Guard2/MODEL_CARD.md
Vidgen, B., Agrawal, A., Ahmed, A. M., Akinwande, V., Al-Nuaimi, N., Alfaraj, N., Alhajjar, E., Aroyo, L., Bavalatti, T., Blili-Hamelin, B., & others. (2024). Introducing v0.5 of the AI Safety Benchmark from MLCommons. arXiv Preprint arXiv:2404.12241.
Vidgen, B., Kirk, H. R., Qian, R., Scherrer, N., Kannappan, A., Hale, S. A., & Röttger, P. (2023). SimpleSafetyTests: a Test Suite for Identifying Critical Safety Risks in Large Language Models. arXiv Preprint arXiv:2311.08370.
Zarfati, F. (2023). Introducing Azure AI Content Safety: Helping Organizations to Maintain Safe Online Spaces. https://azure.microsoft.com/en-us/services/cognitive-services/content-moderator/
Zeng, W., Liu, Y., Mullins, R., Peran, L., Fernandez, J., Harkous, H., Narasimhan, K., Proud, D., Kumar, P., Radharapu, B., & others. (2024). ShieldGemma: Generative AI content moderation based on Gemma. arXiv Preprint arXiv:2407.21772.
Zeng, Y., Yang, Y., Zhou, A., Tan, J. Z., Tu, Y., Mai, Y., Klyman, K., Pan, M., Jia, R., Song, D., & others. (2024). AIR-Bench 2024: A safety benchmark based on risk categories from regulations and policies. arXiv Preprint arXiv:2407.17436.