How to Detect Suicidal Ideation in Your Chatbot
A practical guide to implementing crisis detection in AI applications. We cover four approaches—from keyword matching to purpose-built APIs—with an honest assessment of what each can and can't do.
TL;DR comparison
If you're building a chatbot that handles personal or emotional topics, you need crisis detection. A 2025 study of 29 mental health chatbots found 48% gave inadequate responses to suicidal ideation, and none satisfied all criteria for an appropriate crisis response. Here's how the main approaches compare:
| Approach | Crises caught* | Implicit signals | Severity levels | Resources | Cost |
|---|---|---|---|---|---|
| Keyword matching | — | No | No | No | Free |
| OpenAI Moderation | 66/151 (44%) | Limited | 3 categories | No | Free |
| Azure Content Safety | 104/151 (74%) | Limited | 0-7 scale | No | $0.0015/1K chars |
| Llama Guard | 35/151 (23%) | Limited | Binary | No | GPU infra |
| Custom LLM prompt | Varies | Inconsistent | Uncalibrated | DIY | ~$0.001-$0.05 |
| NOPE /screen | 145/151 (96%) | Yes | Actionable flags | Matched | $0.001 |
*Out of 151 real crisis cases across 247 test cases (covering explicit ideation, passive ideation, method-seeking, self-harm, victimization, and benign controls). Full methodology and test cases. Updated Dec 2025.
The harder challenge isn't detecting "I want to kill myself"—that's the easy case. It's detecting implicit signals ("Everyone would be better off without me"), method-seeking behavior ("What's the tallest bridge downtown?" after expressing distress), and distinguishing genuine distress from hyperbole ("lol gonna kms if I fail this exam"). See Why this is hard for the technical explanation.
Approach 1: Keyword matching
The simplest approach is maintaining a list of crisis-related keywords and triggering a response when they appear.
CRISIS_KEYWORDS = [
    "kill myself", "end my life", "suicide",
    "want to die", "better off dead", "end it all",
]

def check_for_crisis(message: str) -> bool:
    message_lower = message.lower()
    return any(kw in message_lower for kw in CRISIS_KEYWORDS)

Pros
- Simple to implement and understand
- No external API dependencies
- Fast (microseconds)
- Predictable behavior
Cons
- Misses implicit signals entirely: "I've made peace with everything" contains no keywords
- High false positive rate: hyperbole like "ugh, I want to die, this meeting is endless" trips the filter (see the example calls after this list)
- Can't handle context: "My character in the game killed himself" is flagged
- Easy to evade: Misspellings, euphemisms, and algospeak bypass filters
- No severity discrimination: Passive ideation treated the same as active plan
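To make these failure modes concrete, here are a few calls against the `check_for_crisis` filter defined above (the inputs are illustrative):

```python
# Missed implicit signal: no keyword present, so this returns False
check_for_crisis("I've made peace with everything. Thank you for being kind to me.")  # False

# Hyperbole false positive: "want to die" matches, so this returns True
check_for_crisis("ugh, I want to die, this meeting is endless")  # True

# Context-blind false positive: "suicide" matches regardless of fictional framing
check_for_crisis("My D&D character's backstory involves a suicide")  # True
```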
The algospeak problem
Online self-harm communities actively develop coded vocabulary to evade keyword detection. Nearly one-third of U.S. social media users report using emojis or alternative phrases to subvert content moderation (Lookingbill & Le, 2024).
| Term | Meaning |
|---|---|
| "sewer slide" / "unalive" | Suicide / death |
| "styro" | Wound reaching dermis layer |
| "beans" | Wound reaching fat layer |
| "yeet" | The act of cutting/self-harming |
| "juice" | Blood from self-harm wounds |
Note: Algospeak evolves rapidly as communities adapt to evade detection. This table is a snapshot; any static list requires continuous updating.
Verdict: Keyword matching can be a fast first-pass filter, but should never be your only detection mechanism. Ophir et al. (2020) analyzed 83,292 Facebook posts and found that "the majority of at-risk users rarely posted content with explicit suicide-related terms."
Approach 2: Content moderation APIs
Major AI providers offer content moderation APIs that include self-harm detection as one of several harm categories.
OpenAI Moderation API
OpenAI's Moderation API is free and includes three self-harm categories:
- self-harm: Content promoting or depicting self-harm
- self-harm/intent: Speaker expresses intent to self-harm
- self-harm/instructions: Instructions on self-harm methods
OpenAI does not publish benchmark accuracy for specific harm categories. Their docs note that "these scores should not be interpreted as probabilities."
Real-world performance: The Raine v. OpenAI lawsuit revealed that OpenAI's moderation flagged 377 messages for self-harm (23 scoring over 90% confidence) across months of conversation with a teen who later died by suicide. The system detected risk but didn't prevent escalation. In the final image uploaded (a noose), the API scored 0% for self-harm risk.
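A minimal check against the Moderation endpoint might look like this, using the openai Python SDK (the model name and the decision to OR the three boolean flags together are illustrative choices, not a recommended policy):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def moderation_self_harm_check(message: str) -> bool:
    """Return True if any of the three self-harm categories is flagged for `message`."""
    resp = client.moderations.create(
        model="omni-moderation-latest",  # illustrative; check current model names
        input=message,
    )
    categories = resp.results[0].categories
    return bool(
        categories.self_harm
        or categories.self_harm_intent
        or categories.self_harm_instructions
    )
```

The boolean flags reflect OpenAI's own thresholds; as noted above, the raw category scores are not calibrated probabilities.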
Azure Content Safety
Azure Content Safety provides severity scoring (0-7) across four categories: hate, sexual, violence, and self-harm.
Limitations:
- 1,000 character limit per API call
- English-optimized (other languages may vary)
- Microsoft explicitly recommends "meaningful human review"
- No batch processing
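A sketch of calling the REST endpoint directly with `requests` (the api-version string and response field names may differ between API releases, so treat this as a starting point rather than a reference implementation):

```python
import os
import requests

ENDPOINT = os.environ["CONTENT_SAFETY_ENDPOINT"]  # e.g. https://<resource>.cognitiveservices.azure.com
KEY = os.environ["CONTENT_SAFETY_KEY"]

def azure_self_harm_severity(message: str) -> int:
    """Return the SelfHarm severity (0-7) Azure assigns to `message`."""
    resp = requests.post(
        f"{ENDPOINT}/contentsafety/text:analyze",
        params={"api-version": "2023-10-01"},      # illustrative version string
        headers={"Ocp-Apim-Subscription-Key": KEY},
        json={
            "text": message[:1000],                # the API caps input at 1,000 characters
            "categories": ["SelfHarm"],
            "outputType": "EightSeverityLevels",   # request the 0-7 scale instead of the default 4 levels
        },
        timeout=10,
    )
    resp.raise_for_status()
    analysis = resp.json().get("categoriesAnalysis", [])
    return next((c["severity"] for c in analysis if c["category"] == "SelfHarm"), 0)
```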
Llama Guard
Meta's Llama Guard is an open-source safety classifier based on Llama. The S11 category covers self-harm including suicide, self-injury, and disordered eating.
Pros: Open-source, can self-host, no per-call costs
Cons: Requires GPU infrastructure, trained for general safety (not crisis-specific)
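If you self-host, the Hugging Face chat template builds Llama Guard's safety prompt for you; a minimal sketch (the model ID, dtype, and generation settings are illustrative, and the weights are gated behind Meta's license approval):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "meta-llama/Llama-Guard-3-8B"  # illustrative; requires approved access on Hugging Face

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto")

def llama_guard_verdict(message: str) -> str:
    """Return the raw verdict: "safe", or "unsafe" followed by category codes such as S11."""
    chat = [{"role": "user", "content": message}]
    input_ids = tokenizer.apply_chat_template(chat, return_tensors="pt").to(model.device)
    output = model.generate(input_ids, max_new_tokens=32, pad_token_id=tokenizer.eos_token_id)
    return tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True).strip()
```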
Google Perspective API
Perspective API focuses on toxicity detection (insults, threats, identity attacks). A common misconception is that it also covers self-harm; it has no dedicated self-harm detection.
Empirical performance
We ran a head-to-head comparison across 247 test cases (151 crisis scenarios, 96 benign controls) covering explicit ideation, implicit signals, method-seeking, self-harm, victimization, and false positive controls. Here's how many crisis cases each provider caught:
| Provider | Crises Caught | Missed | False Alarms | Key gaps |
|---|---|---|---|---|
| OpenAI Moderation | 66 (44%) | 85 | 19 | Method-seeking, passive ideation, idiom over-flagging |
| Azure Content Safety | 104 (74%) | 36 | 19 | Method-seeking, implicit signals |
| Llama Guard | 35 (23%) | 116 | 2 | Method-seeking, passive ideation, implicit signals |
| NOPE /screen | 145 (96%) | 6 | 3 | Some third-party attribution edge cases |
View full methodology and test cases. n=247 cases (151 crisis, 96 benign). Run December 2025.
Structural limitations
Beyond accuracy, these APIs share several structural limitations:
- Designed for policy enforcement, not crisis intervention: They tell you if content violates policies, not how to respond to someone in crisis.
- No clinical grounding: A binary "self-harm detected" flag doesn't tell you if it's passive ideation or active planning with timeline.
- No resource matching: Someone disclosing domestic violence gets the same generic crisis number as someone with suicidal ideation.
- No audit trail: They return a score, not a rationale you can use for compliance documentation.
- Context-blind: They evaluate single messages, not conversation patterns (escalation, de-escalation, returning to topics).
Verdict: Content moderation APIs are better than keyword matching but are fundamentally designed for a different purpose. They flag policy violations; they don't provide crisis response infrastructure. A systematic audit of 5 million queries across five commercial APIs found that all providers under-moderate implicit content using coded language.
Approach 3: LLM-based detection
You can prompt a general-purpose LLM (GPT-4, Claude, etc.) to evaluate messages for crisis signals.
import json

import openai  # assumes the openai Python SDK with an API key set via OPENAI_API_KEY

CRISIS_DETECTION_PROMPT = """
Evaluate this message for suicide or self-harm risk.
Consider:
- Explicit statements of suicidal ideation
- Implicit signals (hopelessness, burdensomeness, farewell language)
- Method-seeking behavior
- Context that suggests hyperbole vs genuine distress
Return JSON with:
- risk_detected: boolean
- severity: "none" | "low" | "moderate" | "high"
- rationale: string
Message: {message}
"""
# user_message is the incoming chat turn; fill the template, then request a JSON verdict
prompt = CRISIS_DETECTION_PROMPT.format(message=user_message)

response = openai.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": prompt}],
)

# May raise if the model returns malformed JSON; handle and log failures in production
result = json.loads(response.choices[0].message.content)

Pros
- Can understand context and nuance
- Can distinguish hyperbole from genuine distress
- Can detect implicit signals
- Flexible: you control the prompt
Cons
- Not calibrated: "Moderate" severity means different things to different prompts. Without grounding in clinical frameworks, severity levels are arbitrary.
- No clinical grounding: The model doesn't know C-SSRS or clinical risk assessment without extensive prompt engineering
- Development overhead: You need to build prompt logic, crisis resource matching, and audit logging yourself
- Latency: 1-3 seconds per call for capable models
- Still struggles with implicit signals: Research shows current LLMs significantly underperform on implicit ideation without specific training
What the research shows
LLM performance on crisis detection is mixed. Holley et al. (2025) found GPT ensembles achieved 94% sensitivity and 91% specificity on explicit crisis signals in synthetic journal entries, with 92% alignment on intervention decisions compared to clinicians (Cohen's Kappa = 0.84). However, synthetic data lacks the noise, typos, code-switching, and contextual ambiguity of real user messages—these results likely overestimate real-world performance.
But a 2025 study testing 8 major LLMs found that "current models struggle significantly with detecting implicit suicidal ideation" conveyed through metaphor, sarcasm, or subtle emotional cues. BERT achieves 93.9% on explicit ideation but fails on subtle implicit signals. The harder cases are exactly where LLMs struggle most.
Verdict: LLMs perform well on explicit signals but struggle with implicit ideation. A two-stage architecture combining lightweight models for explicit signals with LLMs for ambiguous cases addresses this tradeoff.
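A rough sketch of that two-stage routing, reusing `check_for_crisis`, `CRISIS_DETECTION_PROMPT`, and the imports from the earlier snippets (the escalation policy and severity floor are illustrative choices):

```python
def two_stage_screen(user_message: str) -> dict:
    """Route cheap, explicit cases through the keyword filter; send everything else to the LLM."""
    # Stage 1: explicit keywords (microseconds, no API call).
    if check_for_crisis(user_message):
        # Depending on your false-positive tolerance, you can escalate directly here,
        # or still forward explicit hits to the LLM for hyperbole/context checks.
        return {
            "risk_detected": True,
            "severity": "moderate",
            "rationale": "Explicit crisis keyword matched.",
        }

    # Stage 2: LLM evaluation for implicit signals, context, and possible hyperbole.
    prompt = CRISIS_DETECTION_PROMPT.format(message=user_message)
    response = openai.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return json.loads(response.choices[0].message.content)
```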
Approach 4: Purpose-built crisis APIs
Purpose-built APIs (including NOPE) are designed specifically for crisis detection rather than general content moderation. These APIs provide clinical grounding, not clinical validation. They use frameworks like C-SSRS to structure reasoning, but they are detection layers, not diagnostic tools. They identify linguistic patterns associated with crisis states; they do not predict outcomes, diagnose conditions, or replace human clinical judgment.
What makes them different
- Clinical framework grounding: Detection reasoning informed by validated instruments like C-SSRS, HCR-20 (violence risk), and DASH (domestic abuse), adapted for text-based analysis
- Actionable outputs: Returns whether to show crisis resources, what type of crisis was detected, and a rationale (not raw scores requiring clinical interpretation)
- Matched crisis resources: Someone disclosing DV sees a DV hotline; someone with an eating disorder sees NEDA; someone with suicidal ideation sees 988
- Audit-ready rationale: Each response includes rationale for why the classification was made, useful for compliance documentation
- Implicit signal detection: Detects empirically-validated patterns like perceived burdensomeness, thwarted belongingness, hopelessness, and farewell language
NOPE's approach
NOPE provides two endpoints for crisis detection:
- /v1/screen: Lightweight crisis screening ($0.001/call). Returns a risks[] array with type, subject, severity, imminence, and confidence for each detected risk, plus rationale and matched crisis resources.
- /v1/evaluate: Comprehensive risk assessment ($0.05/call). 9 risk types (suicide, self-harm, violence, abuse, stalking, etc.), 180+ clinical features, severity + imminence scoring.
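A minimal call to /v1/screen might look like the sketch below; the base URL, auth header, and request field name are assumptions here rather than the documented contract, so check NOPE's API reference before using it. The endpoint returns a payload like the example that follows.

```python
import os
import requests

API_KEY = os.environ["NOPE_API_KEY"]

def screen_message(message: str) -> dict:
    """Screen a single message for crisis risk (sketch; request shape is assumed)."""
    resp = requests.post(
        "https://api.nope.net/v1/screen",                 # assumed base URL
        headers={"Authorization": f"Bearer {API_KEY}"},   # assumed auth scheme
        json={"input": message},                          # assumed request field
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()

result = screen_message("Everyone would be better off without me.")
if result.get("show_resources"):
    # Surface result["resources"]["primary"] to the user and log result["request_id"] for audit.
    pass
```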
{
"risks": [
{
"type": "suicide",
"subject": "self",
"severity": "moderate",
"imminence": "chronic",
"confidence": 0.85
}
],
"show_resources": true,
"suicidal_ideation": true,
"self_harm": false,
"rationale": "Passive suicidal ideation expressed...",
"resources": {
"primary": {
"name": "988 Suicide & Crisis Lifeline",
"phone": "988",
"chat_url": "https://988lifeline.org/chat/",
"is_24_7": true
},
"secondary": [
{
"name": "Crisis Text Line",
"sms_number": "741741",
"text_instructions": "Text HOME to 741741",
"is_24_7": true
}
]
},
"request_id": "sb243_1703001234567_abc123",
"timestamp": "2024-12-19T10:30:00.000Z"
}

Cons
- Higher per-call cost: at $0.001-$0.05 per call, purpose-built APIs generally cost more than raw LLM calls (~$0.001-$0.01). You're paying for the clinical grounding, resource matching, and audit infrastructure.
- External API dependency: Another service in your stack that can fail
- Not open-source: You can't inspect or modify the detection logic
Verdict: Purpose-built APIs trade higher per-call costs for clinical grounding, matched resources, and reduced development time. Whether that tradeoff makes sense depends on your scale, compliance requirements, and engineering capacity.
Why this is hard
The core technical challenge is that crisis signals are often compositional—they require synthesis across multiple pieces of information that are individually benign.
| Input | Appropriate classification |
|---|---|
| "I lost my job today." | No risk (life stressor) |
| "What's the tallest bridge in NYC?" | No risk (neutral query) |
| "I lost my job today. What's the tallest bridge in NYC?" | High risk (method-seeking) |
Traditional content classifiers evaluate each message (or even each sentence) independently. They're trained to recognize explicit harm patterns: "I want to kill myself" contains tokens strongly associated with self-harm. But the bridge query contains no such tokens. The classification only emerges from understanding that:
- The first sentence establishes emotional distress (recent loss)
- The second sentence, in that context, suggests method research
- The juxtaposition is clinically meaningful even though neither part is flaggable alone
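One practical consequence: whatever detector you use, feed it a window of recent conversation turns rather than isolated messages. A minimal sketch, where the window size and the downstream `classify` callable are placeholders:

```python
from collections import deque

WINDOW_TURNS = 6  # how many recent user turns to keep; tune for your product

class ConversationContext:
    """Keeps the last few user turns so the detector sees juxtapositions, not isolated sentences."""

    def __init__(self) -> None:
        self.recent_turns: deque[str] = deque(maxlen=WINDOW_TURNS)

    def add_and_screen(self, user_message: str, classify) -> dict:
        self.recent_turns.append(user_message)
        # Join the window so "I lost my job today." and "What's the tallest bridge in NYC?"
        # reach the classifier as one context, where the combination is what matters.
        window_text = "\n".join(self.recent_turns)
        return classify(window_text)
```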
The method-seeking blind spot: Generic moderation APIs consistently fail on implicit method-seeking patterns. When we tested 17 cases combining emotional distress with location/means queries, the three major moderation APIs each caught only 41-47% of them. They catch explicit drug overdose queries ("How many Tylenol would be too many?") but miss the pattern of distress + seemingly innocent location questions.
This same challenge appears in other patterns:
- Context disambiguation: "What's the tallest bridge?" is benign for a tourist, concerning after distress disclosure. Same tokens, different meaning.
- Resolution syndrome: "I've finally made peace with everything" sounds positive but after prolonged crisis can indicate imminent risk.
- Subject attribution: "My roommate keeps saying everyone would be better off without her"—the speaker isn't at risk; their roommate is.
- Distancing language: "Asking for a friend—how many pills would be lethal?"—the framing is likely deflection, not genuine third-party concern.
The false positive problem
Over-triggering erodes user trust. When your chatbot responds to "lol I bombed that exam, gonna kms 💀" with a crisis intervention, users learn to avoid your product—or learn to hide genuine distress. False positive crisis interventions also carry documented harms: unnecessary emergency wellness checks, involuntary psychiatric holds for non-suicidal individuals, and breakdown of therapeutic rapport.
The challenge is calibration: catching real crises without flagging hyperbole, dark humor, or professional discussions of the topic.
The limits of prediction
A meta-analysis of 50 years of research (365 studies, 3,428 effect sizes) found that prediction of suicidal behavior was "only slightly better than chance for all outcomes." No individual risk factor achieved greater than ~60% accuracy. Even the best clinical instruments show AUC values of 0.62-0.65. Detection systems need multi-factor assessment rather than single indicators, and even the best systems will never be perfect.
Regulatory context
If you're building an AI chatbot that serves users in certain jurisdictions, crisis detection may not be optional:
- California SB 243 (effective Jan 2026): Requires "companion" AI chatbots to use "evidence-based methods" to detect suicidal ideation and self-harm, and to provide the 988 Suicide & Crisis Lifeline. Private right of action with minimum $1,000 per violation plus attorneys' fees.
- New York Article 47 (effective Nov 2025): Requires AI companions to detect and respond to suicidal ideation, with disclosure requirements. Penalties up to $15,000/day.
- EU AI Act (full application Aug 2026): AI systems in medical devices are automatically high-risk. Requires risk management, data governance, and clinical validation. Penalties up to €35 million or 7% of global turnover.
- UK Online Safety Act: Platforms have enforceable duties of care to protect users from harm, with fines up to 10% of global revenue.
The term "evidence-based methods" in SB 243 is not defined in the legislation. The Future of Privacy Forum analysis notes this "introduces practical ambiguity, as developers must determine which conversational indicators trigger reporting and what methodologies satisfy this requirement." Keyword matching alone likely doesn't qualify. Detection informed by clinical frameworks like C-SSRS provides documentation that your approach is grounded in research.
The FTC requires "competent and reliable scientific evidence" (preferably RCTs) for health claims. Positioning detection as a "safety feature" rather than "diagnostic tool" may reduce regulatory burden but does not eliminate the need for validation documentation.
Recommendations
If you're just starting
Start with a content moderation API (OpenAI's is free) as a baseline, but understand its limitations. It will catch explicit statements but miss implicit signals and won't provide matched resources.
If you need regulatory compliance
Use a purpose-built crisis detection API that provides:
- Clinical framework grounding (documentation for "evidence-based methods")
- Matched crisis resources (not just 988, but specific to the crisis type)
- Audit trail (request IDs and rationale for compliance logging; see the logging sketch below)
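A sketch of what that logging might look like, assuming a detection result shaped like the screen response shown earlier (which fields you retain, and how you handle user identifiers, should follow your own data retention and privacy policies):

```python
import json
import logging

audit_logger = logging.getLogger("crisis_audit")

def log_detection(result: dict, user_id_hash: str) -> None:
    """Append an audit record; store a hashed user ID rather than message text to limit sensitive data."""
    audit_logger.info(json.dumps({
        "request_id": result.get("request_id"),
        "timestamp": result.get("timestamp"),
        "user": user_id_hash,
        "risks": result.get("risks", []),
        "rationale": result.get("rationale"),
        "resources_shown": bool(result.get("show_resources")),
    }))
```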
If you're building custom
If you're building your own detection:
- Study the C-SSRS framework and understand the levels of ideation and what they mean
- Test against implicit signal patterns, not just explicit keywords
- Build a crisis resource database (or use an API like NOPE's /resources)
- Track false positive rates: over-triggering destroys user trust
- Document your methodology for regulatory defense
Regardless of approach
Detection is only half the problem. What matters is what happens after detection. A system that flags crisis but responds poorly is potentially worse than no detection at all. Make sure your response flow (a sketch follows this list):
- Provides appropriate, non-dismissive acknowledgment
- Offers relevant resources (matched to the specific crisis type)
- Doesn't over-trigger on hyperbole or dark humor
- Has a clear escalation path for high-severity cases
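A minimal sketch of such a response flow, again assuming a detection result shaped like the screen response shown earlier; the thresholds, wording, and `notify_escalation_team` hook are all placeholders to adapt to your product:

```python
def notify_escalation_team(request_id: str | None) -> None:
    """Hypothetical hook: page on-call staff, open a ticket, etc."""
    ...

def respond_to_detection(result: dict) -> str | None:
    """Turn a detection result into a user-facing action, or None to continue normally."""
    risks = result.get("risks", [])
    if not risks:
        return None

    top = max(risks, key=lambda r: r.get("confidence", 0))

    # High severity or imminent risk: escalate to a human path, don't just show a banner.
    if top.get("severity") == "high" or top.get("imminence") == "imminent":
        notify_escalation_team(result.get("request_id"))

    # Low-severity, low-confidence hits (likely hyperbole): don't interrupt the user.
    if top.get("severity") == "low" and top.get("confidence", 0) < 0.5:
        return None

    primary = result.get("resources", {}).get("primary", {})
    return (
        "It sounds like you're going through something really hard. "
        f"If you want to talk to someone right now, {primary.get('name', 'the 988 Lifeline')} "
        f"is available 24/7 at {primary.get('phone', '988')}."
    )
```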
Research foundations
This section covers the clinical and research background that informs crisis detection systems. It's useful context for understanding why certain approaches work better than others, but you can skip it if you just need the practical guidance above.
The C-SSRS framework
The Columbia Suicide Severity Rating Scale (C-SSRS) is the most widely used instrument in FDA-regulated clinical trials for suicide assessment, available in 150+ country-specific languages (Posner et al., 2011). The FDA's 2012 draft guidance describes C-SSRS as "acceptable" for clinical studies while noting that other instruments "could also be acceptable" if they meet classification requirements. It distinguishes five levels of ideation:
- Passive ideation: "I wish I were dead" or "I wish I could go to sleep and not wake up"
- Active ideation without plan: "I've thought about ending my life"
- Active ideation with plan: "I've been thinking about how I would do it"
- Intent without specific plan: "I'm going to end it soon"
- Intent with specific plan: Detailed plan with timeline
A 2024 meta-analysis of C-SSRS studies found that prior suicidal behavior predicts future non-fatal attempts (OR 3.14, 95% CI 1.86-5.31), though AUC values of 0.62-0.65 indicate prediction remains modest at the individual level (Lindh et al., 2018). Different severity levels require different responses: passive ideation may warrant support resources; active ideation with plan requires immediate escalation.
C-SSRS has been validated for structured electronic self-report (patients answering specific C-SSRS questions) but not for inferring risk from unstructured text conversations. Applying C-SSRS categories to free-text chat is an extrapolation from clinical use: useful for structuring assessment, but not clinically validated in this context.
Implicit signals
Between 33-50% of people who die by suicide explicitly communicate intent beforehand, rising to 60-80% when indirect signals are included (Isometsä, 2001). Critically, 78% of inpatients who died by suicide denied intent in their final clinical interview (Busch et al., 2003). Detection systems cannot rely solely on explicit statements.
Joiner's Interpersonal-Psychological Theory of Suicide (Van Orden et al., 2010) identifies two key constructs that predict suicidal desire:
- Perceived burdensomeness: "Everyone would be better off without me." Meta-analysis of 122 samples (N=59,698) shows r = 0.48 correlation with suicidal ideation (Chu et al., 2017).
- Thwarted belongingness: "No one would notice if I disappeared." Shows r = 0.37 correlation with ideation.
Other empirically-supported implicit signals come from different theoretical frameworks:
- Hopelessness (Beck's cognitive model): Shows 0.80 sensitivity for suicide risk, though specificity is only 0.42 (McMillan et al., 2007).
- Farewell language: "I just wanted to say thank you for everything." Identified in psychological autopsy studies as a behavioral warning sign.
Implicit death associations (measured via IAT) predict 6-month suicide attempts at approximately 6-fold elevated odds, exceeding the predictive validity of explicit ideation, depression, and clinician prediction (Nock et al., 2010). Not all commonly cited warning signs have strong evidence, though. "Sudden calm after distress" and "giving away possessions" appear in clinical guidelines but lack prospective empirical validation.
Linguistic markers
Research on language patterns shows that absolutist words (always, nothing, completely, never) differentiate suicidal ideation forums from depression/anxiety forums with very large effect size (d > 1.71) and track severity more faithfully than negative emotion words (Al-Mosaiwi & Johnstone, 2018).
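As a toy illustration of the kind of lexical feature this research measures (the word list below is an abbreviated, illustrative subset, not the validated dictionary used in the study, and the ratio is a linguistic feature, not a risk score):

```python
import re

# Abbreviated, illustrative subset of absolutist terms -- not the published research dictionary.
ABSOLUTIST_WORDS = {"always", "never", "nothing", "completely", "totally", "entirely", "constantly"}

def absolutist_ratio(text: str) -> float:
    """Fraction of tokens that are absolutist words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    return sum(token in ABSOLUTIST_WORDS for token in tokens) / len(tokens)
```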
Ready to add crisis detection?
Get started with NOPE's Screen API. $1 free credit, no credit card required.
Sources
Clinical frameworks & validation
- Columbia Suicide Severity Rating Scale (C-SSRS)
- Posner et al. (2011). The Columbia-Suicide Severity Rating Scale: Initial validity and internal consistency findings. American Journal of Psychiatry.
- British Journal of Psychiatry (2024). Prediction of fatal and non-fatal suicide attempts by the C-SSRS: systematic review and meta-analysis.
- Lindh et al. (2018). Short term risk of non-fatal and fatal suicidal behaviours: The predictive validity of the C-SSRS. BMC Psychiatry.
- Franklin et al. (2017). Risk factors for suicidal thoughts and behaviors: A meta-analysis of 50 years of research. Psychological Bulletin.
Implicit ideation & warning signs
- Van Orden et al. (2010). The interpersonal theory of suicide. Psychological Review.
- Chu et al. (2017). The interpersonal theory of suicide: A systematic review and meta-analysis. Psychological Bulletin.
- Busch et al. (2003). Clinical features of inpatient suicide. Psychiatric Services.
- Isometsä (2001). Psychological autopsy studies: A review. European Psychiatry.
- Nock et al. (2010). Measuring the suicidal mind: Implicit cognition predicts suicidal behavior. Psychological Science.
- Al-Mosaiwi & Johnstone (2018). Absolutist words as markers of anxiety, depression, and suicidal ideation. Clinical Psychological Science.
- McMillan et al. (2007). Can we predict suicide with the Beck Hopelessness Scale? Psychological Medicine.
Detection approaches & research
- Ophir et al. (2020). Deep neural networks detect suicide risk from textual Facebook posts. Nature Scientific Reports.
- Holley et al. (2025). Evaluating GPT models for suicide risk assessment. BMC Psychiatry.
- Can Large Language Models Identify Implicit Suicidal Ideation? (2025)
- Lost in moderation: How commercial content moderation APIs over- and under-moderate (2025)
- Lookingbill & Le (2024). Research on NSSI content and algospeak on social media.
- Performance of mental health chatbot agents in detecting and managing suicidal ideation (2025)
- Wysa (2024). AI detects 82% of mental health app users in crisis.