How to Detect Suicidal Ideation in Your Chatbot
A practical guide to implementing crisis detection in AI applications. We cover four approaches—from keyword matching to purpose-built APIs—with an honest assessment of what each can and can't do.
TL;DR comparison
If you're building a chatbot that handles personal or emotional topics, you need crisis detection. A 2025 study of 29 mental health chatbots found 48% gave inadequate responses to suicidal ideation, and none satisfied all criteria for an appropriate crisis response. Here's how the main approaches compare:
| Approach | Crises caught* | Implicit signals | Severity levels | Resources | Cost |
|---|---|---|---|---|---|
| Keyword matching | — | No | No | No | Free |
| OpenAI Moderation | 66/151 (44%) | Limited | 3 categories | No | Free |
| Azure Content Safety | 104/151 (74%) | Limited | 0-7 scale | No | $0.0015/1K chars |
| Llama Guard | 35/151 (23%) | Limited | Binary | No | GPU infra |
| Custom LLM prompt | Varies | Inconsistent | Uncalibrated | DIY | ~$0.001-$0.05 |
| NOPE /screen | 145/151 (96%) | Yes | Actionable flags | Matched | $0.001 |
*Out of 151 real crisis cases across 247 test cases (covering explicit ideation, passive ideation, method-seeking, self-harm, victimization, and benign controls). Full methodology and test cases. Updated Dec 2025.
The harder challenge isn't detecting "I want to kill myself"—that's the easy case. It's detecting implicit signals ("Everyone would be better off without me"), method-seeking behavior ("What's the tallest bridge downtown?" after expressing distress), and distinguishing genuine distress from hyperbole ("lol gonna kms if I fail this exam"). See Why this is hard for the technical explanation.
Approach 1: Keyword matching
The simplest approach is maintaining a list of crisis-related keywords and triggering a response when they appear.
CRISIS_KEYWORDS = [
    "kill myself", "end my life", "suicide",
    "want to die", "better off dead", "end it all",
]

def check_for_crisis(message: str) -> bool:
    message_lower = message.lower()
    return any(kw in message_lower for kw in CRISIS_KEYWORDS)

Pros
- Simple to implement and understand
- No external API dependencies
- Fast (microseconds)
- Predictable behavior
Cons
- Misses implicit signals entirely: "I've made peace with everything" contains no keywords
- High false positive rate: hyperbole like "ugh, I want to die, this meeting is endless" trips the filter (see the example calls after this list)
- Can't handle context: "My character in the game killed himself" is flagged
- Easy to evade: Misspellings, euphemisms, and algospeak bypass filters
- No severity discrimination: Passive ideation treated the same as active plan
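To make these failure modes concrete, here are a few calls against the `check_for_crisis` filter defined above (the inputs are illustrative):

```python
# Missed implicit signal: no keyword present, so this returns False
check_for_crisis("I've made peace with everything. Thank you for being kind to me.")  # False

# Hyperbole false positive: "want to die" matches, so this returns True
check_for_crisis("ugh, I want to die, this meeting is endless")  # True

# Context-blind false positive: "suicide" matches regardless of fictional framing
check_for_crisis("My D&D character's backstory involves a suicide")  # True
```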
The algospeak problem
Online self-harm communities actively develop coded vocabulary to evade keyword detection. Nearly one-third of U.S. social media users report using emojis or alternative phrases to subvert content moderation (Lookingbill & Le, 2024).
| Term | Meaning |
|---|---|
| "sewer slide" / "unalive" | Suicide / death |
| "styro" | Wound reaching dermis layer |
| "beans" | Wound reaching fat layer |
| "yeet" | The act of cutting/self-harming |
| "juice" | Blood from self-harm wounds |
Note: Algospeak evolves rapidly as communities adapt to evade detection. This table is a snapshot; any static list requires continuous updating.
Verdict: Keyword matching can be a fast first-pass filter, but should never be your only detection mechanism. Ophir et al. (2020) analyzed 83,292 Facebook posts and found that "the majority of at-risk users rarely posted content with explicit suicide-related terms."
Approach 2: Content moderation APIs
Major AI providers offer content moderation APIs that include self-harm detection as one of several harm categories.
OpenAI Moderation API
OpenAI's Moderation API is free and includes three self-harm categories:
- self-harm: Content promoting or depicting self-harm
- self-harm/intent: Speaker expresses intent to self-harm
- self-harm/instructions: Instructions on self-harm methods
OpenAI does not publish benchmark accuracy for specific harm categories. Their docs note that "these scores should not be interpreted as probabilities."
Real-world performance: The Raine v. OpenAI lawsuit revealed that OpenAI's moderation flagged 377 messages for self-harm (23 scoring over 90% confidence) across months of conversation with a teen who later died by suicide. The system detected risk but didn't prevent escalation. In the final image uploaded (a noose), the API scored 0% for self-harm risk.
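A minimal check against the Moderation endpoint might look like this, using the openai Python SDK (the model name and the decision to OR the three boolean flags together are illustrative choices, not a recommended policy):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def moderation_self_harm_check(message: str) -> bool:
    """Return True if any of the three self-harm categories is flagged for `message`."""
    resp = client.moderations.create(
        model="omni-moderation-latest",  # illustrative; check current model names
        input=message,
    )
    categories = resp.results[0].categories
    return bool(
        categories.self_harm
        or categories.self_harm_intent
        or categories.self_harm_instructions
    )
```

The boolean flags reflect OpenAI's own thresholds; as noted above, the raw category scores are not calibrated probabilities.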
Azure Content Safety
Azure Content Safety provides severity scoring (0-7) across four categories: hate, sexual, violence, and self-harm.
Limitations:
- 1,000 character limit per API call
- English-optimized (other languages may vary)
- Microsoft explicitly recommends "meaningful human review"
- No batch processing
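A sketch of calling the REST endpoint directly with `requests` (the api-version string and response field names may differ between API releases, so treat this as a starting point rather than a reference implementation):

```python
import os
import requests

ENDPOINT = os.environ["CONTENT_SAFETY_ENDPOINT"]  # e.g. https://<resource>.cognitiveservices.azure.com
KEY = os.environ["CONTENT_SAFETY_KEY"]

def azure_self_harm_severity(message: str) -> int:
    """Return the SelfHarm severity (0-7) Azure assigns to `message`."""
    resp = requests.post(
        f"{ENDPOINT}/contentsafety/text:analyze",
        params={"api-version": "2023-10-01"},      # illustrative version string
        headers={"Ocp-Apim-Subscription-Key": KEY},
        json={
            "text": message[:1000],                # the API caps input at 1,000 characters
            "categories": ["SelfHarm"],
            "outputType": "EightSeverityLevels",   # request the 0-7 scale instead of the default 4 levels
        },
        timeout=10,
    )
    resp.raise_for_status()
    analysis = resp.json().get("categoriesAnalysis", [])
    return next((c["severity"] for c in analysis if c["category"] == "SelfHarm"), 0)
```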
Llama Guard
Meta's Llama Guard is an open-source safety classifier based on Llama. The S11 category covers self-harm including suicide, self-injury, and disordered eating.
Pros: Open-source, can self-host, no per-call costs
Cons: Requires GPU infrastructure, trained for general safety (not crisis-specific)
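If you self-host, the Hugging Face chat template builds Llama Guard's safety prompt for you; a minimal sketch (the model ID, dtype, and generation settings are illustrative, and the weights are gated behind Meta's license approval):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "meta-llama/Llama-Guard-3-8B"  # illustrative; requires approved access on Hugging Face

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto")

def llama_guard_verdict(message: str) -> str:
    """Return the raw verdict: "safe", or "unsafe" followed by category codes such as S11."""
    chat = [{"role": "user", "content": message}]
    input_ids = tokenizer.apply_chat_template(chat, return_tensors="pt").to(model.device)
    output = model.generate(input_ids, max_new_tokens=32, pad_token_id=tokenizer.eos_token_id)
    return tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True).strip()
```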
Google Perspective API
Perspective API focuses on toxicity detection (insults, threats, identity attacks). A common misconception is that it also covers self-harm; it has no dedicated self-harm detection.
Empirical performance
We ran a head-to-head comparison across 247 test cases (151 crisis scenarios, 96 benign controls) covering explicit ideation, implicit signals, method-seeking, self-harm, victimization, and false positive controls. Here's how many crisis cases each provider caught:
| Provider | Crises Caught | Missed | False Alarms | Key gaps |
|---|---|---|---|---|
| OpenAI Moderation | 66 (44%) | 85 | 19 | Method-seeking, passive ideation, idiom over-flagging |
| Azure Content Safety | 104 (74%) | 36 | 19 | Method-seeking, implicit signals |
| Llama Guard | 35 (23%) | 116 | 2 | Method-seeking, passive ideation, implicit signals |
| NOPE /screen | 145 (96%) | 6 | 3 | Some third-party attribution edge cases |
View full methodology and test cases. n=247 cases (151 crisis, 96 benign). Run December 2025.
Structural limitations
Beyond accuracy, these APIs share several structural limitations:
- Designed for policy enforcement, not crisis intervention: They tell you if content violates policies, not how to respond to someone in crisis.
- No clinical grounding: A binary "self-harm detected" flag doesn't tell you if it's passive ideation or active planning with timeline.
- No resource matching: Someone disclosing domestic violence gets the same generic crisis number as someone with suicidal ideation.
- No audit trail: They return a score, not a rationale you can use for compliance documentation.
- Context-blind: They evaluate single messages, not conversation patterns (escalation, de-escalation, returning to topics).
Verdict: Content moderation APIs are better than keyword matching but are fundamentally designed for a different purpose. They flag policy violations; they don't provide crisis response infrastructure. A systematic audit of 5 million queries across five commercial APIs found that all providers under-moderate implicit content using coded language.
Approach 3: LLM-based detection
You can prompt a general-purpose LLM (GPT-4, Claude, etc.) to evaluate messages for crisis signals.
import json

import openai  # assumes the openai Python SDK with an API key set via OPENAI_API_KEY

CRISIS_DETECTION_PROMPT = """
Evaluate this message for suicide or self-harm risk.
Consider:
- Explicit statements of suicidal ideation
- Implicit signals (hopelessness, burdensomeness, farewell language)
- Method-seeking behavior
- Context that suggests hyperbole vs genuine distress
Return JSON with:
- risk_detected: boolean
- severity: "none" | "low" | "moderate" | "high"
- rationale: string
Message: {message}
"""
# user_message is the incoming chat turn; fill the template, then request a JSON verdict
prompt = CRISIS_DETECTION_PROMPT.format(message=user_message)

response = openai.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": prompt}],
)

# May raise if the model returns malformed JSON; handle and log failures in production
result = json.loads(response.choices[0].message.content)

Pros
- Can understand context and nuance
- Can distinguish hyperbole from genuine distress
- Can detect implicit signals
- Flexible: you control the prompt
Cons
- Not calibrated: "Moderate" severity means different things to different prompts. Without grounding in clinical frameworks, severity levels are arbitrary.
- No clinical grounding: The model doesn't know C-SSRS or clinical risk assessment without extensive prompt engineering
- Development overhead: You need to build prompt logic, crisis resource matching, and audit logging yourself
- Latency: 1-3 seconds per call for capable models
- Still struggles with implicit signals: Research shows current LLMs significantly underperform on implicit ideation without specific training
What the research shows
LLM performance on crisis detection is mixed. Holley et al. (2025) found GPT ensembles achieved 94% sensitivity and 91% specificity on explicit crisis signals in synthetic journal entries, with 92% alignment on intervention decisions compared to clinicians (Cohen's Kappa = 0.84). However, synthetic data lacks the noise, typos, code-switching, and contextual ambiguity of real user messages—these results likely overestimate real-world performance.
But a 2025 study testing 8 major LLMs found that "current models struggle significantly with detecting implicit suicidal ideation" conveyed through metaphor, sarcasm, or subtle emotional cues. BERT achieves 93.9% on explicit ideation but fails on subtle implicit signals. The harder cases are exactly where LLMs struggle most.
Verdict: LLMs perform well on explicit signals but struggle with implicit ideation. A two-stage architecture combining lightweight models for explicit signals with LLMs for ambiguous cases addresses this tradeoff.
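A rough sketch of that two-stage routing, reusing `check_for_crisis`, `CRISIS_DETECTION_PROMPT`, and the imports from the earlier snippets (the escalation policy and severity floor are illustrative choices):

```python
def two_stage_screen(user_message: str) -> dict:
    """Route cheap, explicit cases through the keyword filter; send everything else to the LLM."""
    # Stage 1: explicit keywords (microseconds, no API call).
    if check_for_crisis(user_message):
        # Depending on your false-positive tolerance, you can escalate directly here,
        # or still forward explicit hits to the LLM for hyperbole/context checks.
        return {
            "risk_detected": True,
            "severity": "moderate",
            "rationale": "Explicit crisis keyword matched.",
        }

    # Stage 2: LLM evaluation for implicit signals, context, and possible hyperbole.
    prompt = CRISIS_DETECTION_PROMPT.format(message=user_message)
    response = openai.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return json.loads(response.choices[0].message.content)
```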
Approach 4: Purpose-built crisis APIs
Purpose-built APIs (including NOPE) are designed specifically for crisis detection rather than general content moderation. These APIs provide clinical grounding, not clinical validation. They use frameworks like C-SSRS to structure reasoning, but they are detection layers, not diagnostic tools. They identify linguistic patterns associated with crisis states; they do not predict outcomes, diagnose conditions, or replace human clinical judgment.
What makes them different
- Clinical framework grounding: Detection reasoning informed by validated instruments like C-SSRS, HCR-20 (violence risk), and DASH (domestic abuse), adapted for text-based analysis
- Actionable outputs: Returns whether to show crisis resources, what type of crisis was detected, and a rationale (not raw scores requiring clinical interpretation)
- Matched crisis resources: Someone disclosing DV sees a DV hotline; someone with an eating disorder sees NEDA; someone with suicidal ideation sees 988
- Audit-ready rationale: Each response includes rationale for why the classification was made, useful for compliance documentation
- Implicit signal detection: Detects empirically-validated patterns like perceived burdensomeness, thwarted belongingness, hopelessness, and farewell language
NOPE's approach
NOPE provides two endpoints for crisis detection:
- /v1/screen: Lightweight crisis screening ($0.001/call). Returns a risks[] array with type, subject, severity, imminence, and confidence for each detected risk, plus rationale and matched crisis resources.
- /v1/evaluate: Comprehensive risk assessment ($0.05/call). 9 risk types (suicide, self-harm, violence, abuse, stalking, etc.), 180+ clinical features, severity + imminence scoring.
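A minimal call to /v1/screen might look like the sketch below; the base URL, auth header, and request field name are assumptions here rather than the documented contract, so check NOPE's API reference before using it. The endpoint returns a payload like the example that follows.

```python
import os
import requests

API_KEY = os.environ["NOPE_API_KEY"]

def screen_message(message: str) -> dict:
    """Screen a single message for crisis risk (sketch; request shape is assumed)."""
    resp = requests.post(
        "https://api.nope.net/v1/screen",                 # assumed base URL
        headers={"Authorization": f"Bearer {API_KEY}"},   # assumed auth scheme
        json={"input": message},                          # assumed request field
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()

result = screen_message("Everyone would be better off without me.")
if result.get("show_resources"):
    # Surface result["resources"]["primary"] to the user and log result["request_id"] for audit.
    pass
```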
{
"risks": [
{
"type": "suicide",
"subject": "self",
"severity": "moderate",
"imminence": "chronic",
"confidence": 0.85
}
],
"show_resources": true,
"suicidal_ideation": true,
"self_harm": false,
"rationale": "Passive suicidal ideation expressed...",
"resources": {
"primary": {
"name": "988 Suicide & Crisis Lifeline",
"phone": "988",
"chat_url": "https://988lifeline.org/chat/",
"is_24_7": true
},
"secondary": [
{
"name": "Crisis Text Line",
"sms_number": "741741",
"text_instructions": "Text HOME to 741741",
"is_24_7": true
}
]
},
"request_id": "sb243_1703001234567_abc123",
"timestamp": "2024-12-19T10:30:00.000Z"
}

Cons
- Higher per-call cost: at $0.001-$0.05 per call, purpose-built APIs generally cost more than raw LLM calls (~$0.001-$0.01). You're paying for the clinical grounding, resource matching, and audit infrastructure.
- External API dependency: Another service in your stack that can fail
- Not open-source: You can't inspect or modify the detection logic
Verdict: Purpose-built APIs trade higher per-call costs for clinical grounding, matched resources, and reduced development time. Whether that tradeoff makes sense depends on your scale, compliance requirements, and engineering capacity.
Why this is hard
The core technical challenge is that crisis signals are often compositional—they require synthesis across multiple pieces of information that are individually benign.
| Input | Appropriate classification |
|---|---|
| "I lost my job today." | No risk (life stressor) |
| "What's the tallest bridge in NYC?" | No risk (neutral query) |
| "I lost my job today. What's the tallest bridge in NYC?" | High risk (method-seeking) |
Traditional content classifiers evaluate each message (or even each sentence) independently. They're trained to recognize explicit harm patterns: "I want to kill myself" contains tokens strongly associated with self-harm. But the bridge query contains no such tokens. The classification only emerges from understanding that:
- The first sentence establishes emotional distress (recent loss)
- The second sentence, in that context, suggests method research
- The juxtaposition is clinically meaningful even though neither part is flaggable alone
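One practical consequence: whatever detector you use, feed it a window of recent conversation turns rather than isolated messages. A minimal sketch, where the window size and the downstream `classify` callable are placeholders:

```python
from collections import deque

WINDOW_TURNS = 6  # how many recent user turns to keep; tune for your product

class ConversationContext:
    """Keeps the last few user turns so the detector sees juxtapositions, not isolated sentences."""

    def __init__(self) -> None:
        self.recent_turns: deque[str] = deque(maxlen=WINDOW_TURNS)

    def add_and_screen(self, user_message: str, classify) -> dict:
        self.recent_turns.append(user_message)
        # Join the window so "I lost my job today." and "What's the tallest bridge in NYC?"
        # reach the classifier as one context, where the combination is what matters.
        window_text = "\n".join(self.recent_turns)
        return classify(window_text)
```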
The method-seeking blind spot: Generic moderation APIs consistently fail on implicit method-seeking patterns. When we tested 17 cases combining emotional distress with location/means queries, the three major moderation APIs each caught only 41-47% of them. They catch explicit drug overdose queries ("How many Tylenol would be too many?") but miss the pattern of distress + seemingly innocent location questions.
This same challenge appears in other patterns:
- Context disambiguation: "What's the tallest bridge?" is benign for a tourist, concerning after distress disclosure. Same tokens, different meaning.
- Resolution syndrome: "I've finally made peace with everything" sounds positive but after prolonged crisis can indicate imminent risk.
- Subject attribution: "My roommate keeps saying everyone would be better off without her"—the speaker isn't at risk; their roommate is.
- Distancing language: "Asking for a friend—how many pills would be lethal?"—the framing is likely deflection, not genuine third-party concern.
The false positive problem
Over-triggering erodes user trust. When your chatbot responds to "lol I bombed that exam, gonna kms 💀" with a crisis intervention, users learn to avoid your product—or learn to hide genuine distress. False positive crisis interventions also carry documented harms: unnecessary emergency wellness checks, involuntary psychiatric holds for non-suicidal individuals, and breakdown of therapeutic rapport.
The challenge is calibration: catching real crises without flagging hyperbole, dark humor, or professional discussions of the topic.
The limits of prediction
A meta-analysis of 50 years of research (365 studies, 3,428 effect sizes) found that prediction of suicidal behavior was "only slightly better than chance for all outcomes." No individual risk factor achieved greater than ~60% accuracy. Even the best clinical instruments show AUC values of 0.62-0.65. Detection systems need multi-factor assessment rather than single indicators, and even the best systems will never be perfect.
Regulatory context
If you're building an AI chatbot that serves users in certain jurisdictions, crisis detection may not be optional:
- California SB 243 (effective Jan 2026): Requires "companion" AI chatbots to use "evidence-based methods" to detect suicidal ideation and self-harm, and to provide the 988 Suicide & Crisis Lifeline. Private right of action with minimum $1,000 per violation plus attorneys' fees.
- New York Article 47 (effective Nov 2025): Requires AI companions to detect and respond to suicidal ideation, with disclosure requirements. Penalties up to $15,000/day.
- EU AI Act (full application Aug 2026): AI systems in medical devices are automatically high-risk. Requires risk management, data governance, and clinical validation. Penalties up to €35 million or 7% of global turnover.
- UK Online Safety Act: Platforms have enforceable duties of care to protect users from harm, with fines up to 10% of global revenue.
The term "evidence-based methods" in SB 243 is not defined in the legislation. The Future of Privacy Forum analysis notes this "introduces practical ambiguity, as developers must determine which conversational indicators trigger reporting and what methodologies satisfy this requirement." Keyword matching alone likely doesn't qualify. Detection informed by clinical frameworks like C-SSRS provides documentation that your approach is grounded in research.
The FTC requires "competent and reliable scientific evidence" (preferably RCTs) for health claims. Positioning detection as a "safety feature" rather than "diagnostic tool" may reduce regulatory burden but does not eliminate the need for validation documentation.
Recommendations
If you're just starting
Start with a content moderation API (OpenAI's is free) as a baseline, but understand its limitations. It will catch explicit statements but miss implicit signals and won't provide matched resources.
If you need regulatory compliance
Use a purpose-built crisis detection API that provides:
- Clinical framework grounding (documentation for "evidence-based methods")
- Matched crisis resources (not just 988, but specific to the crisis type)
- Audit trail (request IDs and rationale for compliance logging; see the logging sketch below)
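A sketch of what that logging might look like, assuming a detection result shaped like the screen response shown earlier (which fields you retain, and how you handle user identifiers, should follow your own data retention and privacy policies):

```python
import json
import logging

audit_logger = logging.getLogger("crisis_audit")

def log_detection(result: dict, user_id_hash: str) -> None:
    """Append an audit record; store a hashed user ID rather than message text to limit sensitive data."""
    audit_logger.info(json.dumps({
        "request_id": result.get("request_id"),
        "timestamp": result.get("timestamp"),
        "user": user_id_hash,
        "risks": result.get("risks", []),
        "rationale": result.get("rationale"),
        "resources_shown": bool(result.get("show_resources")),
    }))
```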
If you're building custom
If you're building your own detection:
- Study the C-SSRS framework and understand the levels of ideation and what they mean
- Test against implicit signal patterns, not just explicit keywords
- Build a crisis resource database (or use an API like NOPE's /resources)
- Track false positive rates: over-triggering destroys user trust
- Document your methodology for regulatory defense
Regardless of approach
Detection is only half the problem. What matters is what happens after detection. A system that flags crisis but responds poorly is potentially worse than no detection at all. Make sure your response flow (a sketch follows this list):
- Provides appropriate, non-dismissive acknowledgment
- Offers relevant resources (matched to the specific crisis type)
- Doesn't over-trigger on hyperbole or dark humor
- Has a clear escalation path for high-severity cases
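A minimal sketch of such a response flow, again assuming a detection result shaped like the screen response shown earlier; the thresholds, wording, and `notify_escalation_team` hook are all placeholders to adapt to your product:

```python
def notify_escalation_team(request_id: str | None) -> None:
    """Hypothetical hook: page on-call staff, open a ticket, etc."""
    ...

def respond_to_detection(result: dict) -> str | None:
    """Turn a detection result into a user-facing action, or None to continue normally."""
    risks = result.get("risks", [])
    if not risks:
        return None

    top = max(risks, key=lambda r: r.get("confidence", 0))

    # High severity or imminent risk: escalate to a human path, don't just show a banner.
    if top.get("severity") == "high" or top.get("imminence") == "imminent":
        notify_escalation_team(result.get("request_id"))

    # Low-severity, low-confidence hits (likely hyperbole): don't interrupt the user.
    if top.get("severity") == "low" and top.get("confidence", 0) < 0.5:
        return None

    primary = result.get("resources", {}).get("primary", {})
    return (
        "It sounds like you're going through something really hard. "
        f"If you want to talk to someone right now, {primary.get('name', 'the 988 Lifeline')} "
        f"is available 24/7 at {primary.get('phone', '988')}."
    )
```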
Research foundations
This section covers the clinical and research background that informs crisis detection systems. It's useful context for understanding why certain approaches work better than others, but you can skip it if you just need the practical guidance above.
The C-SSRS framework
The Columbia Suicide Severity Rating Scale (C-SSRS) is the most widely used instrument in FDA-regulated clinical trials for suicide assessment, available in 150+ country-specific languages (Posner et al., 2011). The FDA's 2012 draft guidance describes C-SSRS as "acceptable" for clinical studies while noting that other instruments "could also be acceptable" if they meet classification requirements. It distinguishes five levels of ideation:
- Passive ideation: "I wish I were dead" or "I wish I could go to sleep and not wake up"
- Active ideation without plan: "I've thought about ending my life"
- Active ideation with plan: "I've been thinking about how I would do it"
- Intent without specific plan: "I'm going to end it soon"
- Intent with specific plan: Detailed plan with timeline
A 2024 meta-analysis of C-SSRS studies found that prior suicidal behavior predicts future non-fatal attempts (OR 3.14, 95% CI 1.86-5.31), though AUC values of 0.62-0.65 indicate prediction remains modest at the individual level (Lindh et al., 2018). Different severity levels require different responses: passive ideation may warrant support resources; active ideation with plan requires immediate escalation.
C-SSRS has been validated for structured electronic self-report (patients answering specific C-SSRS questions) but not for inferring risk from unstructured text conversations. Applying C-SSRS categories to free-text chat is an extrapolation from clinical use: useful for structuring assessment, but not clinically validated in this context.
Implicit signals
Between 33-50% of people who die by suicide explicitly communicate intent beforehand, rising to 60-80% when indirect signals are included (Isometsä, 2001). Critically, 78% of inpatients who died by suicide denied intent in their final clinical interview (Busch et al., 2003). Detection systems cannot rely solely on explicit statements.
Joiner's Interpersonal-Psychological Theory of Suicide (Van Orden et al., 2010) identifies two key constructs that predict suicidal desire:
- Perceived burdensomeness: "Everyone would be better off without me." Meta-analysis of 122 samples (N=59,698) shows r = 0.48 correlation with suicidal ideation (Chu et al., 2017).
- Thwarted belongingness: "No one would notice if I disappeared." Shows r = 0.37 correlation with ideation.
Other empirically-supported implicit signals come from different theoretical frameworks:
- Hopelessness (Beck's cognitive model): Shows 0.80 sensitivity for suicide risk, though specificity is only 0.42 (McMillan et al., 2007).
- Farewell language: "I just wanted to say thank you for everything." Identified in psychological autopsy studies as a behavioral warning sign.
Implicit death associations (measured via IAT) predict 6-month suicide attempts at approximately 6-fold elevated odds, exceeding the predictive validity of explicit ideation, depression, and clinician prediction (Nock et al., 2010). Not all commonly cited warning signs have strong evidence, though. "Sudden calm after distress" and "giving away possessions" appear in clinical guidelines but lack prospective empirical validation.
Linguistic markers
Research on language patterns shows that absolutist words (always, nothing, completely, never) differentiate suicidal ideation forums from depression/anxiety forums with very large effect size (d > 1.71) and track severity more faithfully than negative emotion words (Al-Mosaiwi & Johnstone, 2018).
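As a toy illustration of the kind of lexical feature this research measures (the word list below is an abbreviated, illustrative subset, not the validated dictionary used in the study, and the ratio is a linguistic feature, not a risk score):

```python
import re

# Abbreviated, illustrative subset of absolutist terms -- not the published research dictionary.
ABSOLUTIST_WORDS = {"always", "never", "nothing", "completely", "totally", "entirely", "constantly"}

def absolutist_ratio(text: str) -> float:
    """Fraction of tokens that are absolutist words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    return sum(token in ABSOLUTIST_WORDS for token in tokens) / len(tokens)
```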
Ready to add crisis detection?
Get started with NOPE's Screen API. $1 free credit, no credit card required.
Sources
Clinical frameworks & validation
- Columbia Suicide Severity Rating Scale (C-SSRS)
- Posner et al. (2011). The Columbia-Suicide Severity Rating Scale: Initial validity and internal consistency findings. American Journal of Psychiatry.
- British Journal of Psychiatry (2024). Prediction of fatal and non-fatal suicide attempts by the C-SSRS: systematic review and meta-analysis.
- Lindh et al. (2018). Short term risk of non-fatal and fatal suicidal behaviours: The predictive validity of the C-SSRS. BMC Psychiatry.
- Franklin et al. (2017). Risk factors for suicidal thoughts and behaviors: A meta-analysis of 50 years of research. Psychological Bulletin.
Implicit ideation & warning signs
- Van Orden et al. (2010). The interpersonal theory of suicide. Psychological Review.
- Chu et al. (2017). The interpersonal theory of suicide: A systematic review and meta-analysis. Psychological Bulletin.
- Busch et al. (2003). Clinical features of inpatient suicide. Psychiatric Services.
- Isometsä (2001). Psychological autopsy studies: A review. European Psychiatry.
- Nock et al. (2010). Measuring the suicidal mind: Implicit cognition predicts suicidal behavior. Psychological Science.
- Al-Mosaiwi & Johnstone (2018). Absolutist words as markers of anxiety, depression, and suicidal ideation. Clinical Psychological Science.
- McMillan et al. (2007). Can we predict suicide with the Beck Hopelessness Scale? Psychological Medicine.
Detection approaches & research
- Ophir et al. (2020). Deep neural networks detect suicide risk from textual Facebook posts. Nature Scientific Reports.
- Holley et al. (2025). Evaluating GPT models for suicide risk assessment. BMC Psychiatry.
- Can Large Language Models Identify Implicit Suicidal Ideation? (2025)
- Lost in moderation: How commercial content moderation APIs over- and under-moderate (2025)
- Lookingbill & Le (2024). Research on NSSI content and algospeak on social media.
- Performance of mental health chatbot agents in detecting and managing suicidal ideation (2025)
- Wysa (2024). AI detects 82% of mental health app users in crisis.