ChatGPT Health Fails Critical Emergency and Suicide Safety Tests


ChatGPT Health, a widely used consumer artificial intelligence (AI) tool that provides health guidance directly to the public, including advice about how urgently to seek medical care, may fail to direct users appropriately to emergency care in a significant number of serious cases, according to researchers at the Icahn School of Medicine at Mount Sinai.

The study, fast-tracked in the February 23, 2026, online issue of Nature Medicine [https://doi.org/10.1038/s41591-026-04297-7], is the first independent safety evaluation of the large language model (LLM)-based tool since its January 2026 launch. It also identified serious concerns with the tool's suicide-crisis safeguards.

"LLMs person go patients' first extremity for aesculapian advice-but successful 2026 they are slightest safe astatine nan objective extremes, wherever judgement separates missed emergencies from needless alarm," says Isaac S. Kohane, MD, PhD, Chair, Department of Biomedical Informatics astatine Harvard Medical School, who was not progressive pinch nan research. "When millions of group are utilizing an AI strategy to determine whether they request emergency care, nan stakes are extraordinarily high. Independent information should beryllium routine, not optional."

Within weeks of its release, ChatGPT Health's maker, OpenAI, reported that about 40 million people were using the tool daily to seek health information and guidance, including advice about whether to seek urgent or emergency care. At the same time, say the investigators, there was little independent evidence about how safe or reliable its advice really was.

"That gap motivated our study. We wanted to answer a very basic but critical question: if someone is experiencing a real medical emergency and turns to ChatGPT Health for help, will it clearly tell them to go to the emergency room?"

Ashwin Ramaswamy, MD, lead author, Instructor of Urology, Icahn School of Medicine at Mount Sinai

With respect to suicide-risk alerts, ChatGPT Health was designed to direct users to the 988 Suicide and Crisis Lifeline in high-risk situations. However, the investigators found that these alerts appeared inconsistently, sometimes triggering in lower-risk scenarios while, alarmingly, failing to appear when users described specific plans for self-harm.

"This was a peculiarly astonishing and concerning finding," says elder and co-corresponding study writer Girish N. Nadkarni, MD, MPH, Barbara T. Murphy Chair of the Windreich Department of Artificial Intelligence and Human Health, Director of the Hasso Plattner Institute for Digital Health, and Irene and Dr. Arthur M. Fishberg Professor of Medicine astatine nan Icahn School of Medicine astatine Mount Sinai, and Chief AI Officer of nan Mount Sinai Health System. "While we expected immoderate variability, what we observed went beyond inconsistency. The system's alerts were inverted comparative to objective risk, appearing much reliably for lower-risk scenarios than for cases erstwhile personification shared really they intended to wounded themselves. In existent life, erstwhile personification talks astir precisely really they would harm themselves, that's a motion of much contiguous and superior danger, not less."

As part of the evaluation, the research team created 60 structured clinical scenarios spanning 21 medical specialties. Cases ranged from minor conditions appropriate for home care to true medical emergencies. Three independent physicians determined the correct level of urgency for each case using guidelines from 56 medical societies.

Each scenario was tested under 16 different contextual conditions, including variations in race, gender, social dynamics (such as a person minimizing symptoms), and barriers to care like lack of insurance or transportation. In total, the team conducted 960 interactions with ChatGPT Health and compared its recommendations with expert consensus.
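The paper describes this design only in prose, but the mechanics are straightforward to picture. Below is a minimal Python sketch of how such an evaluation grid could be run; the names used here (Scenario, query_chatgpt_health, TRIAGE_LEVELS) are illustrative assumptions, not taken from the published study.

```python
# Hypothetical harness for the study design described above:
# 60 scenarios x 16 contextual conditions = 960 model interactions.
# All names are illustrative, not from the published paper.
from dataclasses import dataclass
from itertools import product

# Ordered from least to most urgent, so index comparisons encode urgency.
TRIAGE_LEVELS = ["home care", "routine visit", "urgent care", "emergency"]

@dataclass
class Scenario:
    vignette: str   # structured clinical case text
    consensus: str  # urgency level agreed by three independent physicians

def query_chatgpt_health(prompt: str) -> str:
    """Stub standing in for the real model call; returns a triage level."""
    return "home care"  # placeholder response for this sketch

def run_evaluation(scenarios: list[Scenario], contexts: list[str]) -> list[dict]:
    results = []
    for scenario, context in product(scenarios, contexts):
        prompt = f"{context}\n\n{scenario.vignette}"
        recommendation = query_chatgpt_health(prompt)
        results.append({
            "consensus": scenario.consensus,
            "model": recommendation,
            # Under-triage: the model recommends lower urgency than the experts.
            "under_triaged": (TRIAGE_LEVELS.index(recommendation)
                              < TRIAGE_LEVELS.index(scenario.consensus)),
        })
    return results
```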

In testing the 60 realistic patient scenarios developed by physicians, the researchers found that while the tool generally handled clear-cut emergencies correctly, it under-triaged more than half of the cases that physicians determined required emergency care.
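On that reading, the headline figure is an under-triage rate restricted to expert-labeled emergencies. Continuing the hypothetical sketch above, it could be computed like this:

```python
def under_triage_rate(results: list[dict]) -> float:
    """Share of expert-labeled emergency cases the model rated less urgent."""
    emergencies = [r for r in results if r["consensus"] == "emergency"]
    if not emergencies:
        return 0.0
    return sum(r["under_triaged"] for r in emergencies) / len(emergencies)
```

A value above 0.5 from such a calculation would correspond to the "more than half" finding reported here.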

The investigators were also struck by how the system failed in emergency medical cases. The tool often demonstrated that it recognized dangerous findings in its own explanations, yet still reassured the patient.

"ChatGPT Health performed good successful textbook emergencies specified arsenic changeable aliases terrible allergic reactions," says Dr. Ramaswamy. "But it struggled successful much nuanced situations wherever nan threat is not instantly obvious, and those are often nan cases wherever objective judgement matters most. In 1 asthma scenario, for example, nan strategy identified early informing signs of respiratory nonaccomplishment successful its mentation but still advised waiting alternatively than seeking emergency treatment."

The study authors advise that for worsening or concerning symptoms, including chest pain, shortness of breath, severe allergic reactions, or changes in mental status, people should seek medical care directly rather than relying solely on chatbot guidance. In cases involving thoughts of self-harm, individuals should contact the 988 Suicide and Crisis Lifeline or go to an emergency department.

Still, the researchers emphasize that the findings do not suggest consumers should abandon AI health tools altogether.

"As a aesculapian student training astatine a clip erstwhile AI wellness devices are already successful nan hands of millions, I spot them arsenic technologies we must study to merge thoughtfully into attraction alternatively than substitutes for objective judgment," says Alvira Tyagi, a first-year aesculapian student astatine nan Icahn School of Medicine astatine Mount Sinai and 2nd writer of nan study. "These systems are changing quickly, truthful portion of our training now must see learning really to understand their outputs critically, place wherever they autumn short, and usage them successful ways that protect patients."

The study assessed the system at a single point in time. Because AI models are frequently updated, performance may change over time, underscoring the need for independent evaluation, the researchers say.

"Starting aesculapian training alongside devices that are evolving successful existent clip makes it clear that today's results are not group successful stone," Ms. Tyagi says. "That reality calls for ongoing reappraisal to guarantee that improvements successful exertion construe into safer care."

The team plans to continue evaluating updated versions of ChatGPT Health and other consumer-facing AI tools, expanding future research into areas such as pediatric care, medication safety, and non-English-language use.

The paper is titled "ChatGPT Health performance in a structured test of triage recommendations."

The study's authors, as listed in the journal, are Ashwin Ramaswamy, MD, MPP; Alvira Tyagi, BA; Hannah Hugo, MD; Joy Jiang, PhD; Pushkala Jayaraman, PhD; Mateen Jangda, MSc; Alexis E. Te, MD; Steven A. Kaplan, MD; Joshua Lampert, MD; Robert Freeman, MSN, MS; Nicholas Gavin, MD, MBA; Ashutosh K. Tewari, MBBS, MCh; Ankit Sakhuja, MBBS, MS; Bilal Naved, PhD; Alexander W. Charney, MD, PhD; Mahmud Omar, MD; Michael A. Gorin, MD; Eyal Klang, MD; and Girish N. Nadkarni, MD, MPH.

Journal reference:

Ramaswamy, A., et al. (2026). ChatGPT Health performance in a structured test of triage recommendations. Nature Medicine. DOI: 10.1038/s41591-026-04297-7. https://www.nature.com/articles/s41591-026-04297-7