
Deep learning models can predict violence and threats against healthcare providers using clinical notes

Dataset

With the approval of our hospital leadership, we obtained two datasets of hospital security incidents spanning March 2021 to October 2023 (2.5 years).

  1. Code grays are incidents in which security staff respond to patients attempting to leave or resist care when doing so poses an imminent danger to themselves or others.

  2. Staff patient safety net (PSN) incidents involve violence against hospital staff, often leading to significant staff distress.

The two events often coincide. We first combined and then de-duplicated the two datasets to ensure a given incident was recorded only once. Many Code Gray and PSN events are preceded by staff taking precautions when patients appear potentially violent or threatening, often mentioned in clinical notes. Because we aimed to limit our predictions to cases where the event was surprising and unanticipated—and thus where prediction beforehand would be of most value—we filtered out any event whose clinical notes in the 3 days preceding its timestamp mentioned terms indicative of security monitoring, such as “1:1”, “sitter”, “detained”, “against medical advice”, and so on. This resulted in 280 cases with unique timestamps from 246 unique patients. Using these 280 cases, we extracted clinical notes in the 3 days preceding a given timestamp. Because patients accumulate many clinical documents, some less useful for our analysis, we limited clinical notes to the “H&P” (history and physical) and “Nursing Note” types. H&P notes are typically longer, describing the patient’s history, reason for hospitalization, recent labs and medication orders, and plan for care. Nursing notes tend to be shorter, more frequent updates on patient status. For each unique event, we concatenated the notes into a single long document.
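To make this filtering and concatenation concrete, here is a minimal Python sketch. The event dictionary fields and the `get_notes_in_window` helper are hypothetical stand-ins for our data warehouse query layer, and only a subset of the monitoring terms is shown.

```python
from datetime import timedelta

# Illustrative subset of terms indicating security monitoring was
# already in place; the full list used in the study is longer.
MONITORING_TERMS = ["1:1", "sitter", "detained", "against medical advice"]

def preceding_notes(event, get_notes_in_window):
    """Fetch H&P and nursing notes from the 3 days before the event."""
    return get_notes_in_window(
        patient_id=event["patient_id"],
        start=event["timestamp"] - timedelta(days=3),
        end=event["timestamp"],
        note_types=("H&P", "Nursing Note"),
    )

def is_unanticipated(event, get_notes_in_window):
    """Keep only events whose preceding notes never mention monitoring."""
    text = " ".join(n["text"].lower() for n in preceding_notes(event, get_notes_in_window))
    return not any(term.lower() in text for term in MONITORING_TERMS)

def build_document(event, get_notes_in_window):
    """Concatenate the preceding notes, oldest first, into one document."""
    notes = sorted(preceding_notes(event, get_notes_in_window), key=lambda n: n["time"])
    return "\n\n".join(n["text"] for n in notes)
```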

Case-control matching

We achieved a 1:1 matching of cases and control patients. To do so, we wrote an algorithm querying our data warehouse to match each case to a control patient of the same biological sex, the same age within ±5 years, admitted in the same 2.5-year time window, and with the highest number of matching ICD-10 diagnosis codes for the encounter, limited to patients with an H&P note written. For example, a male case patient who was 46 years old at the time of the violent event and admitted with diagnosis codes for Altered Mental Status (R41.82), Wheezing (R06.2), and Tachycardia (R00.00) would be matched to the first male control patient found aged 41–51 with the same admitting diagnoses. If no patient had all 3 diagnosis codes, the first patient with 2 of the 3 would be selected, and so on. This resulted in 280 corresponding control patients. We used the time 3 days after the initial H&P note as an artificial timestamp for controls and similarly created a long document for each. Our combined case + control dataset thus had 560 total documents.
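A simplified sketch of this matching logic is below. In practice this ran as a query against our data warehouse; the candidate record fields are a hypothetical schema, and `candidates` is assumed to be pre-filtered to admissions within the same 2.5-year window.

```python
def match_control(case, candidates):
    """Return the best control for a case: same sex, age within 5
    years, an H&P note on file, and the largest overlap in encounter
    ICD-10 codes. Ties keep the first candidate encountered,
    mirroring the "first ... found" rule described above."""
    eligible = [
        c for c in candidates
        if c["sex"] == case["sex"]
        and abs(c["age"] - case["age"]) <= 5
        and c["has_hp_note"]
    ]
    case_codes = set(case["icd10_codes"])
    return max(
        eligible,
        key=lambda c: len(case_codes & set(c["icd10_codes"])),
        default=None,
    )
```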

Risk factor annotation schema and annotation

Because our long-term goal is to prospectively identify at-risk providers and patients in order to intervene and de-escalate potential violent events, framing risk ranking as a regression task alone (i.e., predicting a normalized value between 0.0 and 1.0) does not lend sufficient explainability. For example, if a psychiatry team received a daily report of providers and patients at risk but the report showed only a single predicted continuous value alongside each patient (e.g., “John Doe: 0.387”), it would be impossible to understand why a model produced such a prediction without actually reviewing the patient’s chart—with the caveat that even then, human understanding and the factors influencing model prediction may not align. While techniques such as SHAP23 could be used to highlight segments of clinical notes correlated with a given model output, highlights scattered across an entire set of long concatenated notes could be time-consuming to read, when a short summary would be preferable. We therefore reasoned that an annotated corpus of risk factors for violence, developed from the perspective of psychiatrists, could be used to train an NER model and aid explainability and summarization in such a future report. Risk factor identification could thereby guide actionable, targeted interventions such as substance use treatment, suicide precautions, or delirium management.

Our lead psychiatry team members developed an annotation guideline for the following eight categories:

  1. Aggressive behavior—Aggressive actions observed during hospitalization and those performed prior to admission. Observed behaviors were based on the Broset Violence Checklist24, a validated short-term violence prediction instrument. Previous actions included interpersonal violence perpetration and victimization, such as a history of assaults or suicide attempts.

  2. Cognition—The six neurocognitive functioning domains: (1) memory and learning, (2) language, (3) executive functioning, (4) complex attention, (5) social cognition, and (6) perceptual and motor functioning. In clinical documentation, cognitive impairments are often reported through a patient’s levels of alertness, orientation status, and ability to comprehend medical care and communicate medical decisions.

  3. Mood symptoms—Based on DSM-5 symptomatology for depressive, anxious, and manic conditions, this category identified disordered alterations in patients’ emotional states during the current hospitalization, such as suicidal ideation, hopelessness, rumination, panic, and grandiosity.

  4. Psychotic symptoms—A loss of contact with reality. This category included positive symptoms such as delusions, hallucinations, and disorganized thoughts and behaviors, and negative symptoms such as emotional blunting, avolition, and poverty of thought.

  5. Acute substance use—The recent recreational use of mood-altering substances, both legal (such as nicotine, alcohol, and cannabis) and illicit (such as opioids and stimulants). Recent use prior to or during admission could be self-reported or referenced through toxicology results. In addition, this category included signs and symptoms of active substance intoxication, withdrawal, and craving.

  6. Unmet needs/interpersonal conflict—Patient-reported dissatisfaction with care. This included concrete complaints such as poorly controlled pain, interrupted sleep, rescheduled procedures, and premature or delayed discharge. It also included abstract patient perceptions of mistreatment by medical providers, such as feeling ignored or judged.

  7. Noncompliance—Patient refusal to participate in medically necessary care. This included refusing to take scheduled medications, to follow physical restrictions, or to participate in clinical interviews, exams, and diagnostic interventions such as lab draws and imaging studies. It also included the purposeful removal of medical equipment such as braces or bandages.

  8. High care utilization—A disproportionate burden on the healthcare system due to elevated resource use. This included references to bounce-back medical admissions, frequent emergency room visits, past psychiatric hospitalizations (for either voluntary or involuntary treatment), past incarcerations, and past or current treatment with community mental health organizations.

For our 560 total documents, we enlisted six annotators with varying levels of experience in psychiatry: two attending psychiatrists, three psychiatry residents, and one medical student entering psychiatry residency. In addition to annotating risk factors, we also used our psychiatry team to establish a baseline of how well human experts can predict an upcoming violent event by reading clinical notes. To do so, at the end of each document, we added the text << CODE_GRAY_OR_PSN_WILL_OCCUR >>, which annotators were instructed to annotate as Yes or No based on their experience and intuition. To ensure high annotation quality, we first trained all annotators on the same 20 randomly selected documents, then copied all annotation variations for each training document into a “differential” file that the annotation team used to reconcile differences. We then double-annotated the remaining documents by pairing annotators and randomly assigning 135 documents to each pair, split into batches. After each round of annotation was complete, we generated differential files for each pair and completed the reconciliation. Two of the annotators were paired twice and thus annotated approximately twice as many documents as the others. If a pair differed in their Yes/No prediction of violence for a given reconciled document, a third “tie-breaker” psychiatrist from our annotation leads reviewed the document and determined a final Yes/No prediction. Prior to reconciliation, the mean pairwise inter-annotator agreement, measured by F1 score using a relaxed scoring method requiring matching labels and overlapping (though not identical) token spans, was 0.71, indicating reasonably high agreement.
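For clarity, the relaxed scoring can be sketched as follows; the `(label, start, end)` tuples are a hypothetical representation of each annotator's token-level spans.

```python
def overlaps(a, b):
    """True if two (start, end) token spans overlap at all."""
    return a[0] < b[1] and b[0] < a[1]

def matched(span, others):
    """A span counts as matched if some span in `others` has the same
    label and an overlapping (not necessarily identical) token range."""
    label, start, end = span
    return any(l == label and overlaps((start, end), (s, e)) for l, s, e in others)

def relaxed_pairwise_f1(anns_a, anns_b):
    """Relaxed F1 between two annotators, treating each side in turn
    as the reference for precision and recall."""
    if not anns_a or not anns_b:
        return 0.0
    precision = sum(matched(s, anns_b) for s in anns_a) / len(anns_a)
    recall = sum(matched(s, anns_a) for s in anns_b) / len(anns_b)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```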

Baseline regression model

In addition to our human psychiatry baseline, we also aimed to create a baseline regression model using largely structured data elements. We based the inputs to this model on the MEND Screening Model at the University of Pennsylvania25, using patient age at prediction time, psychiatric diagnoses associated with medication orders, presence of active mood disorder and anxiety-related diagnoses on the problem list or in past billing diagnosis codes, whether antidepressants were administered during the encounter, and prior ED visits with a psychiatric complaint within the past two years. Additionally, we explored including keyword features from preceding clinical notes relating to anxiety, abuse, psychiatry, depression, withdrawal, Ativan, alprazolam, and intravenous drug use.
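As an illustration only, such a baseline could be fit as a logistic regression over the assembled features; the model family, feature assembly, and file paths below are assumptions of this sketch rather than a specification of our exact pipeline.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hypothetical feature matrix: one row per case/control document, with
# columns for age at prediction time, medication-linked psychiatric
# diagnoses, mood/anxiety problem-list flags, antidepressant
# administration, prior psychiatric ED visits, and binary keyword flags.
X = np.load("structured_features.npy")  # placeholder path
y = np.load("labels.npy")               # 1 = case, 0 = control

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("held-out accuracy:", model.score(X_test, y_test))
```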

Evaluation

At a high level, our evaluation sought to answer the following questions:

  1. How well can a document classification model predict violence? How does this compare to human experts or to models using structured data? Alongside our human and structured-data baselines, we aimed to create a deep learning model capable of matching or surpassing human experts in prediction, trained on an unannotated raw-text dataset. As documents in our dataset tended to be long, we fine-tuned the Clinical-Longformer base model26 for document classification after randomly splitting our dataset 80/20 into train and test documents. Clinical-Longformer builds upon the Longformer model27, which replaces the standard full attention mechanism in Transformer models with a combination of sliding-window and sparse global attention, reducing memory consumption and allowing a larger context window. As discussed, our documents were concatenations of multiple clinical notes prior to a given timestamp for a single patient. Where a document exceeded the context window allowed by Clinical-Longformer or our GPU memory constraints, we found after some experimentation that cropping the document to only its beginning and ending (up to 50% of the allowed characters on each side), effectively removing the middle, worked reasonably well (a sketch of this cropping heuristic follows this list).

  2. How well can an NER model predict risk factors? Using the risk factor annotations by our psychiatry team, we trained an NER model to predict an output label for each token in a given document. To evaluate every word in a document, we used a moving-window strategy, splitting each document based on the maximum tokens allowed by the model’s context window and then evaluating each window of text with the Bio_ClinicalBERT base model28 (see the windowing sketch after this list).

We use the F1 score as our primary evaluation metric in both tasks, where F1 = 2 × (precision × recall)/(precision + recall).

All experiments were approved by our institutional review board (IRB #00018889).
