The burgeoning integration of artificial intelligence (AI) within healthcare continues its rapid ascent, promising to revolutionize everything from diagnostics to personalized treatment plans. A significant area of application has been predictive analytics, particularly the analysis of patient health data and behavior to assess the risk of various illnesses or chronic conditions. Among these, the use of AI to generate Opioid Risk Scores (ORS) has become increasingly prevalent, with systems like NarxCare and Epic’s proprietary tools scanning vast electronic health records (EHRs) and prescription drug databases. The stated aim is to identify patients at heightened risk of opioid misuse or overdose, thereby informing healthcare providers and potentially guiding prescribing decisions. However, a groundbreaking new study, the first of its kind, casts serious doubt on the efficacy and reliability of these AI-generated scores, revealing unacceptably high rates of false positives that could have profound negative implications for patient care.
A Deep Dive into the Study’s Findings
The pivotal research, recently published in the esteemed Journal of General Internal Medicine, meticulously examined Epic’s opioid risk scores for a substantial cohort of over 700,000 U.S. patients under the care of primary care providers. The scale of the study underscores the widespread adoption and influence of such systems in contemporary clinical practice. A striking initial observation was that the vast majority of patients—a remarkable 99.6%—were classified by the Epic system as being at low risk for an overdose or Opioid Use Disorder (OUD). Conversely, a small fraction, just 0.4% of the patient population, was flagged as high risk.
While the high percentage of patients deemed low risk might appear reassuring on the surface, the study delved deeper into the predictive accuracy of these classifications in relation to actual patient outcomes over a 12-month period. For the 702,099 patients initially classified as low risk, only 2,177 subsequently experienced an overdose or received an OUD diagnosis within the following year. This translates to a negative predictive value of 99.7%: the low-risk label was correct for all but about 0.3% of these patients.
However, the picture changes dramatically when analyzing the high-risk group. Of the 2,665 patients that Epic’s system identified as high risk, a mere 185 individuals later went on to have an overdose or an OUD diagnosis. This finding is particularly concerning, as it indicates that Epic’s scoring system correctly predicted outcomes in the high-risk category only about 7% of the time (a positive predictive value of 6.9%). Put differently, 2,480 of the 2,665 patients flagged as high risk, roughly 93%, were false positives. Such a high rate, as the researchers pointed out, means that Epic’s ORS "produces too many false alarms" and offers minimal practical value to healthcare providers attempting to make informed clinical decisions.
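These proportions follow directly from the counts the study reports. A minimal Python sketch, using only the figures quoted above, reproduces them:

```python
# Counts reported in the study (as quoted above).
flagged_high_risk = 2_665  # patients Epic classified as high risk
true_positives = 185       # flagged patients who later overdosed or received an OUD diagnosis

false_positives = flagged_high_risk - true_positives     # 2,480 patients
ppv = true_positives / flagged_high_risk                 # positive predictive value
false_alarm_share = false_positives / flagged_high_risk  # share of high-risk flags that were wrong

print(f"PPV: {ppv:.1%}")                  # ~6.9%, i.e. 'about 7% of the time'
print(f"False alarms: {false_alarm_share:.1%}")  # ~93.1% of high-risk flags
```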
Dr. Stephanie Hooker, PhD, a Research Investigator at HealthPartners Institute and lead author of the study, articulated the severity of these findings. "In this study, most high-risk patients were false positives, and most who developed OUD or overdosed were false negatives. Because these outcomes are rare, achieving adequate PPV (the proportion of cases that are accurate) is challenging. The ORS’s misclassification could undermine its external validity, leading to misallocated resources and missed interventions." The implications cut both ways: false positives can lead to patients being unjustly denied necessary opioid medication for legitimate pain, or being unnecessarily referred for addiction treatment when no such disorder exists, while "missed interventions" mean that patients who genuinely need help go unnoticed.
Furthermore, the study highlighted another critical flaw: the system’s inability to consistently identify actual cases. Despite the 99.7% negative predictive value, the system flagged as high risk only 185 of the 2,362 patients who did experience an overdose or OUD diagnosis, a sensitivity of just 7.8%. This means over two thousand individuals who genuinely suffered from these outcomes were incorrectly categorized as "low risk," creating a dangerous sense of false reassurance for clinicians. The overarching lesson, therefore, appears to be that neither "low risk" nor "high risk" designations from these systems offer definitive certainty, underscoring the limitations of current AI models in this sensitive domain.
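Dr. Hooker's point about rare outcomes can be made concrete with Bayes' rule. The sketch below first derives the sensitivity implied by the study's reported counts, then shows that even a hypothetical score with far stronger characteristics (the 90% sensitivity and 95% specificity figures are assumptions for illustration, not findings from the study) would still produce a single-digit PPV at this outcome prevalence:

```python
# Sensitivity implied by the reported counts: the score caught only a small
# fraction of the patients who actually experienced an overdose or OUD diagnosis.
actual_cases = 2_362                 # 2,177 missed + 185 caught
caught = 185
sensitivity = caught / actual_cases  # ~7.8%

# Base-rate illustration (hypothetical classifier, not the study's):
# with outcome prevalence of ~0.34% (2,362 of 704,764 patients),
# even a much better score struggles to achieve a useful PPV.
prevalence = 2_362 / 704_764
hyp_sensitivity, hyp_specificity = 0.90, 0.95  # assumed for illustration only

ppv = (hyp_sensitivity * prevalence) / (
    hyp_sensitivity * prevalence + (1 - hyp_specificity) * (1 - prevalence)
)

print(f"Sensitivity implied by the study: {sensitivity:.1%}")  # ~7.8%
print(f"PPV of a hypothetical 90%-sensitive, 95%-specific score: {ppv:.1%}")  # ~5.7%
```

In other words, at this prevalence, even a dramatically improved classifier would still flag many more false positives than true ones, which is precisely why rare-outcome prediction is so difficult.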
The Genesis of Opioid Risk Scores: A Response to Crisis
To fully appreciate the context in which these AI-driven tools emerged, it is essential to revisit the opioid crisis in the United States. Beginning in the late 1990s, the nation witnessed a dramatic increase in opioid prescriptions, fueled by aggressive marketing and a shift in medical practice towards more liberal pain management. This surge in prescribing, particularly for chronic non-cancer pain, inadvertently laid the groundwork for a public health catastrophe. The crisis has unfolded in distinct waves: the first characterized by prescription opioid overdose deaths, the second by heroin overdose deaths, and the third, ongoing wave driven by synthetic opioids like fentanyl. The Centers for Disease Control and Prevention (CDC) estimates that hundreds of thousands of Americans have died from opioid-related overdoses, with the crisis costing the U.S. economy trillions of dollars in healthcare, lost productivity, and criminal justice expenditures.
In response to this escalating crisis, policymakers and healthcare stakeholders sought innovative solutions to curb misuse, prevent addiction, and reduce overdose fatalities. One key strategy involved tightening prescribing guidelines and implementing Prescription Drug Monitoring Programs (PDMPs). These state-level electronic databases collect information on controlled substance prescriptions, allowing prescribers and pharmacists to review a patient’s prescription history and identify potential "doctor shoppers" or those at high risk of misuse.
It was within this landscape that AI and machine learning began to be explored as powerful tools. The rationale was compelling: human clinicians, despite their expertise, are limited in their ability to synthesize vast amounts of data from EHRs, PDMPs, and other sources to identify complex patterns indicative of risk. AI promised to automate this process, providing rapid, data-driven insights to flag high-risk individuals and improve patient safety. Companies like Bamboo Health (developer of NarxCare) and Epic Systems (a dominant EHR vendor with integrated ORS) positioned their solutions as critical aids in this fight, designed to support clinical decision-making and enhance the responsible prescribing of opioids. NarxCare, for instance, is widely integrated into major pharmacy chains such as Walmart, Rite Aid, and CVS, analyzing patient risk profiles at the point of dispensing. Similarly, Epic’s MyChart software, used by countless healthcare systems, has collected data on over 190 million patients, making its ORS a pervasive presence in U.S. healthcare.
Expert Criticism and Regulatory Scrutiny
The study’s findings resonate strongly with criticisms previously voiced by pain management experts and patient advocates. Dr. Lynn Webster, a distinguished pain management expert and Senior Fellow at the Center for U.S. Policy (CUSP), has been a vocal critic of the overreliance on these AI-driven scores. He emphatically states that no opioid risk score, whether from Epic or NarxCare, should be viewed as authoritative by doctors and pharmacists when making crucial clinical decisions.
Webster elaborates on the potential harms: "Both tools can be harmful if used punitively. The NarxCare scores have shown that overestimated risk may lead to forced tapering, abandonment, or other punitive responses, which could paradoxically increase overdose risk. With Epic, the harm is a bit different: the score can both stigmatize flagged patients and falsely reassure clinicians about the much larger group labeled low risk." This highlights a dual danger: patients genuinely in need of pain management may be denied medication or labeled as "addicts" due to an inaccurate high-risk score, leading to abandonment and potentially driving them to illicit sources, thus increasing their overdose risk. Conversely, patients truly at risk might be overlooked if they are misclassified as low risk, leading to missed opportunities for intervention.
CUSP, under Webster’s guidance, took direct action in 2023 by petitioning the U.S. Food and Drug Administration (FDA) to remove NarxCare’s ORS software from the market. The petition argued that NarxCare was an unproven and misbranded medical device. However, the FDA rejected the petition on procedural grounds, a decision that further fueled concerns among critics about the regulatory oversight of such widely deployed AI tools. The FDA’s stance on AI as a medical device is evolving, and the agency faces the complex challenge of balancing innovation with patient safety, particularly for algorithms that can directly influence patient care.
Webster also points out a fundamental conceptual flaw in models that conflate OUD and overdose risk. He argues that these are distinct events, and combining them into a single prediction model introduces significant inaccuracies. "Someone can overdose without having OUD, while someone can have OUD without ever experiencing an overdose," he explains. He further elaborates to PNN (Pain News Network) that "Opioid risk tools will always struggle to predict overdose death risk because overdoses can occur in patients who have no opioid use disorder and no aberrant drug-related behavior. Some patients overdose even when they take their medications exactly as prescribed. Overdose can also occur because of comorbid medical conditions or other factors unrelated to OUD." This distinction is critical because predictive models that fail to account for these nuances will inherently produce flawed results, misdirecting resources and potentially harming patients.
Broader Impact and Ethical Implications
The widespread integration of these AI-driven risk scores into the fabric of the U.S. healthcare system means that their flaws have far-reaching implications. The danger, as Webster succinctly puts it, is that "once a proprietary risk label is embedded in the chart, it can take on a false authority that changes how patients are treated." This "false authority" can override clinical judgment, contributing to alert fatigue among providers already bombarded with notifications, or, worse, fostering blind reliance on algorithmic pronouncements.
For patients, the consequences can be devastating. Being incorrectly flagged as "high risk" can lead to stigmatization, denial of necessary pain medication, strained patient-provider relationships, and even medical abandonment. This not only compromises their quality of life but can also erode trust in the healthcare system. Conversely, the false reassurance for low-risk patients can lead to missed opportunities for early intervention in those who genuinely develop OUD or experience an overdose.
From a healthcare policy perspective, the study highlights an urgent need for more rigorous independent validation of AI tools before their widespread deployment, especially in high-stakes clinical areas. Regulatory bodies like the FDA must establish clearer guidelines and oversight mechanisms for AI algorithms classified as medical devices, ensuring transparency in their development, testing, and ongoing performance monitoring. The issue of algorithmic bias is also paramount; if the underlying data used to train these models reflects historical biases against certain demographic groups, the ORS could inadvertently perpetuate health inequities.
The findings also challenge the very notion of "precision medicine" when applied through flawed predictive algorithms. While the goal of tailoring care to individual patient needs is laudable, an overreliance on imprecise tools risks creating new forms of harm. The allocation of healthcare resources is also impacted: if interventions are triggered by high false positive rates, valuable time and resources are diverted to individuals who do not need them, while those truly in need might be overlooked due to false negatives.
The Path Forward: Human Oversight and Refined AI
The study does not suggest abandoning AI in healthcare entirely, but rather underscores the critical importance of caution, rigorous validation, and the unwavering necessity of human oversight. The promise of AI in healthcare remains immense, offering potential for efficiency, accuracy, and personalized care. However, this promise can only be realized if these technologies are developed, tested, and implemented with the utmost scientific rigor and ethical consideration.
Moving forward, several key areas require attention:
- Independent Validation: All AI algorithms intended for clinical use, especially those influencing prescribing or diagnosis, must undergo extensive independent validation studies to confirm their accuracy, reliability, and generalizability across diverse patient populations (a minimal validation sketch follows this list).
- Transparency and Explainability: Healthcare providers need to understand how these algorithms arrive at their conclusions (interpretability and explainability) to integrate them effectively into clinical judgment rather than blindly following their recommendations.
- Differentiating Outcomes: Predictive models should strive to differentiate between distinct outcomes like OUD and overdose, rather than conflating them, to provide more precise and actionable insights.
- Continuous Monitoring and Improvement: AI models are not static; they require continuous monitoring of their performance in real-world settings and iterative improvements based on new data and evolving clinical understanding.
- Human-Centered Design: AI tools should be designed as decision-support systems that augment, rather than replace, human clinical judgment. The ultimate responsibility for patient care must remain with the healthcare provider.
- Patient Engagement: Patients should be informed about the use of AI tools in their care and have avenues to address concerns or challenge classifications they believe are inaccurate.
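To make the first and third recommendations concrete, the sketch below outlines what an independent validation harness might look like, assuming a held-out cohort with per-patient risk flags and separately recorded outcomes. All names and data structures here are hypothetical illustrations, not any vendor's actual API:

```python
# A minimal sketch of an independent validation harness, assuming a held-out
# cohort with per-patient risk flags and separately recorded 12-month outcomes.
# All names and fields are hypothetical, not from any vendor's system.
from dataclasses import dataclass

@dataclass
class PatientRecord:
    flagged_high_risk: bool  # the score's classification
    had_overdose: bool       # observed outcome 1 (12-month follow-up)
    developed_oud: bool      # observed outcome 2, tracked separately

def validate(cohort: list[PatientRecord], outcome: str) -> dict[str, float]:
    """Compute PPV and sensitivity for one outcome, keeping overdose and
    OUD as distinct endpoints rather than a single combined label."""
    tp = sum(1 for p in cohort if p.flagged_high_risk and getattr(p, outcome))
    fp = sum(1 for p in cohort if p.flagged_high_risk and not getattr(p, outcome))
    fn = sum(1 for p in cohort if not p.flagged_high_risk and getattr(p, outcome))
    return {
        "ppv": tp / (tp + fp) if tp + fp else float("nan"),
        "sensitivity": tp / (tp + fn) if tp + fn else float("nan"),
    }

# Usage: report each outcome separately, and re-run periodically on fresh
# data to monitor performance drift in real-world settings.
# for outcome in ("had_overdose", "developed_oud"):
#     print(outcome, validate(held_out_cohort, outcome))
```

Reporting the two endpoints separately, as sketched here, directly addresses Webster's concern about conflating OUD with overdose, and re-running the harness on fresh cohorts supports the continuous monitoring the list calls for.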
The pervasive presence of Epic and NarxCare in the U.S. healthcare landscape means that the issues highlighted by this study are not theoretical; they are impacting millions of patient interactions daily. While the intent behind these AI tools is to enhance patient safety and combat a devastating public health crisis, the current evidence suggests that their execution, particularly in generating opioid risk scores, is significantly flawed. The challenge ahead lies in refining these sophisticated technologies, ensuring they truly serve as reliable aids to clinicians, protect patient well-being, and contribute effectively to addressing the complexities of pain management and opioid use disorder. Only through such critical evaluation and commitment to improvement can the full potential of AI be harnessed responsibly within the sensitive domain of healthcare.