Hospitals’ Clinical AI Tools Are Losing to General-Purpose AI
Health systems are paying for clinical AI tools that don’t outperform general-purpose models already available to physicians, according to a peer-reviewed study in Nature Medicine. This finding has direct implications for every CIO making AI procurement decisions.
Researchers from NYU Langone Health compared two widely adopted clinical AI platforms — OpenEvidence and UpToDate Expert AI — against frontier large language models from OpenAI, Google, and Anthropic across three separate evaluations: medical knowledge, alignment with expert clinicians, and real-world physician queries. Frontier models won every round. Not by a small margin. Decisively.
On medical knowledge questions, Gemini 3.1 Pro answered correctly 97.4% of the time. OpenEvidence came in at 89.6%, and UpToDate Expert AI at 88.4%. On a benchmark measuring alignment with expert clinical judgment, GPT-5.2 scored 88.0 against OpenEvidence’s 62.6 and UpToDate Expert AI’s 61.3. Neither gap is close.
The most compelling test, however, was the real clinical queries benchmark — 100 actual, de-identified questions that physicians submitted to an AI system during routine patient care at NYU Langone. Twelve blinded clinicians rated every model’s responses across clinical correctness, completeness, safety, and clarity. The results produced two distinct performance tiers: frontier models at the top, clinical AI tools at the bottom. Worse, the specialized tools performed no better than Google’s auto-generated AI search results.
The RAG Problem No One Is Advertising
Both OpenEvidence and UpToDate Expert AI likely use retrieval-augmented generation (RAG), a technique in which the model draws on a curated clinical knowledge base before responding. The marketing promise is that proprietary data sources make these tools more clinically reliable than general models.
The study raises a different possibility. When irrelevant material is retrieved or poorly integrated, RAG can actively degrade performance. The general-purpose frontier models, meanwhile, benefit from larger training datasets, faster development cycles, and more sophisticated alignment with human feedback. For the types of questions clinicians are actually asking, scale and broad reasoning may simply outweigh domain-specific tuning.
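To make the failure mode concrete, here is a toy sketch of how a fixed top-k retrieval step can hand a model irrelevant context. It does not describe how OpenEvidence or UpToDate Expert AI actually work, since neither architecture is public. The word-overlap score, the `retrieve` function, the `min_score` cutoff, and the placeholder passages are all assumptions made for illustration.

```python
import re

def tokens(text):
    """Lowercased set of alphabetic words in the text."""
    return set(re.findall(r"[a-z]+", text.lower()))

def overlap(query, passage):
    """Jaccard word overlap: a crude stand-in for a real relevance score."""
    q, p = tokens(query), tokens(passage)
    return len(q & p) / len(q | p) if (q | p) else 0.0

def retrieve(query, passages, top_k=2, min_score=0.0):
    """Return up to top_k passages, best first, dropping any below min_score.

    With min_score=0.0 the retriever always fills its quota, even when the
    remaining passages have almost nothing to do with the question.
    """
    scored = sorted(((overlap(query, p), p) for p in passages), reverse=True)
    return [p for score, p in scored[:top_k] if score >= min_score]

# Placeholder knowledge base (hypothetical text, not from any real product).
PASSAGES = [
    "Guideline summary: first-line medication options for type 2 diabetes.",
    "Visitor parking validation is available at the front desk.",
]
QUERY = "What is a first-line medication for type 2 diabetes?"

unfiltered = retrieve(QUERY, PASSAGES)               # both passages come back
filtered = retrieve(QUERY, PASSAGES, min_score=0.2)  # only the relevant one
```

In the unfiltered call, the parking notice lands in the model's context simply because the retriever was asked for two passages. A relevance cutoff is one common mitigation, though whether it helps in practice depends on how well the score tracks clinical relevance.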
What This Means for CIOs
This study doesn’t say clinical AI tools are unsafe. The researchers found no statistically significant differences in harmful content or hallucination rates across any of the models evaluated. These tools are likely fine for routine use. The problem is the value proposition. Health systems are paying a premium — UpToDate Expert AI runs roughly $699 per clinician annually, while the frontier models are API-priced or already embedded in enterprise agreements — for a product that underperforms on the very dimension it’s supposed to own: clinical knowledge.
This is a procurement and governance problem. The research notes that clinical AI tools enter medical practice with little independent evaluation. Vendors rarely publish architecture details. Health systems are making significant purchasing decisions without the evidence base they’d require for any other clinical technology.
CIOs: Treat vendor performance claims with real skepticism. Demand independent proof. Build evaluation frameworks that mimic your actual workflows. Compare specialized clinical AI tools against frontier models you’re already licensing. If a tool doesn’t outperform what you have, reconsider procurement. The time for passive adoption is over—your decisions should be data-driven and defensible.
The cost of assuming otherwise is no longer theoretical — it’s sitting in your contract renewals.
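An internal evaluation of the kind recommended above can start small. The sketch below assumes you have collected blinded clinician ratings for each system on the four dimensions the study used (clinical correctness, completeness, safety, and clarity) and simply averages and ranks them. The record layout, the 1-to-5 scale, the system labels, and the scores are hypothetical placeholders, not data from the study.

```python
from collections import defaultdict
from statistics import mean

CRITERIA = ("correctness", "completeness", "safety", "clarity")

def summarize(ratings):
    """Return {system: {criterion: mean score, "overall": mean of those means}}."""
    buckets = defaultdict(lambda: defaultdict(list))
    for r in ratings:
        if r["criterion"] not in CRITERIA:
            raise ValueError(f"unknown criterion: {r['criterion']}")
        buckets[r["system"]][r["criterion"]].append(r["score"])
    summary = {}
    for system, by_criterion in buckets.items():
        means = {c: mean(by_criterion[c]) for c in CRITERIA if by_criterion[c]}
        means["overall"] = mean(list(means.values()))
        summary[system] = means
    return summary

def rank(summary):
    """System labels ordered from highest to lowest overall score."""
    return sorted(summary, key=lambda s: summary[s]["overall"], reverse=True)

def rows(system, scores):
    """Expand {criterion: [one score per rater]} into flat rating records."""
    return [
        {"system": system, "criterion": c, "score": s}
        for c, vals in scores.items()
        for s in vals
    ]

# Placeholder ratings from two raters on an assumed 1-5 scale.
EXAMPLE_RATINGS = rows("tool_a", {
    "correctness": [5, 4], "completeness": [4, 4],
    "safety": [5, 5], "clarity": [4, 5],
}) + rows("tool_b", {
    "correctness": [3, 3], "completeness": [3, 4],
    "safety": [5, 5], "clarity": [4, 4],
})

summary = summarize(EXAMPLE_RATINGS)
ordering = rank(summary)
```

Raters should not know which system produced which response, and the questions should come from your own clinicians' real queries. A production version would also need significance testing before anyone acts on a ranking; this sketch only does the bookkeeping.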


