Microsoft’s headline number is striking: in a new diagnostic benchmark, its experimental MAI Diagnostic Orchestrator, or MAI-DxO, correctly solved up to 85.5 percent of 304 difficult medical cases adapted from the New England Journal of Medicine. The physicians in the same benchmark averaged 20 percent.

This is not medical advice. It is a report on an AI benchmark, not evidence that anyone should use a chatbot or experimental system to diagnose themselves.

The finding is worth taking seriously, but it should not be read as the final word. The work appears as the arXiv preprint Sequential Diagnosis with Language Models, submitted in late June 2025 and revised on July 2. The authors state that they are submitting the work for external peer review. The preprint status matters: the result is large enough to deserve attention, but it has not yet been peer reviewed and was obtained under specific benchmark conditions, so it also deserves caution.

What Microsoft actually tested

The benchmark was not a simple multiple-choice medical exam. The Microsoft-affiliated team built what it calls the Sequential Diagnosis Benchmark, or SDBench, from 304 diagnostically challenging NEJM clinicopathological conference cases. These are not routine visits for a sore throat or ankle pain. They are difficult, often unusual cases designed to teach clinical reasoning.

Each case was turned into a stepwise diagnostic encounter. A human physician or AI system began with a short case abstract. It could then ask questions, request examination findings, order tests and eventually commit to a final diagnosis. A “gatekeeper” model held the full case file and revealed information only when explicitly requested.
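The encounter described above can be sketched as a simple loop. This is only an illustration of the idea, not the paper's implementation: the class and function names, the scripted policy and the placeholder findings ("finding A", "condition X") are all invented here, and the real gatekeeper is a language model rather than a dictionary lookup.

```python
from dataclasses import dataclass, field


@dataclass
class Gatekeeper:
    """Hypothetical stand-in: holds the full case file, reveals items only on request."""
    case_file: dict[str, str]  # assumed shape: request name -> finding
    log: list[str] = field(default_factory=list)

    def request(self, item: str) -> str:
        self.log.append(item)
        # Nothing is volunteered; anything not in the file gets a neutral reply.
        return self.case_file.get(item, "not available")


def run_encounter(gatekeeper, abstract, policy, max_turns=10):
    """Alternate between requests and, eventually, a committed diagnosis."""
    known = {"abstract": abstract}
    for _ in range(max_turns):
        action, value = policy(known)
        if action == "diagnose":
            return value, known
        known[value] = gatekeeper.request(value)
    return None, known  # ran out of turns without committing


def scripted_policy(known):
    """Toy policy: ask for three items in order, then commit to a diagnosis."""
    for item in ("history", "exam", "test:culture"):
        if item not in known:
            return ("request", item)
    return ("diagnose", "condition X")


# Placeholder case data, invented for illustration only.
gatekeeper = Gatekeeper(
    case_file={"history": "finding A", "exam": "finding B", "test:culture": "finding C"}
)
diagnosis, known = run_encounter(gatekeeper, "short case abstract", scripted_policy)
print(diagnosis, gatekeeper.log)
```

The point of the structure is that the agent sees only what it asked for: `known` starts with the abstract and grows one requested item at a time, while the rest of the case file stays with the gatekeeper.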

That structure is important. Most earlier AI medical benchmarks gave a model the whole vignette at once, sometimes with answer options. This benchmark tried to make the agent do part of the work that clinicians actually do: decide what information is worth seeking, what tests are justified, and when enough evidence has accumulated to make a call.

Performance was measured on two axes: diagnostic accuracy and estimated cost of tests and visits. That second axis is one reason the benchmark is interesting. A system that reaches the right diagnosis only by ordering everything would not be a good model of medical practice, even if it scored well on accuracy.
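A minimal sketch of two-axis scoring shows why accuracy alone would be misleading. The price table, the episode records and the function names below are hypothetical and chosen only to make the arithmetic visible; they are not the benchmark's actual cost figures or scoring code.

```python
from statistics import mean

# Hypothetical prices in dollars, for illustration only.
PRICES = {"visit": 300, "blood panel": 50, "ct scan": 1200, "biopsy": 2500}


def episode_cost(orders):
    """Sum the assumed price of everything ordered in one case."""
    return sum(PRICES[item] for item in orders)


def score(episodes):
    """Return (accuracy, average cost per case) for a list of episodes."""
    accuracy = mean(1.0 if e["correct"] else 0.0 for e in episodes)
    avg_cost = mean(episode_cost(e["orders"]) for e in episodes)
    return accuracy, avg_cost


# A policy that orders every test on every case.
order_everything = [
    {"correct": True, "orders": list(PRICES)},
    {"correct": True, "orders": list(PRICES)},
]

# A policy that orders only what each case seems to need.
targeted = [
    {"correct": True, "orders": ["visit", "blood panel"]},
    {"correct": True, "orders": ["visit", "ct scan"]},
]

everything_score = score(order_everything)
targeted_score = score(targeted)
print(everything_score, targeted_score)
```

In this toy setup both policies are equally accurate, so only the cost axis separates them. That is the distinction a single accuracy number would hide.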

What MAI-DxO is

MAI-DxO is not simply one language model answering a prompt. The paper describes it as a model-agnostic orchestrator. It can sit on top of systems from OpenAI, Gemini, Claude, Grok, DeepSeek and Llama families, asking them to work through diagnosis in a more structured way.

The design simulates a panel of physicians with different roles. It proposes differential diagnoses, decides which questions or tests might be useful, considers costs, and revises its position as information comes in. In the strongest configuration reported in the paper, MAI-DxO paired with OpenAI’s o3 model reached 85.5 percent accuracy. In a cost-conscious configuration, the same pairing reached about 80 percent while reducing estimated diagnostic costs compared with both physicians and off-the-shelf o3.

The comparison with physicians is the part most likely to be misread. The paper says a cohort of US and UK physicians had a median of 12 years of experience and averaged 20 percent accuracy on SDBench. But those physicians were recruited as medical generalists, not specialist teams. They were also asked not to use search engines, colleagues, textbooks or AI tools, partly to prevent them from finding the original NEJM cases online. The paper itself calls the physician comparison a first-order approximation.

That does not make the result meaningless. It means the number is a benchmark result under defined conditions, not a demonstration that an AI system is ready to replace doctors in a hospital.

Why the cases were so hard

The NEJM clinicopathological conference format is deliberately demanding. These cases often involve ambiguous symptoms, long diagnostic paths, rare diseases, overlapping possibilities and the slow accumulation of evidence. They are built for teaching, not for representing the ordinary prevalence of disease in a clinic.

The authors acknowledge that limitation. Because SDBench is built from complex, curated cases, it does not match a real-world deployment setting. The paper notes that there were no cases where the patient was healthy or had a benign syndrome, so the benchmark could not measure false positives in routine practice. It also could not capture patient discomfort, test availability, reimbursement barriers, local protocols, urgency, waiting times or the practical risk of invasive procedures.

Those omissions matter. In real medicine, a test is not merely a line item with a price. It can be painful, delayed, unavailable, risky, frightening or inappropriate for a particular patient. A good diagnosis is not only a correct answer. It is a sequence of decisions made with the patient in front of you.

Still, the benchmark points to a real weakness in many AI evaluations. Medicine is not mostly about recalling one answer after reading a neat paragraph. It is about deciding what to ask next. MAI-DxO’s advantage came from making that process more explicit.

The cost claim needs careful handling

The preprint reports that MAI-DxO can reduce estimated diagnostic costs as well as improve accuracy. In one comparison, the paper says MAI-DxO paired with o3 achieved 79.9 percent accuracy at an average estimated cost of $2,397 per case, while off-the-shelf o3 reached 78.6 percent at $7,850. The physician cohort averaged $2,963 per case.
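The arithmetic behind that comparison is simple enough to check directly. The sketch below uses only the figures quoted in this article (the physician accuracy is the 20 percent average reported above); the helper names are invented for illustration. On these numbers, the cost-conscious configuration's estimated cost is roughly 69 percent lower than off-the-shelf o3's and roughly 19 percent lower than the physician cohort's.

```python
# (accuracy in percent, average estimated cost per case in dollars), as quoted in the article.
reported = {
    "MAI-DxO + o3 (cost-conscious)": (79.9, 2397),
    "off-the-shelf o3": (78.6, 7850),
    "physician cohort": (20.0, 2963),
}


def relative_saving(baseline_cost, new_cost):
    """Fraction by which new_cost is lower than baseline_cost."""
    return (baseline_cost - new_cost) / baseline_cost


def dominates(a, b):
    """True if a is at least as accurate and no more costly than b, and not identical."""
    return a[0] >= b[0] and a[1] <= b[1] and a != b


dxo = reported["MAI-DxO + o3 (cost-conscious)"]
o3 = reported["off-the-shelf o3"]
physicians = reported["physician cohort"]

saving_vs_o3 = relative_saving(o3[1], dxo[1])
saving_vs_physicians = relative_saving(physicians[1], dxo[1])
accuracy_gap_vs_o3 = round(dxo[0] - o3[0], 1)  # percentage points

print(f"saving vs o3: {saving_vs_o3:.1%}")
print(f"saving vs physicians: {saving_vs_physicians:.1%}")
print(f"accuracy gap vs o3: {accuracy_gap_vs_o3} points")
print(dominates(dxo, o3), dominates(dxo, physicians))
```

The `dominates` check makes the shape of the claim explicit: against o3 alone, the accuracy gain is small (1.3 percentage points) and most of the difference is on the cost axis.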

These are simulated costs, not hospital bills. The authors used US cost estimates for ordered tests, but real costs vary across health systems, insurers, geography and hospital contracts. The benchmark also leaves out other costs, such as clinician time, patient travel, imaging availability and downstream consequences of a false lead.

Those caveats do not erase the point. The more interesting claim is that an AI system can be orchestrated to consider the value of information, not just the probability of being right. In a health system, diagnostic excess can be expensive and harmful. The promise of a system like MAI-DxO is not merely that it may know more diseases. It is that it may help reason about what evidence is worth gathering.

What outside reporting added

Wired reported that Microsoft framed the work as a step toward “medical superintelligence,” while also noting expert caution about the physician comparison and the need for clinical trials. The Guardian similarly reported that Microsoft does not describe the system as ready for clinical use, and that further testing is needed, including on more common symptoms.

That is the right place to leave the claim for now. The benchmark is more sophisticated than a static quiz, and the reported accuracy gap is large. But the system has not been tested as a deployed clinical decision-support tool, where messy records, anxious patients, incomplete histories, local constraints and liability all become part of the diagnostic environment.

There is also the question of trust. Doctors do more than produce diagnoses. They explain uncertainty, notice non-verbal cues, negotiate risk, understand family context, and decide when a technically available test is not the humane or practical next step. AI may assist some of that work, but a benchmark of final diagnoses does not measure the whole clinical role.

The important signal

MAI-DxO is best understood as a signal about where medical AI is moving. The frontier is shifting from “can a model answer a medical question?” to “can a system gather evidence, revise hypotheses and use resources judiciously?” That is a more serious test.

If the result holds up under peer review and later clinical validation, systems like MAI-DxO could become useful second readers for complex cases, especially in places where specialist expertise is scarce. They could help clinicians widen a differential diagnosis, identify a high-yield test, or avoid an expensive dead end.

But the phrase “up to 85.5 percent” should carry all of its conditions with it. It means maximum-accuracy performance on 304 difficult simulated NEJM cases, using an orchestrated AI system under benchmark rules. It does not mean 85.5 percent accuracy across medicine, in ordinary clinics, with real patients, or without human oversight.

The sober version is still significant. Microsoft has shown that when language models are organised to reason sequentially, they can perform much better than a one-shot medical chatbot. The next question is not whether that result sounds impressive. It is whether it survives the harder test of medicine outside the benchmark.