Two days, four bugs, one parity: the autopsy of our medical benchmark

🌐 This is an automatic translation of the original post in Spanish. Some nuances may have been lost along the way.

A seasoned surgeon will tell you without any drama: the dangerous thing is almost never the wound you can see. It’s the one bleeding underneath, quiet, while you stitch up the one on top and pat yourself on the back for a job well done.

Two days ago, on World Inflammatory Bowel Disease Day, we published an honest benchmark of our medical AI assistant. Today I’ll tell you why we nearly published it wrong, and what we found when we decided to cut open the one number that didn’t add up.

This is a post mortem. And like every good post mortem, it starts with a body on the table.

The patient that didn’t fit

The exam was simple to state: 27 questions on IBD pulled from MIRAGE, a test bank for medical AI. A mix of license-exam-style clinical vignettes (the "a patient comes to the clinic with…" kind), biomedical literature questions from PubMedQA, and a curated set of IBD-specific pharmacology questions.

To score, we used RAGAS answer_correctness: a metric from 0 to 1 in which another model acts as judge and compares our answer against the reference answer. And here’s a detail that will matter later: it behaves like a precision metric. It doesn’t reward writing a lot or proving how much you know. It rewards saying the right thing without padding.

Almost everything was going reasonably well. Until one scenario showed up that we couldn’t shake off.

A 38-year-old woman, with a prior ileal resection for Crohn’s disease, comes in with biliary colic. The correct answer is choledocholithiasis: a stone stuck in the bile duct. It’s not a trick question; in fact it has a beautiful biomechanical logic, because removing part of someone’s ileum ruins the reabsorption of bile salts and predisposes them precisely to forming stones. The body, once again, telling its own story.

Our system scored 0.382. A bare GPT-4o (the raw model, without our RAG, without our agent pipeline, just the model on its own) scored 0.892. A gap of 0.510 points. In a single scenario.

It’s worth stopping here, because this is uncomfortable. We had built an entire architecture around the model — document retrieval, orchestration, personas — and on that question the bare model wiped the floor with us. We had put up scaffolding and the scaffolding weighed more than the building.

The temptation to publish anyway

We could have published the benchmark as-is. One weak scenario out of 27 doesn’t sink an average. You round it off, attach a footnote and nobody asks.

We didn’t. And not out of pride, but because of a rule that’s almost anatomical: a number you can’t defend is a lump you haven’t palpated. It might be nothing. Or it might be the thing that kills you six months later, in production, with a real user in front of you.

So we opened it up. Two days of forensic root-cause analysis. And what we found wasn’t a bug. It was four. In series. And, like layers of tissue, each one covered the one beneath it.

Four bugs, in series, covering for one another

This is the part that deserves detail, because the pattern is more interesting than any of the failures on its own.

Bug 1 — Mojibake in the private use area

A RAG pipeline starts by reading documents. We feed it a medical corpus in PDF, chop it into chunks and store each chunk as a vector — a numerical fingerprint of its meaning — so we can search by similarity.

Two of those PDFs were poisoned. Their text came encoded in Unicode’s Private Use Area: a corner of the standard deliberately left empty, designed so that anyone can stick their own symbols in there. A font with a custom cmap had mapped the letters into that zone. For a human opening the PDF, it read perfectly. For our text extractor, it was mojibake: strings of garbage characters with no meaning whatsoever.

The poisonous thing wasn’t the garbage itself. It was that our embedder, the component that turns text into vectors, didn’t complain. You fed it gibberish and, happy as can be, it placed it in the vector space and started scoring it as "similar" to the questions. Picture an immune system that, instead of flagging foreign tissue, embraces it and integrates it. The body doesn’t detect the problem; it adopts it.

The fix: a heuristic decoder that detects the private-use-area offset and reverses the mapping, with docling as a safety net when the heuristic falls short.
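To make the idea concrete, here is a minimal sketch of that kind of heuristic, not the decoder that shipped. It assumes the font shifted every character by one constant offset into the Basic Multilingual Plane Private Use Area (U+E000 to U+F8FF). The function names, the 0x100 search grid and the 0.9 acceptance threshold are all hypothetical, and the docling fallback is left out.

```python
from typing import Optional

PUA_START, PUA_END = 0xE000, 0xF8FF  # BMP Private Use Area


def in_pua(ch: str) -> bool:
    return PUA_START <= ord(ch) <= PUA_END


def pua_ratio(text: str) -> float:
    """Fraction of non-whitespace characters that sit in the Private Use Area."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum(in_pua(c) for c in chars) / len(chars)


def guess_offset(text: str, min_score: float = 0.9) -> Optional[int]:
    """Guess a constant offset that maps PUA code points back to printable ASCII.

    Assumption: candidates are multiples of 0x100 inside the PUA.
    """
    pua = [ord(c) for c in text if in_pua(c)]
    if not pua:
        return None
    best_offset, best_score = None, 0.0
    for offset in range(PUA_START, PUA_END, 0x100):
        # Score = share of PUA characters that land on printable ASCII.
        score = sum(0x20 <= cp - offset <= 0x7E for cp in pua) / len(pua)
        if score > best_score:
            best_offset, best_score = offset, score
    return best_offset if best_score >= min_score else None


def decode_pua(text: str) -> str:
    """Reverse the mapping when an offset is found; otherwise return text unchanged."""
    offset = guess_offset(text)
    if offset is None:
        return text  # this is where a fallback extractor would take over
    return "".join(
        chr(ord(c) - offset) if in_pua(c) and 0x20 <= ord(c) - offset <= 0x7E else c
        for c in text
    )
```

A check like pua_ratio is also a cheap guard to run before embedding: a chunk that is mostly private-use characters can be rejected loudly instead of being indexed in silence.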

Bug 2 — The RAG distractor effect

With the first bug fixed, we expected the wound to close. It didn’t. And here something more subtle appeared.

Dense similarity search — comparing vectors — every now and then returned a chunk with a high score, a 0.75, that nevertheless didn’t share a single substantive word with the question. The vector said "this is relevant". The text, read by a human, had nothing to do with it.

It’s the equivalent of a badly calibrated reflex: the stimulus wasn’t the right one, but the reflex arc fires anyway. And an irrelevant chunk slipped into the context is not neutral. It distracts the model. It hands it plausible material to build an elegant, wrong answer.

The fix: a relevance gate, toggleable by environment variable, that discards the context when the similarity score and the lexical overlap both fall below their thresholds — or when the overlap drops below a veto floor, no matter how high the vector score is. If the two signals don’t confirm each other, it doesn’t get in.
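A gate of this shape can be sketched in a few lines. This is an illustration under assumptions, not our production code: the environment variable name RELEVANCE_GATE, the three thresholds, the stopword list and the overlap definition (share of the question’s content words found in the chunk) are all hypothetical.

```python
import os
import re

# Tiny illustrative stopword list (assumption); a real one would be larger.
STOPWORDS = {"the", "and", "with", "for", "what", "which", "after", "that", "this", "are", "does"}


def content_words(text: str) -> set:
    """Lowercased words longer than two characters, minus stopwords."""
    return {
        w for w in re.findall(r"[a-z0-9]+", text.lower())
        if len(w) > 2 and w not in STOPWORDS
    }


def lexical_overlap(question: str, chunk: str) -> float:
    """Share of the question's content words that also appear in the chunk."""
    q = content_words(question)
    if not q:
        return 0.0
    return len(q & content_words(chunk)) / len(q)


def keep_chunk(question: str, chunk: str, similarity: float, *,
               min_similarity: float = 0.6,
               min_overlap: float = 0.2,
               veto_floor: float = 0.05) -> bool:
    """Return True if the retrieved chunk may enter the context."""
    if os.environ.get("RELEVANCE_GATE", "1") == "0":
        return True  # gate switched off: everything passes
    overlap = lexical_overlap(question, chunk)
    if overlap < veto_floor:
        return False  # lexical veto, no matter how high the vector score is
    if similarity < min_similarity and overlap < min_overlap:
        return False  # both signals are weak
    return True
```

With these made-up thresholds, a chunk scoring 0.75 that shares no content word with the question is vetoed, while a chunk with the same score that mentions "ileal resection" passes.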

Bug 3 — The wrong persona in the wrong room

This is my favorite, because it’s not a code failure. It’s an identity failure.

Our default persona is called Matucha: a companion for people with chronic illness. It’s designed to talk to a real patient — with warmth, with care, reminding them to check with their medical team. For that mission, it’s exactly what it should be.

The problem is that Matucha was also answering the exam-style vignettes. And to a medical licensing board it was responding with empathetic preambles and "consult your healthcare professional" warnings, burying the diagnosis under layers of kindness. The correct answer was in there. Just entombed.

It was the right animal in the wrong habitat. An extraordinary fish we’d asked to climb a tree.

The fix: dispatch by mode. Third-person questions — "A 38-year-old woman comes to the clinic…" — are recognized as academic register and routed through a clinical channel, direct and with no preambles. Questions from an actual patient still reach Matucha, intact.
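As a rough illustration of what dispatching by register could look like, here is a sketch built on two regular expressions. The patterns and the route labels "clinical" and "matucha" are assumptions made for the example; a real classifier would need more signals than this.

```python
import re

# Third-person vignette cues, e.g. "A 38-year-old woman ..." (hypothetical pattern).
VIGNETTE = re.compile(
    r"\b(?:a|an)\s+\d{1,3}[- ]year[- ]old\b|\bpatient\s+(?:presents|comes|is brought)\b",
    re.IGNORECASE,
)
# First-person cues that suggest a real patient is writing (hypothetical pattern).
FIRST_PERSON = re.compile(r"\b(?:i|my|me|mine)\b", re.IGNORECASE)


def route(question: str) -> str:
    """Pick a channel: 'clinical' for exam-style vignettes, 'matucha' otherwise."""
    if FIRST_PERSON.search(question):
        return "matucha"  # someone speaking about themselves gets the companion
    if VIGNETTE.search(question):
        return "clinical"  # academic register: direct answer, no preambles
    return "matucha"  # when in doubt, default to the companion persona
```

The default matters more than the patterns: when the signal is ambiguous, the sketch falls back to the companion persona, so a real patient is never answered in exam register by mistake.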

Bug 4 — The verbosity that dilutes

With the academic channel up and running, one last drip remained. The academic prompt produced answers of 8 or 9 sentences, whereas the bare GPT-4o answered in 4 or 5.

And here the detail from the start comes back. RAGAS answer_correctness behaves like a precision metric. Each correct but irrelevant sentence of pathophysiology you add doesn’t raise the score: it dilutes it. It’s like a lab panel with too many markers ordered "just in case": each extra value contributes no signal, only noise that masks the data that matters.
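The arithmetic behind the dilution is easy to see in a toy example. As far as the RAGAS documentation describes it, the factual part of answer_correctness is an F1 score over statements, where statements in the answer that the reference does not support count as false positives. The sketch below covers only that component and ignores the semantic-similarity term the metric also blends in; the statement counts are invented for illustration.

```python
def statement_f1(tp: int, fp: int, fn: int) -> float:
    """F1 over statements: TP / (TP + 0.5 * (FP + FN))."""
    denom = tp + 0.5 * (fp + fn)
    return tp / denom if denom else 0.0


# Same three correct statements, different amounts of padding (hypothetical counts).
concise = statement_f1(tp=3, fp=1, fn=0)  # 3 / 3.5
verbose = statement_f1(tp=3, fp=5, fn=0)  # 3 / 5.5

print(f"concise: {concise:.3f}  verbose: {verbose:.3f}")
```

Both answers contain the same diagnosis. The only difference is how many extra sentences surround it, and the extra sentences alone pull the score down.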

The fix: tighten the prompt to 3 or 4 sentences at most. Say the right thing, and shut up.

The fifth problem wasn’t a product bug

There’s a fifth finding that deserves a mention of its own, because it’s the most treacherous of all.

Our evaluation orchestrator had a 30-second timeout. And it was cutting off 8 of the 9 exam-type scenarios before they could be measured. That is: for part of the process we were fixing bugs without even being able to see the effect of the fixes, because the test harness was cutting the answers off before scoring them.
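One way to keep a harness from censoring in silence is to record timeouts as their own outcome and refuse to report scores when too many scenarios are cut off. The sketch below uses asyncio.wait_for from the standard library. The function names, the result shape and the max_timeout_rate guard are assumptions for illustration, not our orchestrator.

```python
import asyncio


async def run_scenario(answer_fn, question: str, timeout_s: float) -> dict:
    """Run one scenario and record a timeout explicitly instead of dropping it."""
    try:
        answer = await asyncio.wait_for(answer_fn(question), timeout=timeout_s)
        return {"question": question, "status": "ok", "answer": answer}
    except asyncio.TimeoutError:
        return {"question": question, "status": "timeout", "answer": None}


async def run_benchmark(answer_fn, questions, timeout_s: float = 30.0,
                        max_timeout_rate: float = 0.1) -> list:
    """Run all scenarios; fail loudly if too many were cut off before scoring."""
    results = [await run_scenario(answer_fn, q, timeout_s) for q in questions]
    timed_out = [r for r in results if r["status"] == "timeout"]
    rate = len(timed_out) / len(results) if results else 0.0
    if rate > max_timeout_rate:
        raise RuntimeError(
            f"{len(timed_out)}/{len(results)} scenarios timed out; "
            "scores would be censored"
        )
    return results
```

With a guard like this, 8 timeouts out of 9 would stop the run with an error instead of producing a table of numbers that looks complete.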

The important thing: the real user experience was fine the whole time. The product was responding. What was broken was the measuring instrument, not the patient. And that is chilling, because it’s the error most like a broken thermometer: it doesn’t make you sick, but it leaves you making every decision blind.

Where we landed

Two days. Eight pull requests. The same RAGAS judge run over the same subset of 27 IBD scenarios.

The result: 0.310 RAGAS answer_correctness for SynapseFlow, 0.310 for the bare GPT-4o. Parity. And within that overall parity, we win precisely in the buckets where document retrieval really matters: +0.056 in IBD pharmacology and +0.049 in PubMedQA literature.

I want to be clear about what this means and what it doesn’t. It doesn’t mean we’re better. It means that, after debugging four bugs in series, an agentic RAG architecture holds its own against the raw model, and starts to stand out exactly where it’s supposed to add value: when you have to go fetch a piece of data from a document. That’s the honest position. No "state of the art", no superlatives. Parity earned the hard way.

Where are we still behind? In 3 of the 9 exam-type vignettes, where our academic answer was correct but phrased differently from the reference. We suspect it’s judge variance, not a product failure. We’re investigating, and we’ll say so when we know.

What we learned

If you build agentic medical AI for a living, the lesson boils down to one sentence: agent pipelines stack failure modes.

A single out-of-place answer_correctness score hid four bugs in series, and each one masked the next. You fix the mojibake and the distractor appears. You fix the distractor and the persona appears. You fix the persona and the verbosity appears. It’s the difference between an organism and a loose part: in a system with many layers, no symptom points cleanly to a single cause. You have to dissect.

And the forensic discipline of refusing to publish a number you can’t defend is not a formality. It’s the whole game.

The full report — with the scenario-by-scenario detail, all the baseline JSON and the audit trail of the 8 PRs — is here: github.com/DEUS-AI/SynapseFlow → docs/benchmarks/eje1-ibd-baseline.md.

What’s next: the same 27 scenarios against KAG (Liang et al., 2024). That’s the comparison this whole autopsy was buying us the right to make with credibility. We’ll report it when we have the numbers. Not before.

Next post: SynapseFlow vs. KAG — the comparison this autopsy made possible.