Stop Curating Medical Data. Start Making It Available.
Schema-first extraction was opinion encoded as infrastructure. The engineering that matters now is fidelity: guaranteeing the model sees the full record, structured so it can reason through it.


Shere Saidon

CEO & Founder at LlamaLab

Published April 9, 2026
Updated April 11, 2026
5 min read
Technology
Part of: AI and Legal Technology


For the last decade, the playbook for unstructured data was the same: build a schema, extract the fields, normalize everything, query the clean version. It made sense when models couldn't read.

But models can read now. GPT-4 hit 99.8% accuracy extracting structured data from unstructured pathology reports in an NIH study: 369 patients, 4,797 data points, 13 variables. GPT-4o hit 91.4% retrieving patient demographics, diagnostics, and pharmacological data from unstructured medical reports. Both on raw clinical text.

And context windows have crossed 1 million tokens. A full patient file fits in a single pass.

So here's the question nobody in data engineering wants to hear: if the model can read the original record better than your extraction pipeline can summarize it, why are you still extracting?

  • 99.8% extraction accuracy: GPT-4 on unstructured pathology reports, 4,797 data points across 13 variables (NEJM AI, 2025).
  • 1M+ token context windows: frontier models now credible at scale, with effective long-context use improving 250x in 9 months (Epoch AI, 2026).
  • 9 of 14 clinical tasks won: longer-context models beat prior state of the art on real EHR prediction tasks (ICLR, 2025).

The Schema Was Always the Bias

Extract, normalize, model, query. Four stages, each one deciding what matters before the next stage ever sees the data. That felt like rigor. It was opinion, encoded as infrastructure.

A schema is a list of things you already know to look for. ICD codes, provider names, dates of service, prescribed medications: all important, all anticipated.

But the signal that changes outcomes is almost never the anticipated signal. It's the offhand provider note. The inconsistent symptom timeline. The medication history that contradicts the diagnosis three visits later.

When you curate a nurse's note into a structured field, you're encoding their interpretation as fact. The model doesn't need the interpretation. It needs the test results, the vitals, the lab work. Give it the source and let it draw its own conclusions.

Schemas drop that signal on the floor. By design.

And it gets worse when you add LLMs to the extraction pipeline. Research published in Applied Sciences (2026) documents how LLM-assisted knowledge graph construction introduces compounded bias: hallucinated triples like (Drug X, cures, Disease Y) from text that only indicated symptom reduction.

Schema-based paradigms force normalization that overstates evidence and strips nuance before the model ever reasons over the full record.
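To make that failure concrete, here is a toy sketch of forced normalization. Everything in it is invented for illustration (the relation vocabulary, the mapping rules, the drug and disease names); the point is only that a closed schema has nowhere to put a hedge, so the hedge is discarded.

```python
# Invented closed vocabulary: every finding must map to one of these labels.
SCHEMA_RELATIONS = {"cures", "causes", "contraindicates"}

def normalize_relation(source_phrase: str) -> str:
    """Naive mapper from free clinical text to the schema's closed vocabulary.

    Qualifiers like "associated with", "modest", and "some patients"
    have no slot in the schema, so they simply vanish.
    """
    phrase = source_phrase.lower()
    if "reduc" in phrase or "improv" in phrase or "cure" in phrase:
        return "cures"  # "symptom reduction" gets overstated as "cures"
    if "worsen" in phrase or "caus" in phrase:
        return "causes"
    return "contraindicates"

source = "Drug X was associated with modest symptom reduction in some patients"
triple = ("Drug X", normalize_relation(source), "Disease Y")
# Downstream reasoning now sees only ("Drug X", "cures", "Disease Y").
```

A model reading the original sentence can weigh the hedged evidence; a model reading only the triple cannot, because the evidence was flattened before it arrived.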

The Real Challenge

Medical records are long. Thousands of pages across dozens of providers, sometimes spanning decades. Volume used to be the hard problem. But context windows have caught up: frontier models now handle over 1 million tokens in a single pass. A full patient file fits.

The hard problem that remains is temporal reasoning across a patient timeline: connecting a provider note on page 12 to a medication change on page 247 to an outcome documented six months later. Clinical benchmarks are starting to test for this.

Longer context windows improve clinical prediction on real EHR data: a Mamba-based model beat prior state of the art on 9 of 14 tasks (ICLR, 2025). But the same study surfaced three failure modes specific to clinical records that no synthetic benchmark captures.

Important

The failure modes that matter

Copy-forwarded diagnoses that create false signal. Irregular time intervals between visits. Increasing disease complexity over time. These are problems of structure, not volume. No amount of curation fixes them. Fidelity might.

LLMs tested on long clinical documents show strong performance on extraction and sorting, but consistent failure on identifying missing information (LongHealth, 2025). Schema-first extraction has the same blind spot. Your pipeline can't flag what it wasn't built to look for either. Neither approach solves this yet.

Effective long-context use has improved over 250x in nine months (Epoch AI, 2026). The volume problem is solved. The question now is whether the architecture makes the record easy to reason through: page order preserved, visits sequenced, providers attributed, timelines intact.

Engineering problem, yes. Curation problem, no.

From Filtering to Fidelity

Here's what filtering looks like in practice: you take a model that can read raw clinical text and pre-filter what it's allowed to see through an extraction schema.

That's like buying a car with a world-class navigation system and insisting on giving turn-by-turn directions from the passenger seat. You're not helping. You're the bottleneck.

I've been calling the alternative the Full Record Principle. Stop deciding what the model gets to see. Guarantee it sees everything, structured so it can reason through it.

Filtering (Curation)

  • Fields decide what matters. Everything outside the schema is invisible.
  • Hallucinated triples and forced normalization strip nuance before the model sees the record.
  • Data engineering, data modeling, 'context engineering': all different words for curate harder.
  • You can only retrieve what your schema anticipated. The unexpected is structurally unreachable.

Fidelity (Full Record)

  • The full record goes in coherent and intact: page order, attribution, timestamps, temporal sequence.
  • Preserve the coherence the model needs to connect page 12 to page 247.
  • OCR quality, provider attribution, visit sequencing, audit trails. Structure for reasoning, not extraction.
  • Every answer traces back to a specific page. No intermediate extraction layer to drift from.
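A minimal sketch of what a fidelity-first assembly step could look like. The `Page` type, field names, and anchor format are assumptions for illustration, not LlamaLab's actual implementation: the idea is just that pages are sequenced by visit date, page order is preserved, no text is dropped, and every block carries a provenance anchor an answer can cite.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class Page:
    page_no: int      # position in the original record
    provider: str     # attributed author of the note
    visit_date: date  # timestamp for temporal sequencing
    text: str         # raw page content, unmodified

def assemble_record(pages: list[Page]) -> str:
    """Concatenate the full record with provenance anchors.

    No fields are extracted and no text is dropped: visits are
    sequenced by date, page order is preserved within a visit, and
    each block is anchored so an answer can trace back to a page.
    """
    ordered = sorted(pages, key=lambda p: (p.visit_date, p.page_no))
    blocks = [
        f"[p.{p.page_no} | {p.provider} | {p.visit_date.isoformat()}]\n{p.text}"
        for p in ordered
    ]
    return "\n\n".join(blocks)

record = assemble_record([
    Page(247, "Dr. Reyes", date(2024, 9, 3), "Metformin discontinued."),
    Page(12, "Dr. Alvarez", date(2024, 3, 1), "Patient reports fatigue."),
])
# The March visit on page 12 now precedes the September note on page 247.
```

Note what this step refuses to do: it never decides which sentences matter. That decision is deferred to the model, at query time, with the anchors intact.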

Where This Leaves Data Engineering

The better question for anyone building on medical data isn't "what fields do you extract?" It's "can the model re-read the actual record when you ask it a new question?" The first question measures how well you curate. The second measures whether the model actually has what it needs.
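Under the Full Record Principle, a new question is just a new pass over the same intact record. A sketch of that shape (the helper name and prompt wording are invented; the anchored-record format is an assumption):

```python
def build_query(record: str, question: str) -> str:
    """Build a single-pass prompt: the intact record plus the new question.

    No extraction layer sits between the two, so nothing the schema
    failed to anticipate is out of reach, and every claim can cite
    a page anchor.
    """
    return (
        "Below is a complete patient record with page anchors.\n"
        "Cite the anchor (e.g. [p.12]) for every claim in your answer.\n\n"
        f"{record}\n\nQuestion: {question}"
    )

prompt = build_query(
    "[p.12 | Dr. Alvarez | 2024-03-01]\nPatient reports fatigue.",
    "When was fatigue first documented?",
)
```

Asking a different question tomorrow means calling `build_query` again with the same record, not rebuilding a pipeline.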

Frontier models read clinical text better than most extraction pipelines summarize it. The accuracy varies by task, but the gap between "model reads the original" and "pipeline extracts a summary" keeps widening. Not in the pipeline's favor.

Medical records are clinical histories first. The same temporal reasoning that matters for attorneys building a case matters for understanding what actually happened to a patient. Fidelity isn't a legal question or a medical question. It's both.

This Is How We Think About Medical Data

The Full Record Principle is the architecture behind everything we build at LlamaLab. If you're rethinking how AI handles unstructured clinical records, we should talk.

Sources: NEJM AI — LLM-Mediated Data Extraction from Pathology Reports (2025), JMIR AI — LLM Retrieval of Patient Information from Medical Reports (2025), Context Clues — Long Context Models for Clinical Prediction on EHRs (ICLR, 2025), LongHealth — QA Benchmark with Long Clinical Documents (Springer, 2025), MDPI — Analyzing Bias in LLM-Augmented Knowledge Graph Systems (2026), Epoch AI — Context Windows Data Insights (2026).
