AI Telephone

How well can frontier LLMs convert unstructured clinical text into structured data - and back again?

LLMs are advancing rapidly - cheaper, faster, bigger context windows. But traditional structured systems are here to stay: they run faster, cost less, are better for the environment, produce deterministic outputs, and have clear accountability.

Healthcare interoperability standards are the backbone of health data exchange, and LLMs need to work well within that reality. HL7 FHIR® is one such standard, and it is the one used in this experiment.

The experiment: we take a clinical scenario as unstructured text and ask an LLM to encode it as FHIR R4 JSON. Then we give that FHIR to the same or a different LLM to reconstruct the original narrative. Three LLM judges compare the output against the original to measure what survived the roundtrip.

How does it work?

1️⃣

Unstructured input

A clinical narrative with patient details, conditions, and medications

2️⃣

Encode to FHIR

LLM converts free text to structured FHIR R4 JSON, validated by HAPI FHIR

3️⃣

Decode back to text

Same or different LLM reconstructs the narrative from the FHIR bundle

4️⃣

Fidelity check

Three LLM judges identify missing or hallucinated information
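
The four steps above can be sketched as a small pipeline. This is only an illustration of the flow, not the site's actual code: the names `roundtrip` and `RoundtripResult` are hypothetical, and the model calls and the validator are passed in as plain callables instead of real API clients.

```python
from dataclasses import dataclass, field
from typing import Callable

# Hypothetical type aliases: each step is just a function supplied by the caller.
Encoder = Callable[[str], str]           # narrative -> FHIR JSON text
Validator = Callable[[str], None]        # raises if the FHIR JSON is invalid
Decoder = Callable[[str], str]           # FHIR JSON text -> narrative
Judge = Callable[[str, str], dict]       # (original, reconstructed) -> findings


@dataclass
class RoundtripResult:
    fhir_json: str
    reconstructed: str
    findings: list = field(default_factory=list)  # one entry per judge


def roundtrip(scenario: str, encode: Encoder, validate: Validator,
              decode: Decoder, judges: list) -> RoundtripResult:
    """Run one narrative -> FHIR -> narrative roundtrip and collect judge findings."""
    fhir_json = encode(scenario)          # step 2: encode to FHIR
    validate(fhir_json)                   # step 2: validation gate
    reconstructed = decode(fhir_json)     # step 3: decode back to text
    findings = [judge(scenario, reconstructed) for judge in judges]  # step 4
    return RoundtripResult(fhir_json, reconstructed, findings)
```

Passing the same model for `encode` and `decode` gives a same-LLM test; passing different ones gives a cross-LLM test.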

Type 2 Diabetes - Metformin

February 6, 2026 at 12:36 PM UTC

Scenario: Type 2 Diabetes - Metformin

Patient John Smith is a 45-year-old male who was diagnosed with Type 2 Diabetes Mellitus last month. His physician prescribed Metformin 500mg to be taken twice daily with meals. His latest HbA1c was 7.8%. He has no known drug allergies and his BMI is 28.

How did the LLMs do?

Same LLM Tests

Mistral→Mistral

Reconstructed text

John Smith, a 45-year-old male, was diagnosed with Type 2 Diabetes Mellitus on April 1, 2023. His recent HbA1c level, measured on April 15, 2023, was 7.8%, and his BMI was recorded as 28 kg/m² on the same date. Mr. Smith has been prescribed Metformin 500mg, to be taken orally twice daily with meals. He has no known drug allergies.

Comparison: 3 issues

Hallucinated:

hba1c measurement date of april 15 2023 fabricated original does not specify when hba1c was measured
specific diagnosis date added diagnosed with type 2 diabetes mellitus on april 1 2023 original only says diagnosed last month
specific bmi recording date added bmi recorded on april 15 2023 original provides no measurement date

Claude→Claude

Reconstructed text

John Smith is a 45-year-old male with no known drug allergies. He has a confirmed diagnosis of type 2 diabetes mellitus without complications, with an onset date of December 1, 2024. He has been prescribed metformin hydrochloride 500 mg oral tablet, to be taken twice daily with meals, for management of his diabetes.

Laboratory results obtained on January 10, 2025, revealed a hemoglobin A1c (HbA1c) of 7.8%, which is elevated above the normal reference range of less than 5.7%. His body mass index (BMI) was recorded at 28 kg/m², which is also above the normal range of 18.5–24.9 kg/m², placing him in the overweight category.

Comparison: 5 issues

Missing information:

type 2 diabetes mellitus was diagnosed last month relative timing of diagnosis not stated in the reconstruction

Hallucinated:

specific diabetes onset date of december 1 2024 fabricated original only states last month
specific lab result date of january 10 2025 fabricated original provides no date for hba1c
hba1c normal reference range of less than 5.7% added not in original
bmi normal range of 18.5–24.9 kg/m² and overweight category classification added not in original

ChatGPT→ChatGPT

Reconstructed text

John Smith is a male born in 1981. He has no known drug allergies (medication allergy status documented as active and confirmed).

He carries an active, confirmed diagnosis of type 2 diabetes mellitus, with onset documented as 2026-01-01 (problem list item). On 2026-01-01, an active medication order was authored for metformin hydrochloride 500 mg oral tablet: take 500 mg by mouth twice daily with meals (oral route; frequency 2 times per day).

Recent clinical measurements from 2026-02-06 include a hemoglobin A1c (HbA1c) of 7.8%, interpreted as high, and a body mass index (BMI) of 28 kg/m². No appointments or scheduling information are included in the provided records.

Comparison: 6 issues

Missing information:

age explicitly stated as 45 years old

Hallucinated:

hba1c interpretation as high added not stated in original
birth year 1981 fabricated not stated in original
birth year added born in 1981 which is not provided in the original scenario
specific diabetes onset date documented as 2026-01-01 original only says diagnosed last month
specific date for the metformin medication order authored on 2026-01-01 is added and not provided in the original scenario

Gemini→Gemini

Reconstructed text

John Smith, a male born in 1979, has a confirmed active diagnosis of Type 2 diabetes mellitus without complications. He has no known drug allergies. Recent clinical observations indicate a Body Mass Index (BMI) of 28 kg/m2 and a Hemoglobin A1c level of 7.8%. The patient has an active order for metformin hydrochloride 500 mg, with dosage instructions to take 500 mg twice daily with meals.

Comparison: 6 issues

Missing information:

diagnosis of type 2 diabetes mellitus was made last month timing of diagnosis not mentioned
patient age stated as 45 years old
that the metformin prescription was prescribed by his physician
diagnosis timing diagnosed last month

Hallucinated:

birth year 1979 added original only states age 45 no specific birth year
birth year added male born in 1979 not stated in the original and may conflict with the stated age of 45 depending on current year
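
Whether a birth year conflicts with a stated age is simple date arithmetic. The sketch below assumes the run date shown for this scenario (February 6, 2026) is the reference date; the function name `possible_birth_years` is made up for illustration. Under that assumption, 1980 and 1981 are both consistent with an age of 45, while 1979 is not.

```python
from datetime import date


def possible_birth_years(age: int, on: date) -> set:
    """Return the birth years consistent with being `age` full years old on `on`."""
    # Birthday already reached this year: born in on.year - age.
    years = {on.year - age}
    # Birthday still to come this year: born one year earlier.
    # On December 31 every birthday of the year has been reached, so skip it.
    if (on.month, on.day) != (12, 31):
        years.add(on.year - age - 1)
    return years


run_date = date(2026, 2, 6)  # assumption: the run timestamp shown for this scenario
print(sorted(possible_birth_years(45, run_date)))  # [1980, 1981]
```

Either way, a year-only birth date cannot carry the exact age, so the decoder has to guess or omit it.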

Cross LLM Tests

Gemini→Mistral

Reconstructed text

John Smith, a 45-year-old male, has been diagnosed with Type 2 diabetes mellitus, confirmed in October 2023. His current medication regimen includes metformin hydrochloride 500mg oral tablets, taken twice daily with meals. Recent laboratory observations indicate a Hemoglobin A1c level of 7.8%. Additionally, his Body Mass Index (BMI) is recorded at 28 kg/m². Mr. Smith has no known drug allergies.

Comparison: 1 issue

Hallucinated:

diagnosis timing changed/added original states type 2 diabetes mellitus was diagnosed last month but reconstruction specifies it was confirmed in october 2023

ChatGPT→Gemini

Reconstructed text

John Smith, a male born in 1980, has no known drug allergies. He has a confirmed, active diagnosis of Type 2 Diabetes Mellitus with an onset of January 2026. To manage this condition, he is prescribed Metformin 500 mg tablets, with instructions to take 500 mg by mouth twice daily with meals. Clinical observations recorded on February 1, 2026, indicate a Hemoglobin A1c level of 7.8% and a Body Mass Index (BMI) of 28 kg/m².

Comparison: 5 issues

Missing information:

patient age stated as 45 years old reconstruction gives a birth year but does not state the age

Hallucinated:

specific observation date of february 1 2026 fabricated original does not specify a date for clinical observations
specific birth year 1980 fabricated original only states age 45
birth year born in 1980 was added not provided in the original scenario
diabetes onset specified as january 2026 original only states diagnosed last month

Claude→ChatGPT

Reconstructed text

John Smith is a male born in 1980. He has a confirmed, active diagnosis of Type 2 diabetes mellitus (SNOMED 44054006; ICD-10-CM E11), with onset and recording date of 2025-05-01.

For diabetes management, he has an active medication order for metformin hydrochloride 500 mg oral tablet (RxNorm 861004), prescribed for Type 2 diabetes mellitus. The dosing instructions are 500 mg by mouth twice daily with meals (frequency: 2 times per day).

On 2025-06-01, laboratory testing showed Hemoglobin A1c (HbA1c) 7.8%, interpreted as high, with a stated normal reference of < 5.7%. On the same date, his vital signs included BMI 28 kg/m², also interpreted as high, with a reference range of 18.5–24.9 kg/m².

Allergy history documents no known drug allergies (entered as “No known allergy”), with active clinical status and confirmed verification status, categorized under medication. No appointments or scheduled encounters are included in the provided records.

Comparison: 7 issues

Missing information:

patient explicitly stated as 45-year-old reconstruction gives birth year but does not state age

Hallucinated:

birth year 1980 fabricated original only states patient is 45 years old
birth year added born in 1980 which is not provided in the original scenario
specific diabetes onset/recording date added 2025-05-01 not stated in the original
hba1c interpretation and reference range added interpreted as high normal reference < 5.7% not provided in the original
allergy record attributes added entered as no known allergy with active clinical status confirmed verification status categorized under medication beyond the original's simple statement of no known drug allergies
statement that no appointments or scheduled encounters are included which is not mentioned in the original

Mistral→Claude

Reconstructed text

John Smith is a 47-year-old male with a confirmed, active diagnosis of Type 2 Diabetes Mellitus, with onset on June 1, 2023. He is currently prescribed Metformin 500 mg orally twice daily with meals. Laboratory results obtained on June 15, 2023, revealed a Hemoglobin A1c (HbA1c) of 7.8%, indicating suboptimal glycemic control. His body mass index (BMI) at that time was 28 kg/m², placing him in the overweight category.

Comparison: 8 issues

Missing information:

no known drug allergies
statement that the patient has no known drug allergies

Hallucinated:

age changed from 45 to 47
specific laboratory result date of june 15 2023 fabricated original does not specify a date for hba1c results
added a specific diabetes onset date of june 1 2023 original only says diagnosed last month
added a specific hba1c lab date of june 15 2023 original gives hba1c value but no date
added interpretation that hba1c 7.8% indicates suboptimal glycemic control not stated in original
diagnosis onset date june 1 2023 added

Common questions

What does this mean for clinicians?

AI is already in clinical practice - scribes that document visits and diagnostic aids such as CNNs in radiology are a reality. LLM-based interoperability could be next, helping bridge gaps between disconnected health systems and reducing time spent gathering patient history. LLMs are improving, but they still make errors: in the scenario above, every roundtrip introduced at least one discrepancy.

For now, treat AI-generated summaries as a starting point, not a source of truth. Always verify critical information directly with patients, especially allergies, current medications, and recent procedures.

What does this mean for policymakers?

AI-driven interoperability could accelerate the connectivity that healthcare systems have been striving toward for decades. The results here show it works best with human oversight for now.

Consider frameworks that allow AI to assist with data exchange while maintaining human oversight, especially for critical information like allergies and medications. Standards bodies may need to develop guidelines for AI-assisted FHIR implementations.

What does this mean for IT strategists?

AI-assisted interoperability represents a potential paradigm shift from point-to-point integrations to more flexible, adaptive systems. This experiment shows current frontier models can handle FHIR with varying degrees of accuracy.

When planning your roadmap, consider hybrid approaches: AI for initial data mapping and transformation, with validation layers and human review for high-risk data. Re-evaluate periodically, because model capabilities are improving rapidly.

What does this mean for developers?

LLMs show great value and potential for healthcare integrations, but beware: the FHIR JSON they generate might pass schema validation while having semantic issues - correct format, wrong meaning. Multiple validation layers beyond schema checks are essential.

Unless the risk is sufficiently small, human oversight is a must. That said, this field is rapidly evolving - just two years ago, we didn't think LLMs could be as capable as they are now. This website is a real-time tracker of their potential in healthcare interoperability.
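
One example of a validation layer beyond schema checks: flag every date in the generated FHIR whose year never appears in the source narrative. This is a rough sketch, not the site's method. The helper names are hypothetical, and the set of date elements is a small hand-picked subset of FHIR R4 (`birthDate`, `onsetDateTime`, `recordedDate`, `effectiveDateTime`, `authoredOn`).

```python
import re

# Hand-picked FHIR R4 date elements to inspect (an assumption, not exhaustive).
DATE_KEYS = {"birthDate", "onsetDateTime", "recordedDate",
             "effectiveDateTime", "authoredOn"}


def iter_dates(node, path=""):
    """Yield (path, value) for every date element found anywhere in the JSON tree."""
    if isinstance(node, dict):
        for key, value in node.items():
            child = f"{path}.{key}" if path else key
            if key in DATE_KEYS and isinstance(value, str):
                yield child, value
            else:
                yield from iter_dates(value, child)
    elif isinstance(node, list):
        for index, item in enumerate(node):
            yield from iter_dates(item, f"{path}[{index}]")


def ungrounded_dates(narrative: str, resource: dict) -> list:
    """Return date elements whose year is not literally present in the narrative."""
    years_in_text = set(re.findall(r"\b(?:19|20)\d{2}\b", narrative))
    return [(path, value) for path, value in iter_dates(resource)
            if value[:4] not in years_in_text]
```

The check deliberately over-flags: a narrative that only says "last month" contains no year, so every encoded date is surfaced for review instead of being silently accepted.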

What does this mean for investors?

AI-powered healthcare interoperability is a massive market opportunity, but due diligence matters. These results show frontier models can handle FHIR with varying reliability - promising, but not production-ready without safeguards.

Look for startups building validation layers and human-in-the-loop systems, not just raw LLM wrappers. The winners will be those who understand both the potential and the current limitations.

What does this mean for healthcare executives?

AI could reduce the cost and complexity of health IT integrations that have plagued the industry for decades. However, rushing to adopt immature solutions carries patient safety and liability risks.

Start with lower-risk pilot projects, establish clear governance frameworks, and ensure your teams understand what AI can and cannot reliably do today.

What does this mean for regulators?

AI-assisted data exchange sits at the intersection of medical device regulation, data privacy, and software certification. Current frameworks may not adequately address LLM-specific risks like hallucination and inconsistent behavior.

Consider requiring transparency about AI involvement in data transformations, and establish standards for validating AI-generated clinical data before it enters the patient record.

What does this mean for researchers?

This experiment demonstrates how a reproducible benchmark for LLM performance on healthcare data transformation could work. Regular runs create a longitudinal dataset tracking how models improve over time.

Key research questions remain open: How do we measure semantic correctness beyond schema validation in a scalable way? What error rates are acceptable for different clinical contexts? How do we build reliable systems from components with varying accuracy?

What does this mean for payers?

Better interoperability means better data for risk assessment, care coordination, and fraud detection. AI could accelerate access to complete patient histories across fragmented provider networks.

AI-generated data should be validated before using it for consequential decisions.

What does this mean for patients?

The promise: your health information could follow you seamlessly between providers, reducing repeated tests and ensuring every doctor knows your full history. The reality: we're not there yet.

As AI becomes more involved in handling your health data, ask your providers how it's being used. You have the right to know, and to request human review of important decisions about your care.

How does it work under the hood?

Healthcare systems are increasingly adopting HL7 FHIR® for interoperability - think of it as the JSON of medical data with a standardized RESTful API. It's an open standard that defines how to structure patient records, medications, appointments, and more.

In this experiment, four frontier LLMs are used - the latest models from OpenAI, Anthropic, Google, and Mistral. We give each LLM a clinical scenario and ask it to convert it to FHIR - simulating a GenAI agent using a FHIR API. The generated FHIR JSON is validated by HAPI FHIR to ensure it conforms to the specification. Then we give the resulting FHIR structure to either the same or a different LLM and ask it to reconstruct the human narrative.
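
How HAPI FHIR is invoked is not stated here. One common option is the standard FHIR `$validate` operation exposed by a HAPI FHIR server, which returns an `OperationOutcome`. The sketch below assumes that setup; the base URL in the usage note is a placeholder and the function names are hypothetical.

```python
import json
import urllib.request


def build_validate_request(base_url: str, bundle: dict) -> urllib.request.Request:
    """Build a POST to [base]/Bundle/$validate with the bundle as the request body."""
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/Bundle/$validate",
        data=json.dumps(bundle).encode("utf-8"),
        headers={
            "Content-Type": "application/fhir+json",
            "Accept": "application/fhir+json",
        },
        method="POST",
    )


def blocking_issues(operation_outcome: dict) -> list:
    """Return diagnostics for issues with severity 'error' or 'fatal'."""
    return [
        issue.get("diagnostics", "")
        for issue in operation_outcome.get("issue", [])
        if issue.get("severity") in ("error", "fatal")
    ]
```

With a server running at a placeholder address such as `http://localhost:8080/fhir`, the request can be sent with `urllib.request.urlopen` and the JSON response passed to `blocking_issues`. Note that a server may report validation failures with an HTTP error status, which `urlopen` raises as `HTTPError`. An empty result only means the bundle is structurally valid, not that its content is correct.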

Finally, three LLM judges independently compare the original scenario with the reconstructed narrative to identify any missing or hallucinated information. Naturally, this is not perfect: the judges are LLMs too, and they can miss or misclassify discrepancies.
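
How the three judges' findings are combined is not described. One simple approach is to take their union and drop entries that are identical after normalizing case and punctuation. The sketch below does that; the `missing` and `hallucinated` keys and the function names are assumptions made for illustration.

```python
import re


def normalize(text: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace."""
    no_punct = re.sub(r"[^a-z0-9\s]", "", text.lower())
    return " ".join(no_punct.split())


def merge_findings(per_judge: list) -> dict:
    """Union the judges' findings, keeping the first wording of each distinct issue."""
    merged = {"missing": [], "hallucinated": []}
    seen = set()
    for findings in per_judge:
        for category in merged:
            for issue in findings.get(category, []):
                key = (category, normalize(issue))
                if key not in seen:
                    seen.add(key)
                    merged[category].append(issue)
    return merged
```

This only removes duplicates that match exactly after normalization. Two judges describing the same problem in different words would still produce two entries, so the issue counts are best read as an upper bound on distinct problems.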

Healthcare can be tricky - some details can be irrelevant (such as the exact day an appointment 3 years ago happened), while others are crucial, such as someone being allergic to a drug.

All scenarios are re-run whenever the models are updated. You decide if the results are good enough for real healthcare decisions.

What about FHIR R5?

The results with FHIR R5 are far worse, presumably due to less training data available for it. This experiment uses FHIR R4, which is the most widely adopted version and what LLMs have seen the most of during training.

What prompts are used?

Encoder (narrative → FHIR):

You are a healthcare interoperability expert specializing in FHIR R4.
Your task is to convert clinical narratives into valid FHIR R4 JSON.

Guidelines:
- Output ONLY valid JSON, no explanations or markdown code blocks
- Use FHIR R4 specification
- Include all relevant resources: Patient, Condition, MedicationRequest, Appointment, ServiceRequest, etc.
- Use proper FHIR references between resources
- Include appropriate coding systems (SNOMED CT, RxNorm, ICD-10) where applicable
- Wrap multiple resources in a Bundle with type "collection"
- Ensure all required fields are present for each resource type
- Use valid lowercase UUIDs for fullUrl (e.g., "urn:uuid:a1b2c3d4-e5f6-7890-abcd-ef1234567890")

Example:

Input: "Maria Garcia, 40 year old female, is allergic to penicillin. She was diagnosed with hypertension and prescribed lisinopril 10mg daily."

Output:
{
  "resourceType": "Bundle",
  "type": "collection",
  "entry": [
    {
      "fullUrl": "urn:uuid:a1b2c3d4-e5f6-7890-abcd-ef1234567890",
      "resource": {
        "resourceType": "Patient",
        "name": [{"family": "Garcia", "given": ["Maria"]}],
        "birthDate": "1985"
      }
    },
    {
      "fullUrl": "urn:uuid:b2c3d4e5-f6a7-8901-bcde-f12345678901",
      "resource": {
        "resourceType": "AllergyIntolerance",
        "patient": {"reference": "urn:uuid:a1b2c3d4-e5f6-7890-abcd-ef1234567890"},
        "code": {
          "coding": [{"system": "http://www.nlm.nih.gov/research/umls/rxnorm", "code": "7984", "display": "penicillin V"}]
        },
        "clinicalStatus": {
          "coding": [{"system": "http://terminology.hl7.org/CodeSystem/allergyintolerance-clinical", "code": "active"}]
        }
      }
    },
    {
      "fullUrl": "urn:uuid:c3d4e5f6-a7b8-9012-cdef-123456789012",
      "resource": {
        "resourceType": "Condition",
        "subject": {"reference": "urn:uuid:a1b2c3d4-e5f6-7890-abcd-ef1234567890"},
        "code": {
          "coding": [{"system": "http://snomed.info/sct", "code": "38341003", "display": "Hypertension"}]
        },
        "clinicalStatus": {
          "coding": [{"system": "http://terminology.hl7.org/CodeSystem/condition-clinical", "code": "active"}]
        }
      }
    },
    {
      "fullUrl": "urn:uuid:d4e5f6a7-b8c9-0123-def0-123456789013",
      "resource": {
        "resourceType": "MedicationRequest",
        "subject": {"reference": "urn:uuid:a1b2c3d4-e5f6-7890-abcd-ef1234567890"},
        "medicationCodeableConcept": {
          "coding": [{"system": "http://www.nlm.nih.gov/research/umls/rxnorm", "code": "314076", "display": "Lisinopril 10 MG Oral Tablet"}]
        },
        "status": "active",
        "intent": "order",
        "dosageInstruction": [{"text": "10mg daily", "timing": {"repeat": {"frequency": 1, "period": 1, "periodUnit": "d"}}}]
      }
    }
  ]
}

Output format: Valid FHIR R4 JSON only. Do not wrap in markdown code blocks.
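
Several of the encoder guidelines above can be checked mechanically before any model reads the bundle. The following is a minimal sketch of such a check (bundle type, lowercase `urn:uuid:` values for `fullUrl`, and references that resolve inside the bundle). The function names are hypothetical, and it is not a substitute for a real FHIR validator.

```python
import re

# Lowercase UUID in urn:uuid form, as the encoder prompt asks for.
UUID_URN = re.compile(
    r"^urn:uuid:[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$"
)


def iter_references(node):
    """Yield every string stored under a 'reference' key in the JSON tree."""
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "reference" and isinstance(value, str):
                yield value
            else:
                yield from iter_references(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_references(item)


def check_bundle(bundle: dict) -> list:
    """Return human-readable problems; an empty list means all checks passed."""
    problems = []
    if bundle.get("resourceType") != "Bundle" or bundle.get("type") != "collection":
        problems.append("not a Bundle of type 'collection'")
    entries = bundle.get("entry", [])
    full_urls = {entry.get("fullUrl") for entry in entries}
    for entry in entries:
        url = entry.get("fullUrl")
        if not isinstance(url, str) or not UUID_URN.match(url):
            problems.append(f"fullUrl is not a lowercase urn:uuid: {url!r}")
    for entry in entries:
        for ref in iter_references(entry.get("resource", {})):
            if ref not in full_urls:
                problems.append(f"unresolved reference: {ref}")
    return problems
```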

Decoder (FHIR → narrative):

You are a clinical documentation specialist.
Your task is to read FHIR JSON and produce a clear, human-readable clinical narrative.

Guidelines:
- Extract ALL clinical information from the FHIR resources
- Write in natural, professional medical language
- Include patient demographics, conditions, medications, appointments, and all relevant details
- Do not add information not present in the FHIR
- Be precise with dosages, frequencies, and medical terminology
- If there are appointments or scheduling information, include those details

Example:

Input:
{
  "resourceType": "Bundle",
  "type": "collection",
  "entry": [
    {
      "fullUrl": "urn:uuid:a1b2c3d4-e5f6-7890-abcd-ef1234567890",
      "resource": {
        "resourceType": "Patient",
        "name": [{"family": "Garcia", "given": ["Maria"]}],
        "birthDate": "1985"
      }
    },
    {
      "fullUrl": "urn:uuid:b2c3d4e5-f6a7-8901-bcde-f12345678901",
      "resource": {
        "resourceType": "AllergyIntolerance",
        "patient": {"reference": "urn:uuid:a1b2c3d4-e5f6-7890-abcd-ef1234567890"},
        "code": {
          "coding": [{"system": "http://www.nlm.nih.gov/research/umls/rxnorm", "code": "7984", "display": "penicillin V"}]
        },
        "clinicalStatus": {
          "coding": [{"system": "http://terminology.hl7.org/CodeSystem/allergyintolerance-clinical", "code": "active"}]
        }
      }
    },
    {
      "fullUrl": "urn:uuid:c3d4e5f6-a7b8-9012-cdef-123456789012",
      "resource": {
        "resourceType": "Condition",
        "subject": {"reference": "urn:uuid:a1b2c3d4-e5f6-7890-abcd-ef1234567890"},
        "code": {
          "coding": [{"system": "http://snomed.info/sct", "code": "38341003", "display": "Hypertension"}]
        },
        "clinicalStatus": {
          "coding": [{"system": "http://terminology.hl7.org/CodeSystem/condition-clinical", "code": "active"}]
        }
      }
    },
    {
      "fullUrl": "urn:uuid:d4e5f6a7-b8c9-0123-def0-123456789013",
      "resource": {
        "resourceType": "MedicationRequest",
        "subject": {"reference": "urn:uuid:a1b2c3d4-e5f6-7890-abcd-ef1234567890"},
        "medicationCodeableConcept": {
          "coding": [{"system": "http://www.nlm.nih.gov/research/umls/rxnorm", "code": "314076", "display": "Lisinopril 10 MG Oral Tablet"}]
        },
        "status": "active",
        "intent": "order",
        "dosageInstruction": [{"text": "10mg daily", "timing": {"repeat": {"frequency": 1, "period": 1, "periodUnit": "d"}}}]
      }
    }
  ]
}

Output: "Maria Garcia, 40 year old female, is allergic to penicillin. She was diagnosed with hypertension and prescribed lisinopril 10mg daily."

Output format: A clear clinical narrative in paragraph form.

Verifier (comparison):

You are a clinical documentation auditor.
Your task is to compare an original clinical scenario with a reconstructed narrative and identify discrepancies.

Categories of discrepancies:
- MISSING: Information present in the original but absent from the reconstruction
- HALLUCINATED: Information in the reconstruction that is incorrect - either changed from the original (e.g., different dosage, different date) or fabricated entirely (details not in the original)

Be precise and specific. For each issue, state the specific detail.
Do not flag minor wording differences - focus on clinically relevant discrepancies.

Output format: Valid JSON only, no markdown code blocks.
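
The verifier prompt asks for JSON but does not show its shape here. As a sketch, assuming an object with `missing` and `hallucinated` string lists (a hypothetical shape), the response could be parsed defensively like this, tolerating a markdown fence in case a model adds one despite the instruction.

```python
import json

FENCE = "`" * 3  # markdown code fence marker


def parse_verifier_output(raw: str) -> dict:
    """Parse a judge response into {'missing': [...], 'hallucinated': [...]}."""
    text = raw.strip()
    if text.startswith(FENCE):
        # Drop the opening fence line (e.g. the fence plus "json") and the closing fence.
        text = text.split("\n", 1)[1] if "\n" in text else ""
        text = text.rsplit(FENCE, 1)[0]
    data = json.loads(text)  # raises ValueError on malformed JSON
    return {
        "missing": [str(item) for item in data.get("missing", [])],
        "hallucinated": [str(item) for item in data.get("hallucinated", [])],
    }
```

Letting malformed JSON raise keeps a broken judge response visible instead of counting it as "no issues".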

Who built this?

This experiment was created by Vadim Peretokin, a consultant with 12 years of experience in health IT, specializing in the intersection of LLMs and healthcare interoperability.

Looking to navigate AI strategy, need FHIR training or implementation help, or want a speaker for your next event? Get in touch.