📞

AI Telephone

LLM Interoperability Demo

How well can frontier LLMs convert unstructured clinical text into structured data - and back again?

LLMs are advancing rapidly - cheaper, faster, bigger context windows. But traditional structured systems are here to stay: they run faster, cost less, are better for the environment, produce deterministic outputs, and have clear accountability.

Exchange standards are the backbone of health data exchange, and LLMs need to work well within that reality. HL7 FHIR® is one such standard, and it is the one used in this experiment.

The experiment: we take a clinical scenario as unstructured text and ask an LLM to encode it as FHIR R4 JSON. Then we give that FHIR to the same or a different LLM to reconstruct the original narrative. Three LLM judges compare the output against the original to measure what survived the roundtrip.

How does it work?

1️⃣

Unstructured input

A clinical narrative with patient details, conditions, and medications

2️⃣

Encode to FHIR

LLM converts free text to structured FHIR R4 JSON, validated by HAPI FHIR

3️⃣

Decode back to text

Same or different LLM reconstructs the narrative from the FHIR bundle

4️⃣

Fidelity check

Three LLM judges identify missing or hallucinated information
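The four steps above can be sketched as a small pipeline. This is only an illustration: the callables `encode`, `validate`, `decode`, and `judges` are hypothetical stand-ins for the LLM calls and the HAPI FHIR check, and continuing after a failed validation is an assumption, not something the experiment is known to do.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical signatures: each callable stands in for an LLM call or a validator.
Encoder = Callable[[str], str]            # narrative -> FHIR R4 JSON text
Validator = Callable[[str], bool]         # FHIR JSON text -> passed validation?
Decoder = Callable[[str], str]            # FHIR JSON text -> narrative
Judge = Callable[[str, str], List[str]]   # (original, reconstructed) -> issues


@dataclass
class RoundtripResult:
    fhir_json: str
    valid: bool
    reconstructed: str
    issues_per_judge: List[List[str]]


def run_roundtrip(
    narrative: str,
    encode: Encoder,
    validate: Validator,
    decode: Decoder,
    judges: List[Judge],
) -> RoundtripResult:
    """Run one scenario through encode -> validate -> decode -> judge."""
    fhir_json = encode(narrative)            # step 2: free text to FHIR
    valid = validate(fhir_json)              # step 2: validation result is recorded
    reconstructed = decode(fhir_json)        # step 3: FHIR back to text
    # Step 4: every judge compares the original with the reconstruction.
    issues = [judge(narrative, reconstructed) for judge in judges]
    return RoundtripResult(fhir_json, valid, reconstructed, issues)
```

Passing the encoder and decoder as separate callables makes it easy to pair the same model with itself or with a different one.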

Common questions

What does this mean for clinicians?

AI is already in clinical practice: scribes that document visits and diagnostic aids such as CNNs in radiology are a reality. LLM-based interoperability could be next, helping bridge gaps between disconnected health systems and reducing time spent gathering patient history. LLMs are improving but still make occasional errors.

For now, treat AI-generated summaries as a starting point, not a source of truth. Always verify critical information directly with patients, especially allergies, current medications, and recent procedures.

What does this mean for policymakers?

AI-driven interoperability could accelerate the connectivity that healthcare systems have been striving toward for decades. The results here show it works best with human oversight for now.

Consider frameworks that allow AI to assist with data exchange while maintaining human oversight, especially for critical information like allergies and medications. Standards bodies may need to develop guidelines for AI-assisted FHIR implementations.

What does this mean for IT strategists?

AI-assisted interoperability represents a potential paradigm shift from point-to-point integrations to more flexible, adaptive systems. This experiment shows current frontier models can handle FHIR with varying degrees of accuracy.

When planning your roadmap, consider hybrid approaches: AI for initial data mapping and transformation, with validation layers and human review for high-risk data. Keep monitoring these models, since their capabilities are improving rapidly.

What does this mean for developers?

LLMs show great value and potential for healthcare integrations, but beware: the FHIR JSON they generate might pass schema validation while having semantic issues - correct format, wrong meaning. Multiple validation layers beyond schema checks are essential.
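As one example of a layer beyond schema checks, the sketch below looks at where `subject` and `patient` references actually point. It assumes the bundle uses `urn:uuid:` fullUrls as in the encoder prompt, and it assumes every such reference should resolve to a Patient, which holds for single-patient scenarios but is not a general FHIR rule. The function name `check_patient_links` is hypothetical. It will not catch a wrong code or dosage; that needs further layers.

```python
from typing import Any, Dict, List


def check_patient_links(bundle: Dict[str, Any]) -> List[str]:
    """Return problems with subject/patient references inside a Bundle."""
    entries = bundle.get("entry", [])
    # Index resources by fullUrl so references can be resolved locally.
    by_url = {e.get("fullUrl"): e.get("resource", {}) for e in entries}
    problems: List[str] = []
    for entry in entries:
        resource = entry.get("resource", {})
        kind = resource.get("resourceType")
        for field in ("subject", "patient"):
            link = resource.get(field)
            if not isinstance(link, dict) or "reference" not in link:
                continue
            target = by_url.get(link["reference"])
            if target is None:
                # Well-formed reference that points at nothing in the bundle.
                problems.append(
                    f"{kind}.{field} -> {link['reference']} is not in the bundle"
                )
            elif target.get("resourceType") != "Patient":
                # Resolvable reference, but to the wrong kind of resource.
                problems.append(
                    f"{kind}.{field} -> {target.get('resourceType')}, expected Patient"
                )
    return problems
```

A Condition whose `subject` points at another Condition could still be well-formed JSON, which is the "correct format, wrong meaning" case this check targets.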

Unless the risk is sufficiently small, human oversight is a must. That said, this field is rapidly evolving - just two years ago, we didn't think LLMs could be as capable as they are now. This website is a real-time tracker of their potential in healthcare interoperability.

What does this mean for investors?

AI-powered healthcare interoperability is a massive market opportunity, but due diligence matters. These results show frontier models can handle FHIR with varying reliability - promising, but not production-ready without safeguards.

Look for startups building validation layers and human-in-the-loop systems, not just raw LLM wrappers. The winners will be those who understand both the potential and the current limitations.

What does this mean for healthcare executives?

AI could reduce the cost and complexity of health IT integrations that have plagued the industry for decades. However, rushing to adopt immature solutions carries patient safety and liability risks.

Start with lower-risk pilot projects, establish clear governance frameworks, and ensure your teams understand what AI can and cannot reliably do today.

What does this mean for regulators?

AI-assisted data exchange sits at the intersection of medical device regulation, data privacy, and software certification. Current frameworks may not adequately address LLM-specific risks like hallucination and inconsistent behavior.

Consider requiring transparency about AI involvement in data transformations, and establish standards for validating AI-generated clinical data before it enters the patient record.

What does this mean for researchers?

This experiment demonstrates how a reproducible benchmark for LLM performance on healthcare data transformation could work. Regular runs create a longitudinal dataset tracking how models improve over time.

Key research questions remain open: How do we measure semantic correctness beyond schema validation in a scalable way? What error rates are acceptable for different clinical contexts? How do we build reliable systems from components with varying accuracy?

What does this mean for payers?

Better interoperability means better data for risk assessment, care coordination, and fraud detection. AI could accelerate access to complete patient histories across fragmented provider networks.

AI-generated data should be validated before using it for consequential decisions.

What does this mean for patients?

The promise: your health information could follow you seamlessly between providers, reducing repeated tests and ensuring every doctor knows your full history. The reality: we're not there yet.

As AI becomes more involved in handling your health data, ask your providers how it's being used. You have the right to know, and to request human review of important decisions about your care.

How does it work under the hood?

Healthcare systems are increasingly adopting HL7 FHIR® for interoperability - think of it as the JSON of medical data with a standardized RESTful API. It's an open standard that defines how to structure patient records, medications, appointments, and more.

In this experiment, four frontier LLMs are used - the latest models from OpenAI, Anthropic, Google, and Mistral. We give each LLM a clinical scenario and ask it to convert the scenario to FHIR, simulating a GenAI agent using a FHIR API. The generated FHIR JSON is validated by HAPI FHIR to ensure it conforms to the specification. Then we give the resulting FHIR bundle to either the same or a different LLM and ask it to reconstruct the human narrative.
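One way such a validation step could look is a call to the standard FHIR `$validate` operation on a HAPI FHIR server. This is a sketch under assumptions: the text does not say whether the experiment uses a server or the validator library, the base URL in the usage note is hypothetical, and the helper names are invented for illustration.

```python
import json
import urllib.request
from typing import Any, Dict, List


def build_validate_request(
    base_url: str, resource: Dict[str, Any]
) -> urllib.request.Request:
    """Build a POST [base]/[type]/$validate request (not sent here)."""
    url = f"{base_url.rstrip('/')}/{resource['resourceType']}/$validate"
    return urllib.request.Request(
        url,
        data=json.dumps(resource).encode("utf-8"),
        headers={
            "Content-Type": "application/fhir+json",
            "Accept": "application/fhir+json",
        },
        method="POST",
    )


def blocking_issues(operation_outcome: Dict[str, Any]) -> List[str]:
    """Collect diagnostics for issues with severity 'error' or 'fatal'."""
    return [
        issue.get("diagnostics", "")
        for issue in operation_outcome.get("issue", [])
        if issue.get("severity") in ("error", "fatal")
    ]
```

Sending the request would be `urllib.request.urlopen(build_validate_request("http://localhost:8080/fhir", bundle))`, with the response body parsed as an OperationOutcome and passed to `blocking_issues`.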

Finally, three LLM judges independently compare the original scenario with the reconstructed narrative to identify any missing or hallucinated information. Naturally, this is not perfect: the judges are LLMs too and can make mistakes of their own.
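How the three verdicts are combined is not described here, so the following is purely an illustration of one option: a majority vote. The verdict shape, with `missing` and `hallucinated` lists, is an assumption based on the two categories in the verifier prompt, and `summarize_verdicts` is a hypothetical name.

```python
from typing import Any, Dict, List


def summarize_verdicts(verdicts: List[Dict[str, List[str]]]) -> Dict[str, Any]:
    """Count how many judges reported at least one discrepancy."""
    flagged = sum(
        1 for v in verdicts if v.get("missing") or v.get("hallucinated")
    )
    return {
        "judges": len(verdicts),
        "judges_flagging": flagged,
        # True when more than half of the judges saw information loss.
        "majority_flags_loss": flagged * 2 > len(verdicts),
    }
```

With three judges, a single dissenting judge cannot flip the outcome either way, which softens the effect of one judge's mistake.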

Healthcare can be tricky - some details can be irrelevant (such as the exact day of an appointment three years ago), while others are crucial, such as someone being allergic to a drug.

All scenarios are re-run whenever the models are updated. You decide if the results are good enough for real healthcare decisions.

What about FHIR R5?

The results with FHIR R5 are far worse, presumably because less training data is available for it. This experiment uses FHIR R4, which is the most widely adopted version and what LLMs have seen the most of during training.

What prompts are used?

Encoder (narrative → FHIR):

You are a healthcare interoperability expert specializing in FHIR R4.
Your task is to convert clinical narratives into valid FHIR R4 JSON.

Guidelines:
- Output ONLY valid JSON, no explanations or markdown code blocks
- Use FHIR R4 specification
- Include all relevant resources: Patient, Condition, MedicationRequest, Appointment, ServiceRequest, etc.
- Use proper FHIR references between resources
- Include appropriate coding systems (SNOMED CT, RxNorm, ICD-10) where applicable
- Wrap multiple resources in a Bundle with type "collection"
- Ensure all required fields are present for each resource type
- Use valid lowercase UUIDs for fullUrl (e.g., "urn:uuid:a1b2c3d4-e5f6-7890-abcd-ef1234567890")

Example:

Input: "Maria Garcia, 40 year old female, is allergic to penicillin. She was diagnosed with hypertension and prescribed lisinopril 10mg daily."

Output:
{
  "resourceType": "Bundle",
  "type": "collection",
  "entry": [
    {
      "fullUrl": "urn:uuid:a1b2c3d4-e5f6-7890-abcd-ef1234567890",
      "resource": {
        "resourceType": "Patient",
        "name": [{"family": "Garcia", "given": ["Maria"]}],
        "birthDate": "1985"
      }
    },
    {
      "fullUrl": "urn:uuid:b2c3d4e5-f6a7-8901-bcde-f12345678901",
      "resource": {
        "resourceType": "AllergyIntolerance",
        "patient": {"reference": "urn:uuid:a1b2c3d4-e5f6-7890-abcd-ef1234567890"},
        "code": {
          "coding": [{"system": "http://www.nlm.nih.gov/research/umls/rxnorm", "code": "7984", "display": "penicillin V"}]
        },
        "clinicalStatus": {
          "coding": [{"system": "http://terminology.hl7.org/CodeSystem/allergyintolerance-clinical", "code": "active"}]
        }
      }
    },
    {
      "fullUrl": "urn:uuid:c3d4e5f6-a7b8-9012-cdef-123456789012",
      "resource": {
        "resourceType": "Condition",
        "subject": {"reference": "urn:uuid:a1b2c3d4-e5f6-7890-abcd-ef1234567890"},
        "code": {
          "coding": [{"system": "http://snomed.info/sct", "code": "38341003", "display": "Hypertension"}]
        },
        "clinicalStatus": {
          "coding": [{"system": "http://terminology.hl7.org/CodeSystem/condition-clinical", "code": "active"}]
        }
      }
    },
    {
      "fullUrl": "urn:uuid:d4e5f6a7-b8c9-0123-def0-123456789013",
      "resource": {
        "resourceType": "MedicationRequest",
        "subject": {"reference": "urn:uuid:a1b2c3d4-e5f6-7890-abcd-ef1234567890"},
        "medicationCodeableConcept": {
          "coding": [{"system": "http://www.nlm.nih.gov/research/umls/rxnorm", "code": "314076", "display": "Lisinopril 10 MG Oral Tablet"}]
        },
        "status": "active",
        "intent": "order",
        "dosageInstruction": [{"text": "10mg daily", "timing": {"repeat": {"frequency": 1, "period": 1, "periodUnit": "d"}}}]
      }
    }
  ]
}

Output format: Valid FHIR R4 JSON only. Do not wrap in markdown code blocks.
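Several of the encoder guidelines can be checked mechanically before the output ever reaches a FHIR validator. The sketch below is an illustration rather than the experiment's own code: it assumes the output is always a Bundle, and the name `check_encoder_output` is hypothetical.

```python
import json
import re
from typing import List

# Lowercase urn:uuid form, as requested in the encoder guidelines.
URN_UUID = re.compile(
    r"^urn:uuid:[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$"
)


def check_encoder_output(raw: str) -> List[str]:
    """Return guideline violations found in the raw encoder output."""
    try:
        bundle = json.loads(raw)
    except json.JSONDecodeError as exc:
        # Explanations or markdown fences around the JSON end up here.
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(bundle, dict):
        return ["top-level JSON value is not an object"]
    problems: List[str] = []
    if bundle.get("resourceType") != "Bundle":
        problems.append("resourceType is not Bundle")
    if bundle.get("type") != "collection":
        problems.append('Bundle.type is not "collection"')
    for index, entry in enumerate(bundle.get("entry") or []):
        full_url = entry.get("fullUrl", "") if isinstance(entry, dict) else ""
        if not URN_UUID.match(full_url):
            problems.append(
                f"entry[{index}].fullUrl is not a lowercase urn:uuid: {full_url!r}"
            )
    return problems
```

An empty list means only that these surface rules hold; it says nothing about whether the codes and references carry the right meaning.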

Decoder (FHIR → narrative):

You are a clinical documentation specialist.
Your task is to read FHIR JSON and produce a clear, human-readable clinical narrative.

Guidelines:
- Extract ALL clinical information from the FHIR resources
- Write in natural, professional medical language
- Include patient demographics, conditions, medications, appointments, and all relevant details
- Do not add information not present in the FHIR
- Be precise with dosages, frequencies, and medical terminology
- If there are appointments or scheduling information, include those details

Example:

Input:
{
  "resourceType": "Bundle",
  "type": "collection",
  "entry": [
    {
      "fullUrl": "urn:uuid:a1b2c3d4-e5f6-7890-abcd-ef1234567890",
      "resource": {
        "resourceType": "Patient",
        "name": [{"family": "Garcia", "given": ["Maria"]}],
        "birthDate": "1985"
      }
    },
    {
      "fullUrl": "urn:uuid:b2c3d4e5-f6a7-8901-bcde-f12345678901",
      "resource": {
        "resourceType": "AllergyIntolerance",
        "patient": {"reference": "urn:uuid:a1b2c3d4-e5f6-7890-abcd-ef1234567890"},
        "code": {
          "coding": [{"system": "http://www.nlm.nih.gov/research/umls/rxnorm", "code": "7984", "display": "penicillin V"}]
        },
        "clinicalStatus": {
          "coding": [{"system": "http://terminology.hl7.org/CodeSystem/allergyintolerance-clinical", "code": "active"}]
        }
      }
    },
    {
      "fullUrl": "urn:uuid:c3d4e5f6-a7b8-9012-cdef-123456789012",
      "resource": {
        "resourceType": "Condition",
        "subject": {"reference": "urn:uuid:a1b2c3d4-e5f6-7890-abcd-ef1234567890"},
        "code": {
          "coding": [{"system": "http://snomed.info/sct", "code": "38341003", "display": "Hypertension"}]
        },
        "clinicalStatus": {
          "coding": [{"system": "http://terminology.hl7.org/CodeSystem/condition-clinical", "code": "active"}]
        }
      }
    },
    {
      "fullUrl": "urn:uuid:d4e5f6a7-b8c9-0123-def0-123456789013",
      "resource": {
        "resourceType": "MedicationRequest",
        "subject": {"reference": "urn:uuid:a1b2c3d4-e5f6-7890-abcd-ef1234567890"},
        "medicationCodeableConcept": {
          "coding": [{"system": "http://www.nlm.nih.gov/research/umls/rxnorm", "code": "314076", "display": "Lisinopril 10 MG Oral Tablet"}]
        },
        "status": "active",
        "intent": "order",
        "dosageInstruction": [{"text": "10mg daily", "timing": {"repeat": {"frequency": 1, "period": 1, "periodUnit": "d"}}}]
      }
    }
  ]
}

Output: "Maria Garcia, 40 year old female, is allergic to penicillin. She was diagnosed with hypertension and prescribed lisinopril 10mg daily."

Output format: A clear clinical narrative in paragraph form.

Verifier (comparison):

You are a clinical documentation auditor.
Your task is to compare an original clinical scenario with a reconstructed narrative and identify discrepancies.

Categories of discrepancies:
- MISSING: Information present in the original but absent from the reconstruction
- HALLUCINATED: Information in the reconstruction that is incorrect - either changed from the original (e.g., different dosage, different date) or fabricated entirely (details not in the original)

Be precise and specific. For each issue, state the specific detail.
Do not flag minor wording differences - focus on clinically relevant discrepancies.

Output format: Valid JSON only, no markdown code blocks.

Who built this?

This experiment was created by Vadim Peretokin, a consultant with 12 years of experience in health IT, specializing in the intersection of LLMs and healthcare interoperability.

Looking to navigate AI strategy, need FHIR training or implementation help, or want a speaker for your next event? Get in touch.