We're excited to announce the early preview release of @flexpa/llm-fhir-eval, an open-source evaluation framework designed to benchmark the performance of Large Language Models (LLMs) on FHIR-specific tasks. This framework aims to establish open benchmarks that make research on FHIR and LLM interactions reproducible.
Recent work, such as FHIR-GPT (Yikuan Li et al.) and HealthSageAI's Note-to-FHIR Llama 2 fine-tune, demonstrates the growing need for reproducible evaluation benchmarks in the FHIR and LLM space. @flexpa/llm-fhir-eval addresses this need by providing a standardized way to measure model performance and behavior.
Overview
We've started by defining a set of evaluation tasks for the benchmark, included in this preview release:
- FHIR Resource Generation & Validation: Evaluate the ability of LLMs to generate and validate complex FHIR resources.
- Summarization: Assess the proficiency of LLMs in summarizing notes into FHIR resources.
- FHIRPath Evaluation: Test model capabilities in evaluating complex FHIRPath expressions. This is an exciting area of research for us because it was unexpected; a sketch of this task follows the list.
- Structured & Unstructured Data Extraction: Extract specific information from both structured FHIR resources and unstructured clinical notes. This is a well-trodden area of research.
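To make the task definitions concrete, here is a minimal sketch of what a FHIRPath evaluation test case might look like: a resource, an expression, and a ground-truth result that a model's answer is scored against. The `FhirPathCase` interface and `scoreAnswer` helper are hypothetical illustrations, not the framework's actual API.

```typescript
// Hypothetical shape of a FHIRPath evaluation test case -- an illustration
// of the task, not the actual @flexpa/llm-fhir-eval API.
interface FhirPathCase {
  resource: Record<string, unknown>; // FHIR resource shown to the model
  expression: string;                // FHIRPath expression to evaluate
  expected: unknown[];               // ground-truth result (FHIRPath returns a collection)
}

const patientCase: FhirPathCase = {
  resource: {
    resourceType: "Patient",
    name: [{ given: ["Ada"], family: "Lovelace" }],
    birthDate: "1815-12-10",
  },
  expression: "Patient.name.given.first()",
  expected: ["Ada"],
};

// The model's parsed answer is compared against the ground truth.
function scoreAnswer(testCase: FhirPathCase, modelAnswer: unknown[]): boolean {
  return JSON.stringify(modelAnswer) === JSON.stringify(testCase.expected);
}

console.log(scoreAnswer(patientCase, ["Ada"])); // true
```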
The framework includes implementations of existing research benchmarks, such as the prompt from the FHIR-GPT paper, providing a foundation for comparative analysis and reproducibility.
Supported Models
The initial release supports evaluating the following models; a minimal sketch of calling them side by side follows the list:
- Anthropic Claude 3.5 Sonnet
- OpenAI GPT-4o
- OpenAI GPT-4o Mini
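Under the hood, running the same prompt against multiple providers comes down to calling each vendor's SDK and normalizing the text output. The sketch below uses the official `openai` and `@anthropic-ai/sdk` packages; the `askBothModels` helper is illustrative and not the framework's internal harness, and model ID strings may change over time.

```typescript
// A minimal sketch of sending one prompt to two of the supported models.
// Not the framework's internal harness; model IDs are assumptions.
import OpenAI from "openai";
import Anthropic from "@anthropic-ai/sdk";

const openai = new OpenAI();       // reads OPENAI_API_KEY from the environment
const anthropic = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment

async function askBothModels(prompt: string): Promise<Record<string, string>> {
  const gpt = await openai.chat.completions.create({
    model: "gpt-4o",
    messages: [{ role: "user", content: prompt }],
  });

  const claude = await anthropic.messages.create({
    model: "claude-3-5-sonnet-latest",
    max_tokens: 1024,
    messages: [{ role: "user", content: prompt }],
  });

  return {
    "gpt-4o": gpt.choices[0].message.content ?? "",
    "claude-3-5-sonnet":
      claude.content[0].type === "text" ? claude.content[0].text : "",
  };
}

askBothModels("Generate a minimal FHIR R4 Patient resource as JSON.").then(
  (answers) => console.log(answers),
);
```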
Community Involvement
Your input is crucial to the development of this framework. We welcome discussion of this preview release on FHIR Chat, in particular:
- Feedback on the evaluation tasks and methodologies
- Suggestions for additional benchmarks
- Contributions to test cases and documentation
- Sharing of evaluation results and experiences
What's Next?
We're focusing on:
- Refining the benchmark based on community feedback
- Implementing prior art and releasing the four evaluation tasks described above
- Designing and obtaining appropriate test cases for the tasks