
LLM FHIR Eval Preview

An open-source evaluation framework designed to benchmark the performance of Large Language Models (LLMs) on FHIR-specific tasks, making research on FHIR and LLM interactions reproducible.

November 25, 2024 · Joshua Kelly

We're excited to announce the early preview release of @flexpa/llm-fhir-eval, an open-source evaluation framework designed to benchmark the performance of Large Language Models (LLMs) on FHIR-specific tasks. This framework aims to establish open benchmarks that make research on FHIR and LLM interactions reproducible.

Recent work, such as FHIR-GPT (Yikuan Li et al.) and HealthSageAI's Note-to-FHIR Llama 2 fine-tune, demonstrates the growing need for reproducible evaluation benchmarks in the FHIR and LLM space. @flexpa/llm-fhir-eval addresses this need by providing a standardized way to measure model performance and behavior.

Overview

We've started by defining a set of tasks to evaluate for the benchmark, included in this preview release:

  1. FHIR Resource Generation & Validation: Evaluate the ability of LLMs to generate and validate complex FHIR resources (see the sketch after this list).
  2. Summarization: Assess the proficiency of LLMs in summarizing notes into FHIR resources.
  3. FHIRPath Evaluation: Test model capabilities in evaluating complex FHIRPath expressions (a scoring sketch appears below). This is an exciting research area for us precisely because it was unexpected.
  4. Structured & Unstructured Data Extraction: Extract specific information from both structured FHIR resources and unstructured clinical notes. This is a well-trodden area of research.
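
To make the first task concrete, here is a minimal sketch of how a generation-and-validation check could be scored: prompt a model for a FHIR R4 Patient resource, then submit the output to a FHIR server's standard $validate operation and look for error-level issues in the returned OperationOutcome. The prompt, the scoring rule, and the FHIR_BASE_URL server are illustrative assumptions, not the framework's actual implementation.

```typescript
import OpenAI from "openai";

// Illustrative defaults; not part of @flexpa/llm-fhir-eval.
const FHIR_BASE_URL = process.env.FHIR_BASE_URL ?? "http://localhost:8080/fhir"; // e.g. a local HAPI FHIR server
const client = new OpenAI(); // reads OPENAI_API_KEY from the environment

// Ask the model to emit a single FHIR R4 Patient resource as raw JSON.
async function generatePatient(): Promise<unknown> {
  const completion = await client.chat.completions.create({
    model: "gpt-4o",
    messages: [
      { role: "system", content: "You are a FHIR R4 expert. Respond with a single JSON resource and nothing else." },
      { role: "user", content: "Generate a Patient resource for Jane Doe, born 1980-02-14, with a home phone number." },
    ],
  });
  // A real harness needs more robust output parsing (e.g. stripping code fences).
  return JSON.parse(completion.choices[0].message.content ?? "{}");
}

// Submit the resource to the standard FHIR $validate operation and inspect the OperationOutcome.
async function isValid(resource: unknown): Promise<boolean> {
  const res = await fetch(`${FHIR_BASE_URL}/Patient/$validate`, {
    method: "POST",
    headers: { "Content-Type": "application/fhir+json" },
    body: JSON.stringify(resource),
  });
  const outcome = (await res.json()) as { issue?: { severity: string }[] };
  // Count the generation as passing only if there are no error or fatal issues.
  return !(outcome.issue ?? []).some((i) => i.severity === "error" || i.severity === "fatal");
}

async function main() {
  const resource = await generatePatient();
  console.log(`valid: ${await isValid(resource)}`);
}

main();
```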

The framework includes implementations of existing research benchmarks, such as the FHIR-GPT paper prompt, providing a foundation for comparative analysis and reproducibility.
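
In the same spirit, the FHIRPath task lends itself to scoring against a reference implementation such as the open-source fhirpath JavaScript library: evaluate the expression with the library and compare the model's answer to the result. The resource, expression, and hard-coded model answer below are illustrative assumptions; a real run would obtain the answer from one of the models listed in the next section.

```typescript
import fhirpath from "fhirpath";

// A toy resource and expression used as the evaluation context (illustrative).
const patient = {
  resourceType: "Patient",
  name: [{ given: ["Jane"], family: "Doe" }],
  birthDate: "1980-02-14",
};
const expression = "Patient.name.given.first()";

// Ground truth from the reference FHIRPath implementation.
const expected = fhirpath.evaluate(patient, expression); // ["Jane"]

// In a real harness this would be parsed from an LLM response; hard-coded here
// to show only the comparison step.
const modelAnswer = ["Jane"];

const pass = JSON.stringify(modelAnswer) === JSON.stringify(expected);
console.log(`expected=${JSON.stringify(expected)} model=${JSON.stringify(modelAnswer)} pass=${pass}`);
```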

Supported Models

The initial release supports evaluation of:

  • Anthropic Claude 3.5 Sonnet
  • OpenAI GPT-4o
  • OpenAI GPT-4o Mini

Community Involvement

Your input is crucial to the development of this framework. We welcome discussion of this preview release on FHIR Chat, in particular:

  • Feedback on the evaluation tasks and methodologies
  • Suggestions for additional benchmarks
  • Contributions to test cases and documentation
  • Sharing of evaluation results and experiences

What's Next?

We're focusing on:

  • Refining the benchmark based on community feedback
  • Implementing prior art and releasing the four evaluation tasks for the benchmark
  • Designing and obtaining appropriate test cases for the tasks
