We're excited to announce the early preview release of @flexpa/llm-fhir-eval, an open-source evaluation framework designed to benchmark the performance of Large Language Models (LLMs) on FHIR-specific tasks. This framework aims to establish open benchmarks that make research on FHIR and LLM interactions reproducible.
Recent work, such as FHIR-GPT (Yikuan Li et al.) and HealthSageAI's Note-to-FHIR Llama 2 fine-tune, demonstrates the growing need for reproducible evaluation benchmarks in the FHIR and LLM space. @flexpa/llm-fhir-eval addresses this need by providing a standardized way to measure model performance and behavior.
Overview
We've started by defining a set of evaluation tasks for the benchmark, included in this preview release:
- FHIR Resource Generation & Validation: Evaluate the ability of LLMs to generate and validate complex FHIR resources.
- Summarization: Assess the proficiency of LLMs in summarizing notes into FHIR resources.
- FHIRPath Evaluation: Test model capabilities in evaluating complex FHIRPath expressions. This is an exciting area of research for us because it was unexpected; a sketch of this task follows the list.
- Structured & Unstructured Data Extraction: Extract specific information from both structured FHIR resources and unstructured clinical notes. This is a well-trodden area of research.
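To make the task definitions concrete, here is a minimal sketch of what a FHIRPath evaluation test case might look like: a resource, an expression, and a ground-truth result that a model's answer is scored against. The `FhirPathCase` interface and `scoreAnswer` helper are hypothetical illustrations, not the framework's actual API.

```typescript
// Hypothetical shape of a FHIRPath evaluation test case -- an illustration
// of the task, not the actual @flexpa/llm-fhir-eval API.
interface FhirPathCase {
  resource: Record<string, unknown>; // FHIR resource shown to the model
  expression: string;                // FHIRPath expression to evaluate
  expected: unknown[];               // ground-truth result (FHIRPath returns a collection)
}

const patientCase: FhirPathCase = {
  resource: {
    resourceType: "Patient",
    name: [{ given: ["Ada"], family: "Lovelace" }],
    birthDate: "1815-12-10",
  },
  expression: "Patient.name.given.first()",
  expected: ["Ada"],
};

// The model's parsed answer is compared against the ground truth.
function scoreAnswer(testCase: FhirPathCase, modelAnswer: unknown[]): boolean {
  return JSON.stringify(modelAnswer) === JSON.stringify(testCase.expected);
}

console.log(scoreAnswer(patientCase, ["Ada"])); // true
```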
The framework includes implementations of existing research benchmarks, such as the prompt from the FHIR-GPT paper, providing a foundation for comparative analysis and reproducibility.
Supported Models
The initial release supports evaluating the following models; a minimal sketch of calling them side by side follows the list:
- Anthropic Claude 3.5 Sonnet
- OpenAI GPT-4o
- OpenAI GPT-4o Mini
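Under the hood, running the same prompt against multiple providers comes down to calling each vendor's SDK and normalizing the text output. The sketch below uses the official `openai` and `@anthropic-ai/sdk` packages; the `askBothModels` helper is illustrative and not the framework's internal harness, and model ID strings may change over time.

```typescript
// A minimal sketch of sending one prompt to two of the supported models.
// Not the framework's internal harness; model IDs are assumptions.
import OpenAI from "openai";
import Anthropic from "@anthropic-ai/sdk";

const openai = new OpenAI();       // reads OPENAI_API_KEY from the environment
const anthropic = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment

async function askBothModels(prompt: string): Promise<Record<string, string>> {
  const gpt = await openai.chat.completions.create({
    model: "gpt-4o",
    messages: [{ role: "user", content: prompt }],
  });

  const claude = await anthropic.messages.create({
    model: "claude-3-5-sonnet-latest",
    max_tokens: 1024,
    messages: [{ role: "user", content: prompt }],
  });

  return {
    "gpt-4o": gpt.choices[0].message.content ?? "",
    "claude-3-5-sonnet":
      claude.content[0].type === "text" ? claude.content[0].text : "",
  };
}

askBothModels("Generate a minimal FHIR R4 Patient resource as JSON.").then(
  (answers) => console.log(answers),
);
```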
Community Involvement
Your input is crucial to the development of this framework. We welcome discussion of this preview release on FHIR Chat, in particular:
- Feedback on the evaluation tasks and methodologies
- Suggestions for additional benchmarks
- Contributions to test cases and documentation
- Sharing of evaluation results and experiences
What's Next?
We're focusing on:
- Refining the benchmark based on community feedback
- Implementing prior art and releasing the four evaluation tasks described above
- Designing and obtaining appropriate test cases for the tasks