July 22, 2024

Evaluating Model-Level Performance in GenAI Applications

Introduction

As life sciences organizations increasingly explore the potential of Generative AI to transform their workflows, effectively evaluating the performance of Generative AI applications becomes critical. Many aspects of these applications can be assessed, but one that is often only assessed in a highly manual fashion today is model choice. As organizations work to identify which Generative AI models they want to leverage, they often find themselves switching between models by hand or not optimizing at the model level at all.

While this is likely sufficient in the short term, as organizations extend Generative AI capabilities to new use cases and larger sets of users, at least some level of model optimization is likely to come into play. Given the complexity and “black-box” nature of many models, it can be difficult to determine the best one to use other than by trial and error. Even then, the probabilistic nature of LLMs makes it hard to predict how a given use of a Generative AI model will scale in terms of reliability and accuracy.

Understanding Uncertainty in LLMs

One dimension of LLM selection and performance that is becoming increasingly important is model “uncertainty.” In the context of this blog, uncertainty refers to the degree of confidence in the outputs generated by an LLM. High uncertainty indicates lower confidence in the accuracy or relevance of the output, a crucial signal in healthcare, where decisions based on uncertain information can have significant consequences. This is different from traditional LLM evaluation, which often focuses solely on accuracy, i.e., whether the model's answers are correct or not. Here, we emphasize understanding and quantifying how uncertain the model is about its answers. This distinction matters because the overall performance of an LLM-based system can be influenced by various components, such as the retrieval-augmented generation (RAG) pipeline or other integrated tools, and not just the LLM itself. Accurate uncertainty estimation therefore helps identify when a model's output can be trusted and when human intervention might be needed.

This blog post conceptually outlines three approaches to evaluating Generative AI model uncertainty and suggests tailored strategies for implementing Large Language Models (LLMs) in healthcare, based on insights from recent research.

Approach 1: Shifting Attention to Relevance

Shifting Attention to Relevance (SAR) [1] aims to improve uncertainty quantification in LLMs. This approach addresses the issue of "generative inequality," where not all tokens in a generated response equally represent the underlying meaning. Traditional methods of uncertainty quantification often treat all tokens equally, which can lead to misleading estimates. The SAR method focuses on identifying and emphasizing the most semantically relevant tokens and sentences in a model's output. By shifting attention to these key components, SAR can provide more accurate and reliable uncertainty estimates.

Consider the question: "What is the ratio of the mass of an object to its volume?"

Ground Truth Answer: Density

Traditional Uncertainty Quantification Method (Predictive Entropy)

LLM generates: "density of an object"

The per-token entropy can be high even for semantically irrelevant tokens like "of" and "an," leading to a high overall uncertainty estimate: 0.238 + 6.528 + 0.966 + 0.008 = 7.74 (high uncertainty).
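To make the baseline concrete, here is a minimal sketch of how token-level predictive entropy might be computed from per-token generation probabilities. The probabilities are illustrative values chosen to roughly reproduce the figures above; they are not taken from the paper.

```python
import numpy as np

# Hypothetical per-token generation probabilities for the answer
# "density of an object". Illustrative values only.
tokens = ["density", "of", "an", "object"]
token_probs = np.array([0.788, 0.00146, 0.381, 0.992])

# Predictive entropy treats every token equally: sum the negative
# log-probability (surprisal) of each generated token.
surprisal = -np.log(token_probs)
predictive_entropy = surprisal.sum()

for tok, s in zip(tokens, surprisal):
    print(f"{tok:>8s}: {s:.3f}")
print(f"Predictive entropy (uniform weighting): {predictive_entropy:.3f}")
```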

SAR Method

SAR calculates the relevance of each token and reweights the entropy calculation: relevant tokens like "density" receive higher weight, and less relevant tokens like "of" receive lower weight.

The adjusted entropy might be: 0.757 + 0.057 + 0.097 + 0.088 = 0.999 (low uncertainty).
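Continuing the sketch above, the snippet below illustrates the reweighting step. In the paper, relevance is derived from how much removing a token changes the semantic similarity of the sentence (via a sentence-similarity model); here the relevance scores are hand-set assumptions, and the normalization is simplified, so the numbers will not match the example exactly.

```python
import numpy as np

tokens = ["density", "of", "an", "object"]
surprisal = np.array([0.238, 6.528, 0.966, 0.008])
relevance = np.array([0.90, 0.03, 0.05, 0.40])   # assumed, not computed

# Normalize relevance so the weights sum to 1, then reweight the surprisal
# so the uncertainty estimate is dominated by meaning-bearing tokens.
weights = relevance / relevance.sum()
sar_uncertainty = float((weights * surprisal).sum())

print(f"Unweighted predictive entropy: {surprisal.sum():.3f}")
print(f"Relevance-weighted (SAR-style) uncertainty: {sar_uncertainty:.3f}")
```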

Approach 2: Supervised Learning for Uncertainty Estimation

Another approach [2] leverages supervised learning to calibrate uncertainty estimates. This involves creating a labeled dataset where each model output is scored based on its correctness relative to a true response. Features extracted from the model's internal states (e.g., hidden layer activations) are then used to train a supervised model to predict these scores. This method, applicable to both white-box and black-box LLMs, helps understand how well the model's internal representations capture uncertainty.
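As a rough illustration of the supervised approach, the sketch below trains a simple classifier to predict whether an output is correct from stand-in features. In practice the features would come from hidden-layer activations and/or token probabilities, and the labels from grading model outputs against reference answers; the synthetic data, feature dimensions, and model choice here are assumptions for illustration only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Hypothetical setup: each row is one prompt/response pair, with features
# extracted from the LLM (e.g., mean final-layer hidden activations plus the
# response's average token log-probability) and a 0/1 label for whether the
# response matched a reference answer. Synthetic stand-in data is used here.
rng = np.random.default_rng(0)
n_samples, n_features = 500, 64
X = rng.normal(size=(n_samples, n_features))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=n_samples) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# The supervised "uncertainty head": its predicted probability of correctness
# serves as a confidence score for new, unlabeled outputs.
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
confidence = clf.predict_proba(X_test)[:, 1]
print("AUROC of predicted confidence vs. actual correctness:",
      round(roc_auc_score(y_test, confidence), 3))
```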

Approach 3: BLoB - Bayesianization for Fine-Tuning LLMs

The Bayesian Low-Rank Adaptation by Backpropagation (BLoB) [3] approach combines Bayesian methods with parameter-efficient fine-tuning techniques to improve uncertainty estimation. BLoB allows for simultaneous estimation of both the mean and covariance of LLM parameters during fine-tuning, enhancing the model's reliability and generalization. This approach addresses the limitations of traditional fine-tuning methods, which often fail to maintain accurate uncertainty estimates after domain-specific training.
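The sketch below is a highly simplified, hypothetical illustration of the core idea: place a Gaussian variational posterior over a low-rank weight update and learn its mean and variance by backpropagation, so that repeated forward passes yield a predictive distribution. It omits the priors, KL regularization, and other details of the actual BLoB method.

```python
import torch
import torch.nn as nn

class BayesianLoRALinear(nn.Module):
    """Simplified sketch of a Bayesian low-rank adapter.

    A frozen base layer is adapted by a low-rank update B @ A, where the
    entries of A get a Gaussian variational posterior (mean and log-variance
    trained by backpropagation via the reparameterization trick).
    """

    def __init__(self, base: nn.Linear, rank: int = 8):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                      # keep base weights frozen
        in_f, out_f = base.in_features, base.out_features
        self.A_mean = nn.Parameter(torch.zeros(rank, in_f))
        self.A_logvar = nn.Parameter(torch.full((rank, in_f), -6.0))
        # Small random init so the demo below shows non-zero spread;
        # LoRA convention often initializes B to zero.
        self.B = nn.Parameter(0.1 * torch.randn(out_f, rank))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Sample A ~ N(mean, var) with the reparameterization trick, so the
        # adapted weight (and hence the output) is stochastic.
        std = torch.exp(0.5 * self.A_logvar)
        A = self.A_mean + std * torch.randn_like(std)
        return self.base(x) + x @ A.T @ self.B.T

# Repeated stochastic forward passes give a predictive distribution whose
# spread can be read as uncertainty for a given input.
layer = BayesianLoRALinear(nn.Linear(16, 4), rank=2)
x = torch.randn(1, 16)
samples = torch.stack([layer(x) for _ in range(10)])
print("Predictive std per output dimension:", samples.std(dim=0))
```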

Practical Recommendations for Healthcare

Data Integration and Workflow Compatibility: Evaluating how well an LLM integrates with existing healthcare data systems and workflows is crucial. Using redacted or publicly available datasets, such as those from clinicaltrials.gov or Drugs@FDA, can simulate real-world performance. The tool must seamlessly fit into existing electronic health record systems, supporting clinicians in their decision-making processes without adding significant steps.

Ongoing Calibration and Monitoring: Given the probabilistic nature of LLMs, continuous calibration and monitoring are necessary to maintain performance. Implementing feedback loops where subject matter experts (SMEs) and other end users can report inaccuracies and provide corrections can help refine the model over time, as sketched below.
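As a sketch of what such a feedback loop could look like in code (class names, fields, and thresholds are illustrative assumptions, not a specific product's API), a rolling acceptance rate computed from SME reviews can trigger a recalibration check:

```python
from collections import deque

class FeedbackMonitor:
    """Rolling acceptance-rate check over the most recent SME reviews."""

    def __init__(self, window: int = 100, alert_threshold: float = 0.90):
        self.alert_threshold = alert_threshold
        self._recent = deque(maxlen=window)

    def record(self, output_id: str, accepted: bool, correction: str = "") -> None:
        # A real system would also store the correction text for later
        # fine-tuning, prompt updates, or retrieval fixes; here we only
        # track whether the output was accepted.
        self._recent.append(accepted)

    def needs_recalibration(self) -> bool:
        if not self._recent:
            return False
        acceptance_rate = sum(self._recent) / len(self._recent)
        return acceptance_rate < self.alert_threshold

monitor = FeedbackMonitor(window=50, alert_threshold=0.95)
monitor.record("resp-001", accepted=False, correction="Dosage should be 10 mg")
print(monitor.needs_recalibration())  # True: acceptance rate is below threshold
```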

References:

[1] Duan, Jinhao, et al. "Shifting attention to relevance: Towards the uncertainty estimation of large language models." arXiv preprint arXiv:2307.01379 (2023). Link: https://arxiv.org/pdf/2307.01379

[2] Liu, Linyu, et al. "Uncertainty Estimation and Quantification for LLMs: A Simple Supervised Approach." arXiv preprint arXiv:2404.15993 (2024). Link: https://arxiv.org/pdf/2404.15993

[3] Wang, Yibin, et al. "BLoB: Bayesian Low-Rank Adaptation by Backpropagation for Large Language Models." arXiv preprint arXiv:2406.11675 (2024). Link: https://arxiv.org/pdf/2406.11675
