RELAI Sets New State-of-the-Art for LLM Hallucination Detection
Nov 13, 2024
Summary
Performance of various hallucination detection agents on GPT-4o responses in OpenAI’s SimpleQA dataset. RELAI establishes a new state-of-the-art for LLM verification. You can try RELAI’s real-time verification agents at relai.ai
- SimpleQA Dataset: OpenAI has recently released a new fact-based dataset revealing high hallucination rates in top LLMs such as GPT-4o and Claude-3.5-Sonnet.
- RELAI’s Verification Agents: These specialized agents automatically detect and flag hallucinations in LLM outputs in real time.
- RELAI establishes the new state-of-the-art in hallucination detection: For GPT-4o, RELAI achieves a 76.5% detection rate at a 5% false positive rate and a 28.6% detection rate at a 0% false positive rate. RELAI outperforms existing baselines by significant margins.
- Try it out yourself: RELAI agents are accessible for individual and enterprise users on relai.ai.
Introduction to SimpleQA Dataset
On October 30, 2024, OpenAI released the SimpleQA dataset, a robust benchmark for evaluating factuality in short, fact-seeking queries. SimpleQA focuses on short-answer questions across diverse topics. The dataset minimizes ambiguity and has undergone thorough quality checks, making it an ideal testing ground for detecting “hallucinations”: incorrect or fabricated answers from language models.
Here is an example from the dataset:
Prompt: How many times did Bil Keane win Best Syndicated Panel by the National Cartoonists Society's Award?
Ground truth: four times
High Hallucination Rates of Top LLMs on SimpleQA
Although the dataset is titled “Simple”QA, there is nothing simple about it, even for top LLMs. In fact, OpenAI’s analysis, available here, shows that top LLMs all struggle on this dataset, leading to high rates of hallucination.
In our analysis, we focus on two top LLMs, GPT-4o and Claude-3.5-Sonnet (ver. 20241022), and evaluate them on 200 prompts randomly sampled from the dataset. The table below summarizes their performance:
| Base LLM | Accuracy | Refusal Rate | Hallucination Rate |
| --- | --- | --- | --- |
| GPT-4o | 39.5% | 1.0% | 59.5% |
| Claude-3.5-Sonnet | 29.5% | 36.0% | 34.5% |
Performance of GPT-4o and Claude-3.5-Sonnet on the SimpleQA Dataset
Refusal rate refers to the fraction of samples for which the base model does not provide an answer (whether correct or incorrect). When evaluating hallucination detection methods, we focus solely on cases where the base LLM produces a response: when the model abstains, there is no meaningful output to flag as either a hallucination or correct. We note that these results align with OpenAI’s own insights, validating our experimental setup.
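To make this bookkeeping concrete, here is a minimal Python sketch of how the rates in the table above can be tallied and how abstentions are filtered out before evaluating detectors. The label names ("correct", "incorrect", "not_attempted") are our own shorthand for the grading outcomes, not part of RELAI’s or OpenAI’s code.

```python
from collections import Counter

def summarize_base_model(grades: list[str]) -> dict[str, float]:
    """Tally accuracy, refusal rate, and hallucination rate from graded responses.
    Each grade is one of 'correct', 'incorrect', or 'not_attempted'."""
    counts = Counter(grades)
    n = len(grades)
    return {
        "accuracy": counts["correct"] / n,
        "refusal_rate": counts["not_attempted"] / n,
        "hallucination_rate": counts["incorrect"] / n,
    }

def answered_only(grades: list[str]) -> list[str]:
    """Drop abstentions: hallucination-detection metrics are computed only on
    samples where the base model actually produced an answer."""
    return [g for g in grades if g != "not_attempted"]
```

For the GPT-4o row in the table above, this tally would return an accuracy of 0.395, a refusal rate of 0.010, and a hallucination rate of 0.595.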
Here is an example of a GPT-4o hallucination on a sample from the dataset:
Prompt: How many times did Bil Keane win Best Syndicated Panel by the National Cartoonists Society's Award?
Ground truth: four times
GPT-4o: Bil Keane won the National Cartoonists Society's Award for Best Syndicated Panel three times.
The high frequency of hallucinations even in top LLMs highlights the need for verification tools, especially in critical domains like healthcare and finance where reliability is essential.
RELAI's LLM Verification Agents
Recently, RELAI introduced LLM verification agents that detect and flag hallucinations in LLM outputs in real time, enhancing reliability in critical fields where factual accuracy is essential.
LLM hallucinations arise from a range of complex factors, from training data and input tokenization to model architecture. To tackle these challenges, RELAI’s verification framework includes diverse and complementary verification agents, each with distinct capabilities for robust detection:
- Hallucination Verifier Agent: This agent analyzes statistical patterns in the LLM’s generated distribution, detecting potential hallucinations by flagging statistical cues that indicate a lack of factual grounding.
- LLM Verifier Agent: Using RELAI’s proprietary LLM as an auxiliary model, this agent cross-references the original response to identify inconsistencies, flagging answers that may contain factual inaccuracies.
- Grounded LLM Verifier Agent: This agent retrieves and compares information from reliable, pre-approved sources, matching LLM-generated answers against these references to add an extra verification layer.
These agents have two operating modes that can be set by the user. In the “regular mode” (default), the agent only targets major inaccuracies in the response, while in the “strong mode”, the agent conducts a deeper analysis, identifying even minor inaccuracies.
Since these agents use complementary signals to detect hallucinations, it is useful to view their combination as an ensemble verification agent. We consider two cases:
- RELAI Ensemble Verifier-I: This agent flags a hallucination only when all individual agents detect one.
- RELAI Ensemble Verifier-U: This agent flags a hallucination when at least one of the individual agents detects one.
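The combination logic of the two ensembles can be summarized in a few lines of Python. This is an illustrative sketch of the intersection/union rule only, not RELAI’s implementation; the function names and example flags are ours.

```python
def ensemble_verifier_i(agent_flags: list[bool]) -> bool:
    """Ensemble Verifier-I (intersection): flag a hallucination only if
    every individual agent flags one."""
    return all(agent_flags)

def ensemble_verifier_u(agent_flags: list[bool]) -> bool:
    """Ensemble Verifier-U (union): flag a hallucination if at least one
    individual agent flags one."""
    return any(agent_flags)

# Example verdicts for one response from the three agents
# (Hallucination Verifier, LLM Verifier, Grounded LLM Verifier):
flags = [True, True, False]
print(ensemble_verifier_i(flags))  # False: the agents do not all agree
print(ensemble_verifier_u(flags))  # True: at least one agent flagged it
```

By construction, Verifier-I is the more conservative of the two, which is why it operates at a near-zero false positive rate in the results below, while Verifier-U favors a higher detection rate.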
Together, RELAI’s verification agents provide a comprehensive solution to hallucination detection, with each agent focusing on a unique aspect—statistical cues, cross-referencing, or source verification—to ensure a multi-layered, dependable verification process.
Evaluation Setup
We evaluate RELAI’s verification agents as well as several existing baseline methods for hallucination detection on the SimpleQA dataset. When evaluating hallucination detection methods, two metrics are essential:
- Detection rate (true positive rate): the percentage of incorrect responses from the base LLM that are correctly flagged as hallucinations.
- False positive rate: the percentage of correct responses from the base LLM that are incorrectly flagged as hallucinations.
The ideal hallucination detector would achieve a 100% detection rate with a 0% false positive rate.
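As a reference for how these two metrics are computed, here is a small Python sketch. It assumes abstentions have already been removed and that a ground-truth correctness label is available for each response; the function and variable names are ours.

```python
def detection_and_false_positive_rate(
    is_hallucination: list[bool], flagged: list[bool]
) -> tuple[float, float]:
    """Detection rate (TPR) and false positive rate of a hallucination detector.

    is_hallucination[i]: True if the base model's i-th response is incorrect.
    flagged[i]:          True if the detector flagged the i-th response.
    """
    tp = sum(h and f for h, f in zip(is_hallucination, flagged))
    fp = sum((not h) and f for h, f in zip(is_hallucination, flagged))
    num_hallucinations = sum(is_hallucination)
    num_correct = len(is_hallucination) - num_hallucinations
    detection_rate = tp / num_hallucinations if num_hallucinations else 0.0
    false_positive_rate = fp / num_correct if num_correct else 0.0
    return detection_rate, false_positive_rate
```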
A key advantage of RELAI’s LLM Verification agents is that they also provide explanations for flagged responses, detailing why a response may contain a hallucination. When the base model's response is flagged, users can review the agent's rationale and take informed action. This user-focused approach builds confidence in RELAI’s agent responses, offering transparency that surpasses other baseline models, which often provide only a simple label for hallucination detection.
Below is a hallucination example of GPT-4o that was flagged by RELAI verification agents.
Prompt: How many times did Bil Keane win Best Syndicated Panel by the National Cartoonists Society's Award?
Ground truth: four times
GPT-4o: Bil Keane won the National Cartoonists Society's Award for Best Syndicated Panel three times.
RELAI's LLM Verifier: Bil Keane won the National Cartoonists Society's Award for Best Syndicated Panel four times, not three.
RELAI's Hallucination Verifier: The claim that Bil Keane won the National Cartoonists Society's Award for Best Syndicated Panel three times is unsupported. You should cross-verify this information.
RELAI's Grounded LLM Verifier: The response is inaccurate. Bil Keane won the National Cartoonists Society's Award for Best Syndicated Panel four times, not three, in the years 1967, 1971, 1973, and 1974.
In this example from SimpleQA, all three of RELAI’s verification agents flag a hallucination.
For our numerical experiments, we convert each agent’s response into a binary label indicating whether the base model’s response contains a hallucination.
Baseline methods
In our experiments, we include three existing baselines: SelfCheckGPT with NLI [ref], SelfCheckGPT with LLM Prompt [ref], and INSIDE [ref]. We also tested FAVA methods [ref], but since they performed poorly on this dataset, we did not include them in our subsequent analysis.
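For readers unfamiliar with the sampling-based baselines, the sketch below illustrates the general idea behind SelfCheckGPT with NLI: resample several alternative answers from the base model, then score how strongly they contradict the original answer using an off-the-shelf NLI model. This is a simplified illustration under our own assumptions (choice of NLI model, simple averaging, and a placeholder threshold), not the baseline authors’ code or the exact configuration used in our experiments.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

NLI_MODEL = "microsoft/deberta-large-mnli"  # assumed choice of NLI model
tokenizer = AutoTokenizer.from_pretrained(NLI_MODEL)
model = AutoModelForSequenceClassification.from_pretrained(NLI_MODEL)

def contradiction_prob(premise: str, hypothesis: str) -> float:
    """Probability that `premise` contradicts `hypothesis` under the NLI model."""
    inputs = tokenizer(premise, hypothesis, return_tensors="pt", truncation=True)
    with torch.no_grad():
        probs = torch.softmax(model(**inputs).logits, dim=-1)[0]
    # Look up the contradiction class index from the model config.
    contra_idx = next(i for i, name in model.config.id2label.items()
                      if name.lower() == "contradiction")
    return probs[contra_idx].item()

def selfcheck_nli_score(answer: str, sampled_answers: list[str]) -> float:
    """Average contradiction score of `answer` against resampled answers;
    higher values suggest the answer is more likely hallucinated."""
    scores = [contradiction_prob(sample, answer) for sample in sampled_answers]
    return sum(scores) / len(scores)

# Flag a response as a hallucination if the score exceeds a tuned threshold
# (0.5 here is an assumed placeholder, not a value from our experiments).
# is_hallucination = selfcheck_nli_score(answer, samples) > 0.5
```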
Results
First, we evaluate hallucination detection methods on GPT-4o responses in the SimpleQA dataset. The figure below illustrates the detection rate versus the false positive rate for various methods.
In the figure, “optimal” refers to a method that correctly flags all hallucinations without any false positives. We make several observations from this figure:
- At a false positive rate of around 5%, RELAI’s Grounded LLM Verifier achieves a 76.5% detection rate.
- At an almost 0% false positive rate, RELAI’s Ensemble Verifier-I achieves a 28.6% detection rate. This is notable because adding this agent can cut the base model’s hallucinations by nearly a third (for GPT-4o, from a 59.5% hallucination rate to roughly 42.5%) without introducing any false positives.
- Across different regimes of false positive rates, RELAI significantly outperforms existing baselines.
A key factor for any verification method is its generalizability. Would the same agents that succeeded on GPT-4o perform as effectively on another base model? To test the generalizability of RELAI's agents, we selected another popular LLM: Claude-3.5-Sonnet. The figure below illustrates the performance of various methods on Claude's responses within the same SimpleQA dataset.
We observe similar trends in RELAI agent performance on Claude-3.5-Sonnet as seen with GPT-4o.
- At a false positive rate of around 10%, RELAI’s Grounded LLM Verifier achieves an 81% detection rate.
- At an almost 0% false positive rate, RELAI Ensemble Verifier-I achieves a 27.5% detection rate.
- Across different regimes of false positive rates, RELAI significantly outperforms existing baselines.
How to use RELAI’s verification agents
RELAI provides an easy-to-use platform for these real-time hallucination detection agents. You simply select a base model (e.g., GPT-4o) to chat with and add one or more verification agents to flag potential hallucinations in the model’s responses in real time. Below is a figure illustrating how RELAI agents work in practice:
Conclusion
RELAI's LLM verification agents set a new standard in hallucination detection, significantly surpassing existing methods. These agents are accessible to individual users on relai.ai. RELAI also offers API access for enterprises, enabling seamless integration with their AI reliability solutions.