AI Chatbots Struggle at Fact-Checking, but Curated Evidence Can Help
Can AI chatbots reliably tell you whether a political claim is true or false? And if not, what would it take to make them trustworthy fact-checkers?
A new preprint from Stanford and peer institutions tackles these questions, evaluating 15 large language models from OpenAI, Google, Meta, and DeepSeek on more than 6,000 claims fact-checked by PolitiFact over an 18-year period. Each model was asked to predict the rating assigned by PolitiFact's professional fact-checkers, choosing one of six labels on PolitiFact's Truth-O-Meter scale, from "True" to "Pants on Fire." This makes the task far more demanding than a simple true-or-false judgment, requiring models to draw fine-grained distinctions about the accuracy of a claim.
The study, led by Matthew DeVerna, a postdoctoral researcher at Stanford's Tech Impact and Policy Center, finds that today's leading models perform poorly, even when equipped with advanced reasoning and web search capabilities. The key to better performance, the researchers found, lies not in smarter models but in giving them access to high-quality, curated evidence.
Standard models fall short
When models relied solely on their built-in knowledge, they all performed poorly. Macro F1 scores, a metric that balances performance across all veracity labels, ranged from roughly 0.1 to 0.3, well below what would be needed for reliable fact-checking. That is a concern, given that millions of users already turn to chatbots to verify information. A model that confidently mislabels a claim could reinforce the very misinformation it is being asked to check.
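To make the metric concrete, here is a minimal sketch of how macro F1 is computed: an F1 score is calculated for each label separately, then averaged with equal weight, so rare labels like "Pants on Fire" count as much as common ones. (The function and example data are illustrative, not taken from the study.)

```python
def macro_f1(y_true, y_pred, labels):
    """Average the per-label F1 scores, weighting every label equally."""
    scores = []
    for label in labels:
        # Count true positives, false positives, and false negatives for this label.
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if (precision + recall) else 0.0)
        scores.append(f1)
    return sum(scores) / len(scores)

# Toy example with two of the six Truth-O-Meter labels.
gold = ["True", "True", "False", "False"]
pred = ["True", "False", "False", "False"]
score = macro_f1(gold, pred, labels=["True", "False"])
```

Because the average weights labels rather than examples, a model that only ever predicts the most common label scores poorly, which is why macro F1 is a stricter test than raw accuracy on a six-way scale.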
Reasoning and web search are not enough
The researchers also tested models with advanced reasoning capabilities and built-in web search. Reasoning offered minimal improvement, gaining just 0.06 points on average, and in some cases performance actually declined slightly.
Web search helped more, but results were inconsistent. OpenAI's search-enabled models improved moderately, often citing PolitiFact or other credible sources. Google's models, however, struggled to retrieve useful information and rarely surfaced relevant citations at all. The researchers note that the effectiveness of search depends on how queries are formulated, which sources are prioritized, and how retrieved information is integrated into a model's response.
A curated approach delivers dramatic gains
If reasoning and web search fall short, what does help? The researchers used a method called retrieval-augmented generation (RAG) to test whether giving models direct access to high-quality evidence would improve performance. Specifically, they built a curated database of summaries of PolitiFact fact-checking articles and retrieved the most relevant summaries to provide alongside each claim the model was asked to evaluate.
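The pipeline can be sketched in two steps: rank the curated summaries by relevance to the claim, then place the top results in the prompt ahead of the claim. This is a simplified illustration, not the study's implementation; word-overlap scoring here stands in for whatever embedding-based retrieval a production system would use, and the prompt wording is hypothetical.

```python
def retrieve_top_k(claim, summaries, k=3):
    """Rank fact-check summaries by word overlap with the claim
    (a toy stand-in for embedding-based similarity search)."""
    claim_words = set(claim.lower().split())
    scored = sorted(
        summaries,
        key=lambda s: len(claim_words & set(s.lower().split())),
        reverse=True,
    )
    return scored[:k]

def build_prompt(claim, evidence):
    """Assemble a RAG-style prompt: retrieved evidence first, then the claim."""
    evidence_lines = "\n".join(f"- {e}" for e in evidence)
    return (
        f"Evidence from fact-checking articles:\n{evidence_lines}\n\n"
        f"Claim: {claim}\n"
        "Rate the claim on PolitiFact's six-point scale "
        "(True, Mostly True, Half True, Mostly False, False, Pants on Fire)."
    )

summaries = [
    "Unemployment fell steadily through 2021 according to BLS data.",
    "The Apollo 11 moon landing occurred in July 1969.",
    "Vaccines sharply reduce hospitalization rates.",
]
evidence = retrieve_top_k("Did unemployment fall through 2021?", summaries, k=1)
prompt = build_prompt("Did unemployment fall through 2021?", evidence)
```

The key design point is that the model's job shrinks from recalling facts to reading supplied evidence, which is exactly the shift the researchers credit for the performance gains.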
The improvement was striking. On average, accuracy improved by 233 percent across all model variants. The best-performing setup achieved a macro F1 score of 0.90, up from just 0.27 without the curated context.
"The key limitation of LLMs is not how models reason over information, but whether they have access to the right information in the first place," the researchers wrote.
Bias in AI-generated citations
The study also uncovered patterns in how search-enabled models select their sources. PolitiFact was the most frequently cited source, followed by outlets like the Associated Press and CNN. While these sources were overwhelmingly credible (nearly 99 percent scored as "Generally Credible" by NewsGuard), the overall citation mix skewed to the left of the political spectrum. This pattern persisted even after removing PolitiFact from the analysis. Whether it reflects biases in the models, their search pipelines, or the broader information ecosystem remains an open question, but the finding raises important considerations for deploying AI in politically sensitive contexts.
Limitations
The researchers note several caveats. The study relies solely on PolitiFact, and results may not generalize to other fact-checking organizations or types of claims. Most of the tested claims predate the evaluation, so model performance on breaking news, where information is incomplete or rapidly evolving, could be worse. And because the study explicitly tests mainstream commercial models, results may shift as providers update their models over time.
A path forward
For researchers and system designers, these findings suggest that augmenting large language models with a curated, evidence-based database is currently the most promising approach for automated fact checking. Building such systems at scale, though, is no small task. They must cover diverse topics, update continuously, and manage conflicting or incomplete evidence.
For everyday users, the takeaway is simpler: exercise caution when using chatbots to verify claims, particularly about politics.
Today’s LLMs, particularly those with web search, offer a glimpse of what is possible. Turning that potential into reliable information verification at scale will depend on improving how high-quality evidence is found and delivered to these systems.
DeVerna co-authored the study with Kai-Cheng Yang of Binghamton University, Harry Yaojun Yan of Texas A&M University, and Filippo Menczer of Indiana University. The work was supported in part by the Institute for Humane Studies and the Knight Foundation.