AI Chatbots Becoming More Powerful, but Less Reliable, Study Finds
A new study published in Nature reveals that as artificial intelligence (AI) chatbots become more sophisticated, they also become more prone to providing inaccurate information. The research examined leading large language models (LLMs), including OpenAI’s GPT and Meta’s LLaMA along with BigScience’s open-source BLOOM, and found that while these models are increasingly capable of answering complex questions, they are also more likely to fabricate answers than to admit they do not know.
José Hernández-Orallo, a coauthor of the study from the Valencian Research Institute for Artificial Intelligence, noted that LLMs now attempt to answer almost every query posed to them, which has increased the number of both correct and incorrect responses. Mike Hicks, a philosopher at the University of Glasgow who was not involved in the study, described this behavior as “bullshitting”: the models pretend to be knowledgeable even when they lack accurate information.
The researchers tested the models on topics including mathematics and geography, and asked them to perform tasks such as listing information in alphabetical order. While the larger, more powerful models generally gave the most accurate responses, their accuracy dropped sharply on harder questions. Notably, OpenAI’s GPT-4 and o1 were among the biggest offenders, willing to answer almost any question even when they were wrong.
A concerning trend emerged across all of the studied LLMs, including Meta’s LLaMA family, in which no model reached 60 percent accuracy on even the easiest questions. Moreover, as the models grew in size and complexity, the proportion of wrong answers they gave also increased.
The study also points to a problem with human perception of AI capabilities. The researchers found that people may overlook these models’ flaws because they are impressed by their ability to handle sophisticated problems: participants asked to judge the accuracy of chatbot answers got it wrong between 10 and 40 percent of the time.
To address these issues, the researchers propose making LLMs less eager to answer every question, for instance by setting a threshold at which a model declines a challenging query rather than guessing. However, this solution may not align with the interests of AI companies, as it could expose the technology’s limitations.
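The idea can be made concrete with a short sketch. The Python below is a minimal, hypothetical illustration rather than anything specified in the paper: the generate_with_confidence function, the confidence score, and the 0.7 threshold are all assumed stand-ins for however a real system would estimate its own uncertainty.

```python
# Hypothetical sketch of the proposed refusal threshold. The function
# generate_with_confidence() and the threshold value are illustrative
# assumptions, not part of the study or of any real model API.

REFUSAL_THRESHOLD = 0.7  # tunable: higher values make the model decline more often


def generate_with_confidence(prompt: str) -> tuple[str, float]:
    """Stand-in for a model call returning (answer, confidence), where
    confidence in [0, 1] might come from token log-probabilities or a
    separately calibrated verifier."""
    raise NotImplementedError("replace with a real model call")


def answer(prompt: str) -> str:
    response, confidence = generate_with_confidence(prompt)
    if confidence < REFUSAL_THRESHOLD:
        # Decline rather than risk fabricating an answer.
        return "I'm not confident enough to answer that."
    return response
```

The trade-off the researchers acknowledge is visible here: raising the threshold yields fewer wrong answers but more refusals, making the system’s limits plainly visible to users.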
This research raises significant concerns about the trustworthiness of increasingly sophisticated AI chatbots and highlights the need to balance capability gains against reliability. As these systems continue to evolve, developers, users, and policymakers will need to weigh the implications of AI-generated information and work toward more transparent and accurate AI communication tools.