OpenAI released ChatGPT one year ago, kicking off a boom in generative artificial intelligence. Since then, the advances have come at a furious pace. Large language models, which underlie chatbots, can now be used to analyze data, create charts and graphs, interpret photographs and converse in real time with audio features.
But at least one fundamental flaw remains: Large language models make stuff up.
These errors have become known as hallucinations, although some have argued “confabulations” is a more appropriate term. The possibility of mistakes is an obstacle to deploying LLMs, especially in settings where accuracy is crucial, such as health care, finance, law and education. Dave McKay, chief executive of Royal Bank of Canada, said at an event in November that the technology is not ready for prime time because it is not reliable enough.
These quirks are no mere bug, however; they’re an inherent feature of how these models work. That means solving the problem could require new approaches. Researchers are trying a range of techniques to reduce errors and improve reliability, such as by building bigger models, adding in source citations, drawing on external databases for information and prompting LLMs to fact-check themselves before providing responses. Nothing has proven to be a silver bullet.
LLMs get things wrong partly because the data on which they are trained, much of it pulled from the internet, may contain errors, biases and falsehoods that inevitably surface in their output. And to produce text, these AI models look at a string of words and perform a calculation to predict the next one. LLMs have no concept of whether the output is factual or even logical, which is where things can go off the rails.
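The next-word mechanic can be illustrated with a toy counting model. Real LLMs use neural networks trained on vast datasets, not lookup tables, but the basic prediction step is analogous: pick the likeliest continuation, with no regard for truth.

```python
from collections import Counter, defaultdict

# Train a toy bigram model: for each word, count which word follows it.
corpus = "the cat sat on the mat and the cat slept".split()
following = defaultdict(Counter)
for current_word, next_word in zip(corpus, corpus[1:]):
    following[current_word][next_word] += 1

def predict_next(word):
    # Return the statistically most frequent successor. The model has no
    # notion of whether the continuation is factual, only that it is likely.
    return following[word].most_common(1)[0][0]

print(predict_next("the"))  # "cat", the most common successor in this corpus
```

The model will cheerfully emit a fluent continuation whether or not the resulting sentence is true, which is the root of the hallucination problem.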
“They’re just not reasoning about what they’re actually saying. They’re just essentially parroting from all the data that they’ve observed before,” said Joelle Pineau, a vice-president of AI research at Meta Platforms Inc. META-Q in Montreal.
LLM developers have tried to make the models more accurate through something called reinforcement learning from human feedback, or RLHF, which relies on people to evaluate responses, rewrite answers and correct factual errors. While these annotations help teach the model to produce higher-quality outputs and have contributed to a big leap in performance, LLMs are still error-prone.
That hasn’t stopped corporations from rolling out AI applications, particularly chatbots and knowledge assistants that can answer questions. To further limit mistakes, many companies are using an emerging technique called retrieval augmented generation, or RAG. Broadly, the term refers to using an AI model to source information outside of its training data, similar to how you might pull up Wikipedia when stumped about a fact.
Connecting a chatbot to the internet, as OpenAI has done with the paid version of ChatGPT, is a form of RAG. More commonly for corporations, RAG involves pulling information from proprietary documents and material, such as a customer service chatbot tapping into documents outlining policies or transcripts of past conversations to produce more accurate answers.
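A minimal sketch of the pattern: retrieve the most relevant document, then fold it into the prompt. Here word overlap stands in for the embedding search real systems use, and a prompt builder stands in for the model call; the documents and function names are illustrative, not any vendor's API.

```python
# Hypothetical internal documents a customer service chatbot might search.
documents = [
    "Refunds are available within 30 days of purchase with a receipt.",
    "Our support line is open Monday to Friday, 9 a.m. to 5 p.m.",
]

def retrieve(query):
    # Score each document by shared words, a crude stand-in for the
    # vector-embedding similarity search used in production RAG systems.
    def overlap(doc):
        return len(set(query.lower().split()) & set(doc.lower().split()))
    return max(documents, key=overlap)

def build_prompt(query):
    # Grounding the answer in retrieved text, rather than relying on
    # whatever the model absorbed during training, is the core of RAG.
    context = retrieve(query)
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

print(build_prompt("Are refunds available within 30 days?"))
```

The LLM still generates the final answer, but it is steered toward the retrieved policy text instead of improvising from memory.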
“The reality is large language models are not good databases,” said Nick Frosst, a co-founder of Toronto-based LLM developer Cohere. Where people can go wrong is by treating LLMs as a de facto source of truth, when in reality, the answers they need could be located elsewhere. That’s a gap that RAG can help bridge.
“We’ve switched from using the model as an oracle,” he added, “to using it as an interface for a different source of truth.”
Cohere has long believed in the potential for RAG to make generative AI useful for enterprise customers, even hiring Patrick Lewis, the lead author of a seminal 2020 paper on RAG from Meta’s AI division. In September, Cohere announced a way for developers to more easily use RAG with one of its large language models. “This is a real step towards addressing the hallucination problem,” Mr. Frosst said.
Cohere has taken other measures to improve reliability, too. Its chatbot, dubbed Coral, produces source citations when connected to the web so that users can check the facts themselves.
Telus Corp. T-T is taking a measured approach to generative AI, in part because of the possibility of hallucinations. In April, the telecom formed a generative AI board that includes CEO Darren Entwistle to guide the company’s approach and later created a data-secure environment for employees to experiment with the technology.
For the past couple of months, Telus has been testing a chatbot to help customer service agents field questions. The chatbot uses RAG to pull information from the company’s internal documents, and it has already cut the training time for new call centre agents by two weeks, said chief information officer Hesham Fahmy. It hasn’t been rolled out to every call centre employee, and since its accuracy rate is about 85 per cent, the bot does not interact directly with customers yet.
“You never want to give a customer a wrong answer,” Mr. Fahmy said.
But the mistakes are not necessarily hallucinations; some stem from errors in the underlying data itself. “I don’t have the scientific proof of it,” he said, “but if we fix all that bad data, the correctness would have gone north of 95 per cent.”
Recently, a U.S. AI company called Vectara, which was founded by former Google employees, attempted to pin down just how often LLMs hallucinate by building an AI model to check how accurately popular language models can summarize documents.
The results, released in November, showed that OpenAI’s GPT-4 performed the best with a 3 per cent hallucination rate. Two Google models were the worst, with one flubbing it more than 27 per cent of the time by inventing details that did not appear in the source material. Cohere, meanwhile, scored about in the middle of the pack, with one of its models hallucinating 7.5 per cent of the time. (Ironically, the AI model Vectara built to classify hallucinations isn’t 100 per cent accurate either, the company noted.)
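The arithmetic behind such a benchmark is straightforward once a judge has labelled each summary. Vectara used an AI classifier as the judge; the sketch below uses made-up labels, not Vectara's data, to show how the reported rates are computed.

```python
def hallucination_rate(judgments):
    # judgments: list of booleans, True meaning the judge flagged the
    # summary for inventing details not present in the source document.
    return sum(judgments) / len(judgments)

# Illustrative only: a model that invented details in 3 of 40 test summaries
# would score 7.5 per cent, around the middle of the reported pack.
judgments = [True] * 3 + [False] * 37
print(f"{hallucination_rate(judgments):.1%}")  # 7.5%
```

Because the judge is itself an imperfect AI model, the reported rates carry some measurement error, as Vectara acknowledged.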
For Vectara chief technology officer Amin Ahmad, the results were actually pretty good, with most models hallucinating less than 10 per cent of the time. “My prediction is that within the next year or two, this is more or less going to be a solved problem,” he said.
Key to doing so is building bigger AI models. Researchers often talk about the number of parameters a model possesses, meaning the adjustable numerical values the model learns during training. While far from a perfect analogy, parameters in an AI model can be thought of as similar to the number of neural connections in the brain. GPT-3, for example, has 175 billion parameters. As models have gotten bigger, hallucination rates have fallen, Mr. Ahmad said.
The accuracy of today’s models allows them to be useful in some cases, such as customer service, but risky in others. “The business world is really waiting to get to the phase where they can automate away a lot of jobs and decision-making with these LLMs,” Mr. Ahmad said. “It’s important for them to understand clearly we’re not at that point.”
Health care is a particularly sensitive setting, though Google is trying to make inroads. Earlier this year, the company announced an updated version of its medical question-answering bot, Med-PaLM 2. Google’s research paper noted that physicians preferred the responses from the LLM over those provided by real doctors in most domains – though not all – and Med-PaLM 2 provided inaccurate or irrelevant information in some instances.
To improve accuracy, other researchers are tweaking how LLMs operate, such as by building in extra verification before the AI model spits out any text. In September, researchers at Meta published a paper outlining a method called chain-of-verification, prompting an LLM to essentially fact-check itself. The model first drafts a response to a query, concocts questions to check its work and then produces an improved response.
As an example, the researchers asked a language model (in this case, Meta’s own model, called Llama) for a list of politicians born in New York. The model first listed Hillary Clinton, Donald Trump and Michael Bloomberg. It then asked itself where each of those politicians was born. Afterward, the model revised its response to correctly identify Mr. Trump as the sole New York native of the three. For long-form responses, the Meta researchers found a 28-per-cent improvement in accuracy.
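The loop the researchers describe can be sketched with a stubbed model call. The `llm` function and its canned answers below are illustrative stand-ins for real API calls, and the revision step, which the paper has the model perform itself, is simplified here to a string check.

```python
def llm(prompt):
    # Stand-in for an actual LLM API call, returning canned answers.
    canned = {
        "List politicians born in New York.":
            "Hillary Clinton, Donald Trump, Michael Bloomberg",
        "Where was Hillary Clinton born?": "Chicago, Illinois",
        "Where was Donald Trump born?": "New York City",
        "Where was Michael Bloomberg born?": "Boston, Massachusetts",
    }
    return canned[prompt]

def chain_of_verification(names):
    draft = llm("List politicians born in New York.")   # 1. draft a response
    # 2. plan and execute one verification question per claimed name
    checks = {name: llm(f"Where was {name} born?") for name in names}
    # 3. revise: keep only the claims the verification answers support
    return [name for name, birthplace in checks.items()
            if "New York" in birthplace]

verified = chain_of_verification(
    ["Hillary Clinton", "Donald Trump", "Michael Bloomberg"])
print(verified)  # ['Donald Trump']
```

The extra draft-verify-revise round trips are why the technique costs more compute per answer than a single generation.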
These fact-checks can nevertheless fail and the model can still produce misleading information, the researchers noted. Plus, the approach requires more computing power and is more expensive as a result.
When it comes to fixing hallucinations in LLMs, Ms. Pineau says that RAG techniques, fine-tuning and bigger models with higher quality data can all help, but are not enough. “If you want them to start saying things that are correct, then you probably need other strategies,” she said.
One of her colleagues at Meta, chief AI scientist Yann LeCun, has argued that AI models require a fundamental redesign if they are ever to be truthful and reliable. Mr. LeCun, a Turing Award winner, is exploring a different approach that involves an AI model making judgments based on a general understanding – similar to how we make decisions – rather than by trying to predict individual words on a granular level.
Ms. Pineau doesn’t think it’s realistic for LLMs to be completely free of errors – humans make mistakes, and there is often genuine complexity in determining objective truth – nor does she believe hallucinations are the biggest issue. What’s important, she said, is to be educated on how to use LLMs appropriately given the limitations. “The responsibility to verify the information is still on us.”