Which AI is most trustworthy?

These tools hallucinate at dramatically different rates

None of these tools is perfect, but some are much better than others, and I’ve been much happier with ChatGPT since getting the paid version (ChatGPT-4). It can do a lot more things, which I’ll write about more in the future, and my general sense has been that the answers are better. This week, a study I highlight in the “News” section below helps quantify just how much better. But first, this tip is especially important:

Tip: Always, Always Double Check

A couple of weeks ago, I told you about an AI tool that I thought was more trustworthy than the others: Perplexity AI. I still think that, but I want to tell you a story about how it failed.

Because I have a background in biology and medical writing, I often help friends figure out confusing medical results. Last week, a friend’s wife got a weird lab report that had some terms I hadn’t heard before. In the past, I would have started my research on Google, but this time I started on Perplexity. And it made up a study result.

It gave me an accurate summary of the big-picture situation, but when I asked for hard percentages on how often the result means someone has cancer, it said 6.2% of the people with this result go on to be diagnosed with cancer, versus 0.9% for people with a different result.

Here’s what I still think is fabulous about Perplexity: It gave me its source for that data, and thankfully, I clicked through to check, because it had made up the answer. The data saying that 6.2% of people with this result go on to be diagnosed with cancer was for a similar-sounding but very different result. The paper Perplexity used as a source actually did have meaningful data for my friend’s test result (unfortunately, more like a 30% chance of cancer), but that’s not what Perplexity gave me in the summary.

The bottom line: always, always, always check the facts, especially if they’re important. I’ll still use Perplexity — it quickly got me to a great, relevant journal article. But because I had gotten so many good results in the last couple of weeks, I had been growing more comfortable getting facts from AI, and I worry that the presence of references will give people a false sense of confidence.

Providing references that don’t support a fact or argument is a known misinformation technique because seeing references gives people confidence, and most people don’t actually go and check every reference. Be vigilant, my friends.

News

AI tools hallucinate at dramatically different rates

Researchers asked various large language models to write short literature reviews of scholarly works, and these models made up sources that did not exist — sometimes a lot of sources.

This was the prompt:

I want you to act as an academic researcher. Your task is to write a paper of approximately 2000 words with parenthetical citations and a bibliography that includes at least 5 scholarly resources such as journal articles and scholarly books. The paper should respond to this question: ‘[paper topic].’

Overall, ChatGPT-4 made up 18% of the sources it cited, and ChatGPT-3.5 made up 55% of the sources it cited. (FIFTY-FIVE PERCENT. I had to write that again in words so I could visually shout it in horror.)

Interestingly, both models struggled mightily with book chapters. A full 70% of the book chapters provided by both ChatGPT-3.5 and ChatGPT-4 were hallucinations.

They both did best with full books: the hallucination rate* for books was “only” 8% for ChatGPT-4 and 23% for ChatGPT-3.5.

By comparison, a different project by the company Vectara reported the following general hallucination rates for the different models (much lower for the two GPT models than the bibliography/citation rates):

[Chart: Vectara hallucination rates by model]

It was particularly interesting that the hallucination rate can vary so much depending on what you are trying to do. Although these were two different studies, wrong answers seem to have been dramatically more likely when the models were asked to generate citations than when they were answering general queries.

Another example of how ChatGPT-4 is better than ChatGPT-3.5 is this tidbit from the paper: “Although we asked for at least 5 bibliographic citations, 12 of the 42 GPT-3.5 papers cite fewer than 5 works. Each of the GPT-4 papers does cite at least 5 works.”

Also fascinating is that even when the models provided “good” citations, the “works had 1 or more of 7 errors—incorrect PubMedID number, author name, article title, date, journal title, volume number, or page numbers—and … Errors in the numerical components of citations are especially common.” Nature and Vectara

Trust no one!

I kid, but not really. It’s not just AIs that give false or misleading information. This is a little “inside baseball,” but I ended up having to rewrite the previous section after doing a bit of fact checking. Read on if you’re interested.

As I was writing this piece, I combed the Nature paper and couldn’t find the general numbers in the chart I presented above. The chart first came from The AI Breakdown podcast, which had not named anyone behind the Nature paper and led up to the numbers with the line, “Now, going back to the overall accuracy and the general hallucination rate, this research from Vectara has GPT-4 at the top of the heap …”

This led me to believe Vectara was the entity behind the Nature paper and that the paper included these numbers.

Further, the AI Breakdown website summary of the news item embeds two tweets that reference the Nature paper and the Vectara numbers (a tweet and a connected reply), again making them look like they are from the same study. But they are not. The Vectara numbers are separate.

I had to click on the second of the embedded tweets and then read the comments to find the source for the numbers, which was a separate tweet from the AI company Vectara. The data and a write-up are hosted on GitHub.

These numbers seem perfectly legitimate, but data from a company should be viewed differently from data from a Nature research paper.

Further, I don’t think the host of The AI Breakdown podcast intended to mislead anyone. Spoken words tend to be more imprecise than written words, and the transition to the Vectara data may have just come out wrong. Or he may have believed the data was from the Nature paper because of the way the two tweets were linked. I don’t know, but I’m not writing this to say he was sloppy or misleading. I find the podcast to be a great source of information. Seriously, go subscribe! It’s just a reminder that subtle nuances of wording or visual presentation can confuse people, and needing to check facts isn’t new and isn’t limited to AI.

We can give AI a hard time for hallucinations, but good writers, reporters, and editors always check the facts. The ongoing need to check facts — whether they are from an AI, a podcast, or a viral tweet — is a great point to raise with your employer or your clients to help them understand why they still need humans in the process.

AI executive resigns in protest over companies’ copyright infringement

An important executive left an AI company this week, and I’m not talking about Sam Altman (although that was a big story too). StabilityAI’s head of audio, Ed Newton-Rex, resigned from the company and publicly criticized its stance on using copyrighted works to train its AI models without compensating creators. “Companies worth billions of dollars are, without permission, training generative AI models on creators’ works, which are then being used to create new content that in many cases can compete with the original works,” Newton-Rex wrote in a post on X announcing his resignation. Companies in the AI space, including StabilityAI, have recently submitted briefs to the US Copyright Office making arguments for why their work should fall under fair-use rules. Business Insider

The outcome of the copyright battles over which training data these companies can use, and how, will influence how quickly they can grow their capabilities and who gets paid for what.

What is AI Sidequest?

Using AI isn’t my main job, and it probably isn’t yours either. I’m Mignon Fogarty, and Grammar Girl is my main gig, but I haven’t seen a technology this transformative since the development of the internet, and I want to learn about it. I bet you do too.

So here we are! Sidequesting together.

If you like the newsletter, please share it with a friend.

* I dislike the term “hallucinate” because it implies these tools are thinking, but it seems to be the term that is becoming the standard for describing this kind of error.

Written by a human.