Is AI Search Really As Bad As You've Heard?

What's with that new report showing error rates as high as 94%?

AI’s search for meaning and its ‘desire’ to make you happy

Last week, I got an answer from Perplexity that was absolutely filled with hallucinations, and it made me think of a recent article from The Conversation about a study showing that LLMs often find meaning where there is none.

The study looked at how meaningful phrases were, comparing meaningful pairs, like “beach ball,” to less meaningful pairs, like “ball beach.” They found that LLMs ranked “ball beach” as more meaningful than humans did.

The researchers posit that LLMs try hard to find meaning, which means they may be likely to fail at tasks that require them to recognize errors, like responding to an email accidentally sent to the wrong person or taking meeting notes about someone’s contribution that made no sense.

And I suspect this was the problem with my catastrophic Perplexity search. I had spent about 30 minutes trying to find an answer myself to the question of why we coined the term “caregiver” in the 1960s when the term “caretaker” had been used for years, and I couldn’t find anything.

I’d been looking for a reason to test Perplexity’s new Deep Research mode, and this seemed like a good use case.

Perplexity came up with an impressive research brief that had interesting, plausible reasons that people had coined the word “caregiver.” But when I checked the sources, only one of about 20 actually supported the points listed in the brief. The rest were shockingly irrelevant (although it did find one good source for a minor point that I did end up using).

My theory is that because there truly is no known answer to the question, it “wanted to make me happy” or “find meaning” and fabricated something.

A new study by the Columbia Journalism Review also seems to support this idea. They tested different AI search tools by giving them data with incomplete information. When asked to extract info from the data, all the tools made up information at an alarming rate:

  • Perplexity: 37%

  • ChatGPT Search: 67%

  • Grok: 94% 👀

Another surprising finding is that the paid versions performed worse than the free versions. It seems that if you’re paying, they try even harder to give you an answer even when there is none.

I remain wary of AI search. It can definitely be useful, but you must, must, must check every piece of information. (And yes, often this still saves time over searching the old way. What I worry about even more is the degradation of Google and the open web, which makes it harder to know what you can trust when you’re doing verification searches.)