The new rules of prompting

See an example of how the new models require new approaches.

Issue #47

On today’s quest:

— Prompting is becoming more complicated
— AI is better than humans at writing some Wikipedia articles
— Microsoft Copilot wrongly attributes crimes to a court reporter
— AI detectors are still bad

Prompting is becoming more complicated

Free online chatbots like ChatGPT generally perform best when you give them what are called “chain of thought” prompts, which guide the LLM through multiple steps or include a line that says “think this through step by step.”
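Here’s roughly what that looks like in practice. The task and the wording are just my own illustration, not an official template:

```python
# A plain prompt vs. a "chain of thought" version of the same request.
# The task and the list of steps are illustrative, not a standard template.

plain_prompt = "Which is the better deal: $18 a month or $199 a year?"

chain_of_thought_prompt = (
    "Which is the better deal: $18 a month or $199 a year?\n"
    "Think this through step by step: first work out the yearly cost of the "
    "monthly plan, then compare it to the annual price, then say which option "
    "is cheaper and by how much."
)

print(chain_of_thought_prompt)
```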

But both AI advances and the availability of different models now require you to prompt in different ways.

For example, the small open-source LLM called Llama that I installed on my laptop (because it uses less energy and is more private) isn’t powerful enough to think things through step by step on its own. You need to give it each step in the process sequentially to get the best results.
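Here’s a minimal sketch of what that sequential prompting can look like, assuming the ollama Python package and a locally installed Llama model. The model name, the steps, and the wording are my own illustrative choices:

```python
# Feed a small local model one step at a time, carrying the conversation
# forward so each new prompt builds on the previous answer.
# Assumes the ollama Python package and a locally pulled Llama model.
import ollama

steps = [
    "Summarize the key facts in this announcement as bullet points: <paste text here>",
    "Now turn those bullet points into one plain-English sentence.",
    "Finally, shorten that sentence to under 15 words.",
]

messages = []
for step in steps:
    messages.append({"role": "user", "content": step})
    response = ollama.chat(model="llama3", messages=messages)
    # Keep the model's reply in the history so the next step can build on it.
    messages.append({"role": "assistant", "content": response["message"]["content"]})
    print(response["message"]["content"])
```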

On the other hand, the super-powerful new ChatGPT model called o1, which is currently available only to paid subscribers, is so good at “reasoning” that you’ll get better results if you don’t tell it to think step by step. OpenAI has published a guide and an example prompt on Reddit to help people prompt the new model (see below).
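To make the contrast concrete, here’s the same kind of request reworked for a reasoning model. The wording is mine, not OpenAI’s published example:

```python
# For a reasoning model like o1, the advice above boils down to stating the
# task plainly and leaving out the step-by-step scaffolding.
# This prompt is my own illustration, not OpenAI's published example.

o1_style_prompt = (
    "Which is the better deal: $18 a month or $199 a year? "
    "Answer with the cheaper option and the yearly savings."
)
# Note what's missing compared with the chain-of-thought version above:
# no "think this through step by step" line and no list of intermediate steps.
print(o1_style_prompt)
```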

All in all, figuring out which prompting format to use is becoming more complicated.

o1 Prompting advice from OpenAI

AI is better than humans at writing some Wikipedia articles

Researchers built a custom LLM designed to write Wikipedia articles about genes, which is apparently a big unmet need. Human experts evaluated a subset of the articles and found they were better than existing Wikipedia articles written by humans on the same topics — the LLM-written pieces had more citations and fewer “inaccuracies.”

The AI-written articles were also longer, although I’m not sure that is the same as better. They may have contained more information, or they may have just been wordy. It’s not clear.

However, it took the LLM, called WikiCrow, only a few days to write 15,616 articles, which the researchers estimated would have taken human writers 6.8 years.*

The authors cautioned against extrapolating their results, though:

“Although language models are clearly prone to reasoning errors (or hallucinations), in our task at least they appear to be less prone to such errors than Wikipedia authors or editors. This statement is specific to the agentic RAG setting presented here: language models like GPT-4 on their own, if asked to generate Wikipedia articles, would still be expected to hallucinate at high rates.” — The Signpost

Microsoft Copilot wrongly attributes crimes to a court reporter

A court reporter wrote about trials. Then an AI decided he was actually the culprit. “When German journalist Martin Bernklau typed his name and location into Microsoft’s Copilot to see how his articles would be picked up by the chatbot, the answers horrified him. Copilot’s results asserted that Bernklau was an escapee from a psychiatric institution, a convicted child abuser, and a conman preying on widowers. For years, Bernklau had served as a courts reporter and the AI chatbot had falsely blamed him for the crimes whose trials he had covered.” — Nieman Lab

AI detectors are still bad

In case you’re still not convinced, Christopher Penn posted an example on Threads showing an AI detector rating the Declaration of Independence as 97% AI-generated.


What is AI Sidequest?

Using AI isn’t my main job, and it probably isn’t yours either. I’m Mignon Fogarty, and Grammar Girl is my main gig, but I haven’t seen a technology this transformative since the development of the internet, and I want to learn about it. I bet you do too.

So here we are! Sidequesting together.

If you like the newsletter, please share it with a friend.

* The study was not peer reviewed and was funded by former Google CEO Eric Schmidt, but the author of the article thought it was a credible study because of its transparency and data availability.

Written by a human.