A ridiculously useful tip

A simple way to grab and wrangle messy data

Issue 67

On today’s quest:

— Grabbing messy data
— The model matters, part II
— Energy: What uses more?
— AI’s erratic energy use problem
— Webinar: Storysnap
— Meta is developing chatbots that message you first
— Calling all Sallys
— We are still underreacting on AI
— The AI horse is out of the barn in publishing
— 🤣 A trickster steals an AI band’s identity 🤣

Tip: Grabbing messy data

This tip comes directly from the Almost Timely newsletter by Christopher Penn, and I am beside myself with how useful it is!

If you need to gather messy data from online, do a screen recording of it, feed the video into an LLM that can interpret video (e.g. Gemini), and ask it to turn the data into JSON. 

Christopher’s example was gathering and analyzing data from Google Business Reviews, but my problem has always been Facebook comments. My most popular posts are the ones where I ask people to tell me something about how they use language. People love to comment, and then I create charts, which become popular follow-up posts.

For example, I ask whether people pronounce “coyote” with two or three syllables, and I can get hundreds — sometimes thousands — of comments.

[Screenshot: a Facebook comment thread about the pronunciation of “coyote.” One commenter says “kai-oat-ee” in most contexts but “kai-oat” for the animal, linking to a Britannica article on Coyote mythology; another replies that they say “kai-yote” but sometimes “kai-yo-tee.” Both comments are six years old, with names and profile pictures obscured for privacy.]

In the past, it has taken me hours to collect, collate, and analyze the comments because they’re in such a messy format. So I rarely do these posts even though people love them.

I just tried Christopher’s suggestion on comments from an old post like the one above, and it worked! This was my prompt:

Transcribe the Facebook comments in this video into JSON format. The required keys and values of the JSON format are: commenter name, comment verbatim text.
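
If you’d rather script this step than use the Gemini app, here’s a minimal sketch using Google’s generative AI Python SDK (the file name and model choice are my assumptions, not part of the tip):

import time
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")

# Upload the screen recording (file name is a placeholder)
video = genai.upload_file("coyote_comments.mov")

# Video uploads are processed asynchronously; wait until the file is ready
while video.state.name == "PROCESSING":
    time.sleep(5)
    video = genai.get_file(video.name)

model = genai.GenerativeModel("gemini-1.5-pro")
prompt = (
    "Transcribe the Facebook comments in this video into JSON format. "
    "The required keys and values of the JSON format are: "
    "commenter name, comment verbatim text."
)
response = model.generate_content([video, prompt])
print(response.text)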

In less than five minutes, I had structured data. I then asked Gemini to evaluate the answers and add a parameter tracking whether the commenter said they used 2 or 3 syllables. This was my prompt:

Each comment is attempting to answer the question of whether the commenter pronounces the word "coyote" with two or three syllables. Please attempt to interpret each comment and add a key and value for each comment called "syllables" with a value of 2 or 3 to indicate how many syllables the commenter uses.

This isn’t a straightforward task because people answer in all kinds of long and winding ways, but at first glance, Gemini didn’t seem to misinterpret a single one (it even coded a comment as “null” when the answer wasn’t clear, even though I didn’t include that in the instructions).
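
Once the “syllables” key is in place, turning the JSON into chart-ready counts takes only a few lines of Python. A sketch, assuming the annotated output was saved as comments.json (the file name and the example counts are made up):

import json
from collections import Counter

with open("comments.json") as f:
    comments = json.load(f)

# Tally how many commenters said 2 syllables, 3 syllables, or were unclear (null)
tally = Counter(c.get("syllables") for c in comments)
print(tally)  # e.g. Counter({3: 412, 2: 187, None: 9})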

Of course, I would still read all the comments to make sure the interpretation is correct and, more important, because I enjoy reading them; that’s how I understand what people are saying so I can write a post about the data. But this “one simple trick” will save me hours I would previously have spent on tedious data manipulation.

Technical Notes:

  1. I used QuickTime on the Mac for screen recording. It’s built-in.

  2. When I attempted the same thing with GPT-4o in ChatGPT, it did not work well at all, so the model matters here.

  3. If you want to get the JSON into a spreadsheet, you have to convert it to a .csv file (see the sketch below). You may want to play around with having Gemini just create a .csv file in the first place, but Christopher says JSON is the best format if you want to do further manipulation of the data in an LLM.
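
For the conversion in note 3, here’s a minimal Python sketch (the file names are my examples, and it assumes every record has the same keys):

import csv
import json

with open("comments.json") as f:
    comments = json.load(f)

# Write the list of JSON objects as rows in a spreadsheet-friendly .csv
with open("comments.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=comments[0].keys())
    writer.writeheader()
    writer.writerows(comments)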

Subscribe to the Almost Timely newsletter. It can be a bit technical at times, but you’ll be glad you did!

The model matters, part II

You may remember a viral story from back in June where ChatGPT lied (and apologized and lied and apologized over and over) to a woman who asked it for a review of her writing, insisting it could access online stories when it couldn’t.

Well, the Hybrid Horizons newsletter took the same prompts and fed them into a different model — OpenAI o3 — and got good results. No lies. No apologies.

None of the models are perfect — and it’s important to note that the original “experiment” was done when ChatGPT was having a huge problem with being too sycophantic, a behavior that has since been dialed back — but this is a good example of why it pays to understand how each model works and to choose the right one for the task.

Energy: What uses more?

[Screenshot: a digital carbon calculator comparing two tasks under identical settings (solar-powered California energy, a simple single zero-shot prompt, a cold data center in Norway in winter). Task 1, Zooming for 1 hour with 10 people, uses 2,000 Wh and produces 8,000 cubic centimeters of emissions; Task 2, a 3-second AI video from text, uses 500 Wh and 2,000 cubic centimeters. A small bar chart shows Task 1 with the significantly higher footprint.]

Jon Ippolito made a tool that directly compares the energy use of different activities in different scenarios. It has a rudimentary interface and seems to be designed for classroom use, but it’s fun and interesting to plug in different scenarios — like watching an hour of Netflix compared to generating a 3-second AI video from text — and then change the parameters, like the number of paragraphs of text or whether the data center is in a hot climate or a cool climate. You quickly realize there is no one simple answer.

AI’s erratic energy use problem

The amount of energy AI uses isn’t actually the only energy problem. The uneven nature of the demand is another. In an article in the Financial Times, Andreas Schierenbeck, chief executive of Switzerland-based Hitachi Energy, said AI training runs cause sudden spikes in energy demand at data centers that can be 10x the normal load. He says other industries aren’t allowed to use energy this way. For example, if you want to start a new smelter — another highly energy-intensive business — you have to clear it with local utilities and give them time to adapt. He says AI companies should be regulated the same way and should schedule training for times when renewable energy is plentiful.*

Webinar: Storysnap

The Editorial Freelancers Association is hosting a free webinar about Storysnap, an AI tool for evaluating manuscripts, on July 22 at 5 p.m. Eastern. The presentation is put on by the company, but it looks like an interesting product to understand. If you aren’t a member of the EFA, click “guest” on the registration form. h/t Katharine O’Moore-Klopf

Meta is developing chatbots that message you first

If you’ve used Meta’s AI Studio, chatbots may soon start messaging you on their own instead of waiting for you to send a message. Meta says the bots will only message someone who has sent at least five messages in the last 14 days and won’t send a second message if they don’t get a reply.

The goal is to keep you talking, and the ideal message they’re working on delivering will “reference something concrete from the user’s past conversations.” It reminds me a bit of the emails you get from companies when you abandon a shopping cart. Business Insider
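
Just to make the policy concrete, here’s how the two stated rules might look as logic (purely illustrative; the function and field names are my invention, not Meta’s code):

from datetime import datetime, timedelta

def bot_may_message_first(user_message_times, unanswered_followups):
    # Rule 1: the user sent at least five messages in the last 14 days
    cutoff = datetime.now() - timedelta(days=14)
    recent = [t for t in user_message_times if t >= cutoff]
    # Rule 2: no second message if the first follow-up got no reply
    return len(recent) >= 5 and unanswered_followups == 0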

Ad: AI with Allie

Since I’m a LinkedIn Learning instructor myself, I’ve seen how well they vet their courses, and I was impressed to see that Allie Miller is also a LinkedIn Learning instructor.

Expand What AI Can Do For You

Tired of basic AI prompts that don't deliver? This free 5-day course shows you how to create tools that actually address your problems—from smart assistants to custom software.

Each day brings practical techniques straight to your inbox. No coding, no fluff. Just useful examples to automate and enhance your workflow.

Follow-ups

I seem to have a lot of stories today that continue something I mentioned in a past newsletter:

Calling all Sallys?

On June 26, I mentioned that LLMs tend to overuse the names Emily and Sarah. Here’s another anecdote: a teacher says many student essays describing a time the writer witnessed discrimination featured a victim named Sally, leading the teacher to suspect the essays were written by AI.

We are still underreacting on AI

Also on June 26, I included a link to Pete Buttigieg’s newsletter post titled “We are still underreacting on AI,” and a reader asked about my thoughts on the essay. I didn’t have time to comment on it earlier, but I heartily agree. It’s why I spend hours a day reading and writing about AI when, clearly, I shouldn’t be because it’s not my main job. And I was heartened to see someone who might be in government again someday clearly articulate the situation. Yes, to all of this:

And when I say we’re “underprepared,” I don’t just mean for the physically dangerous or potentially nefarious effects of these technologies, which are obviously enormous and will take tremendous effort and wisdom to manage. But I want to draw more attention to a set of questions about what this will mean for wealth and poverty, work and unemployment, citizenship and power, isolation and belonging.

In short: the terms of what it is like to be a human are about to change in ways that rival the transformations of the Enlightenment or the Industrial Revolution, only much more quickly.

I recommend reading the whole thing.

The AI horse is out of the barn, off the farm, and over the river in publishing

On June 30, I posted a link to an open letter from hundreds of prominent authors demanding that publishers never use AI for anything. I’m not a fan of open letters or statements that include “never for all time” kinds of demands, so I didn’t write about it further.

But Jane Friedman’s response in her Bottom Line newsletter was well-reasoned, and I’d like to highlight it here. Among other things, she points out that AI is already being widely used in publishing, and publishers currently have no reliable way of detecting the AI writing the authors want them to reject.

I know I’m saying this a lot in this newsletter, but read the whole thing — and while you’re at it, subscribe to The Bottom Line.

🤣 A trickster stole an AI band’s identity 🤣

I love this story so much I almost made it the lead, but it’s only peripherally about AI, so I reined myself in. Here’s the deal:

On July 2, I included a link to a story about a band called Velvet Sundown that was knocking it out of the park on Spotify with what everyone was convinced was AI-generated music. At the time, nobody could reach the band, so a guy with a history of online stunts turned one of his old Twitter accounts into an account for the band and started filling the void. Thanks to both social engineering and expert-level social media instincts, he gained a huge following and did some high-profile media interviews.

It’s hilarious. I read it out loud to my husband, who was also cracking up, and I can’t get it out of my head. If I were still teaching social media to journalism students, this story would be required reading because it reveals how our systems are being gamed every day.

How much do you trust AI overviews?

I’m curious about something: If you use Google search, how much do you trust the AI Overviews that appear at the top of the search results?


Quick Hits

Philosophy

AI enables access (This line stood out to me: “I am doing what everyone else does, getting help. Mine just happens to be a chatbot, not a family friend in editorial.”) — The Bookseller

Other

How AI made me more human, not less (A woman’s story about using ChatGPT to cope and reengage with friends and family after an epilepsy diagnosis.) — New York Times

What is AI Sidequest?

Are you interested in the intersection of AI with language, writing, and culture? With maybe a little consumer business thrown in? Then you’re in the right place!

I’m Mignon Fogarty: I’ve been writing about language for almost 20 years and was the chair of media entrepreneurship in the School of Journalism at the University of Nevada, Reno. I became interested in AI back in 2022 when articles about large language models started flooding my Google alerts. AI Sidequest is where I write about stories I find interesting. I hope you find them interesting too.

* SMELTERS: As part of my research for this paragraph, I looked at how much energy smelters use, and wow — no wonder they want us to recycle aluminum cans!

Written by a human