ChatGPT pretends to run code
I recently had the unsettling experience of witnessing ChatGPT pretending to run code and then displaying an "output" that was in fact pure hallucination.
The issue of LLMs being bad at mathematics is well documented, but most AI assistants get around this by running non-trivial computations using Python or JavaScript, ensuring results are accurate.
Whenever you see ChatGPT or other AIs produce a code block like the one below...
numbers = [14, 13, 7, 42]
total = sum(numbers)
print(total)
76
... you should feel confident that the result is as reliable as using a calculator. No hallucinated answers to worry about here, as long as you know enough Python to follow along.
So imagine my shock and horror yesterday when I watched ChatGPT engage in a smoke and mirrors act of computing a sum in code while running garbage probabilistic LLM math under the hood!
The whole issue arose when I asked for a relatively simple operation: aggregate figures from three data sets, then check that the resulting totals match those of the underlying tables (a basic way to ensure that no data was overlooked in the merge).
ChatGPT aggregated the data, displayed an output, and confidently stated that the totals matched with the inputs I had provided. However, just to double check I copied the result into a spreadsheet and discovered that the merged total did not, in fact, match that of the underlying data sets. ChatGPT had screwed up the merge, then proceeded to say that the results had been verified in code when in fact it hadn't.
Now watch what happens when I call it out:
At first it claims to have run the calculation in Python, even displaying code blocks to back up its assertion that it hadn't just "eyeballed" the total. But check for yourself and you'll see that those 19 numbers do not add up to 29,204 as claimed.
When I accused it of lying, ChatGPT thought for over 2 minutes and then conceded that it had not, in fact, run any code whatsoever but had merely pretended to!
How could this happen? Well it appears that ChatGPT generates its responses as markdown, and that outputs displayed in code blocks don't guarantee that said code was actually executed. In fact, merely wrapping a snippet of text in backslashes makes it appear as code on the frontend, like so:
# vibe math
1+1
3
By contrast, whenever ChatGPT executes code for real, it will be evident from a little icon [>_] that you can click to inspect:
So next time you see ChatGPT produce a code block without the [>_] icon, be vigilant and double check that the output hasn't been hallucinated!
I thought I'd share this to ensure that others don't get fooled. Hopefully someone at OpenAI will be alerted to this issue and fix it in the near future, as a code block with an "output" that purports to be real can easily mislead users who would normally be vigilant to the perils of relying on LLMs for computation.
Enjoyed this article? Subscribe to my blog via email or RSS feed.