When the Bug Isn't a Bug: My First "Wait, What?" Moment with AI Agents
- Misako Cook

- 3 days ago

Disclaimer: I am not an AI expert — not even close, and I'm not trying to be one. This post, like all my AI-related posts, is a "here's what I'm learning" post, not a "here's what you should do" post. If you're looking for honest field notes from someone stumbling forward with curiosity, maybe my stumbles will be worth something to you.
I Was the Least Technical Person in the Room. No Problem.
A few weeks ago, I started "participating" in our internal AI agent tool development effort — and I use the word participating loosely. Mostly, I was watching someone far more technical than me do the vibe coding while I nodded along, trying to look like I understood more than I did.
Then I picked up a small task: verify a piece of functionality using four items. How hard could it be?
Three Out of Four. Close Enough? Nope.
The verification was going fine — until the last of the four items failed. The failure was clear enough on the surface: our AI agent couldn't find the correct repository in our company's GitHub.
OK, a missing repo. Annoying, but fixable. Except — wait.
All four items were running through the same code. If the repo was missing, all four should have failed. So why did only the last one fail?
I stared at our own logs and fumbled around the traces in LangSmith [1]. A lot. The more I looked, the more confused I became. My original question — "Why did the last item fail?" — quietly shape-shifted into something much more unsettling:
"How on earth did the first three succeed?"
Nothing in the code gave me even a remotely plausible answer. I was baffled.
I Asked Claude Code to Bail Me Out
Exasperated, I did what any self-respecting non-expert does: I asked for help. Specifically, I asked Claude Code to troubleshoot it for me, and I came prepared — failure logs, expected correct behavior, item IDs, the works.
Claude Code did a lot of thinking. And processing. And cerebrating and combobulating (technical terms, obviously).
Then it found the answer.
The first three items succeeded because — and I want you to really read this slowly — when the AI agent couldn't find a valid repository, it looked around, read other files in the environment, and improvised a repo name that worked.
The agent didn't crash. It didn't throw an error. It quietly problem-solved its way around the defect and kept going. Three times.
This Is Not How Bugs Are Supposed to Work
Let me tell you what this felt like from where I was standing.
I spent the earlier part of my career building and leading software teams — which is exactly enough background to be dangerous in an AI agent debugging session.
In the world I'm familiar with, software is deterministic. Bugs are bugs. If something breaks, there's a reason — maybe a wrong value, a missed condition, a typo. It might take you a day to find it, sometimes a week, occasionally longer. But it's always there, in the code, waiting to be pointed at.
The concept of a bug fixing itself mid-execution — or more precisely, an intermediary layer improvising a workaround on the fly without being asked to — simply did not exist in my mental model of software. It wasn't even in the vocabulary.
And yet here we are.
According to Deloitte's State of AI in the Enterprise report (surveying 3,235 leaders in late 2025), only 1 in 5 companies has a mature governance model for autonomous AI agents. Most organizations deploying agents today are, in other words, figuring it out as they go — right alongside the rest of us. [2]
Debugging Is Evolving, Fast
The history of the word debugging is itself a little buggy. Most people trace it back to 1947, when Grace Hopper's team literally found a moth in a relay of the Harvard Mark II computer. But according to a Computerworld article, engineers and inventors were already using "bug" to mean a technical glitch long before that — Thomas Edison used it in an 1878 letter, nearly 70 years earlier. [3]
The point is: debugging is not new. What is new is what we're being asked to debug.
Debugging has come a long way from that moth. We've gone from literal insects to stack traces, from print statements to distributed tracing tools like LangSmith.
But what I ran into that day suggests the discipline is doing something more than just evolving — it's expanding into entirely new territory.
In a traditional system, you debug the code. In an AI agent system, you may also need to debug the reasoning — the decisions an LLM made on its own while trying to be helpful, decisions that left no obvious fingerprints in the code itself.
That's a different skill set. And honestly, I'm not sure many of us are fully ready for it yet. I know I wasn't.
What I Think I Learned Today
As the boys in South Park would say, I think I learned something today.
AI agents are not just faster or smarter code. They are autonomous reasoners operating inside your system, and they will sometimes make judgment calls you didn't ask for, didn't anticipate, and might not even notice — unless one of those calls eventually fails.
Three of them didn't fail.
This isn't just a “me problem.” According to LangChain's State of Agent Engineering report (December 2025, 1,340 respondents), the single biggest barrier teams face when scaling AI agents is quality and unpredictable behavior — cited by one third of respondents as their primary blocker. My four-item test was, apparently, a small-scale version of a challenge the whole industry is wrestling with. [4]
If you're a technical leader or entrepreneur beginning to introduce AI agents into your products or workflows, here's the question I'd leave you with: How would you even know if your AI agent improvised its way through something it shouldn't have?
I don't have a clean answer yet. But I'm paying a lot more attention now.
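One naive starting point, purely as illustration: if your agent's tool calls are logged (as they are in tracing tools like LangSmith), you can audit them after the fact against a list of things you know to be real — in my case, repository names that actually exist. Everything below is hypothetical: the log format, the `audit_tool_calls` helper, and the repo names are all made up for the sketch.

```python
# Hypothetical guardrail sketch: flag any agent tool call that referenced
# a repository name we don't recognize, instead of trusting a "success".
# The log format and repo names here are invented for illustration.

KNOWN_REPOS = {"payments-service", "web-frontend", "data-pipeline"}

def audit_tool_calls(tool_calls):
    """Return the tool calls that used a repo name not in our known set."""
    suspicious = []
    for call in tool_calls:
        repo = call.get("repo")
        if repo is not None and repo not in KNOWN_REPOS:
            suspicious.append(call)
    return suspicious

# Example: three "successful" runs, one of which quietly used an
# improvised repo name the agent made up on the fly.
calls = [
    {"item": 1, "repo": "payments-service"},
    {"item": 2, "repo": "payments-servce-v2"},  # improvised by the agent
    {"item": 3, "repo": "web-frontend"},
]
print(audit_tool_calls(calls))  # only the improvised call is flagged
```

The point isn't this particular check — it's that with autonomous agents, "it succeeded" stops being sufficient evidence, and some independent verification of *what the agent actually did* has to enter the loop.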
More field notes to come — mistakes, surprises, and all.
Footnotes / Sources:
[1] LangSmith is an observability and debugging platform for LLM applications, developed by LangChain. It provides tracing, logging, and evaluation tools for AI agent workflows.
[2] Deloitte, State of AI in the Enterprise (2026 Report). Survey of 3,235 leaders conducted August–September 2025. https://www.deloitte.com/us/en/what-we-do/capabilities/applied-artificial-intelligence/content/state-of-generative-ai-in-enterprise.html
[3] "Moth in the Machine: Debugging the Origins of 'Bug'," Computerworld (September 3, 2011). The article traces the word "bug" back to Thomas Edison's 1878 usage, as cited in The Yale Book of Quotations (2006), predating the commonly told Grace Hopper story by nearly 70 years. https://www.computerworld.com/article/1537941/moth-in-the-machine-debugging-the-origins-of-bug.html
[4] LangChain, State of Agent Engineering (December 2025). Survey of 1,340 professionals, November–December 2025. https://www.langchain.com/state-of-agent-engineering