Not because the code got worse. Because nobody understands it anymore.
We have a name for when code gets messy over time: technical debt. The industry spent decades building tools around it. SonarQube, code reviews, refactoring sprints, architecture decision records. We got decent at it.
AI broke that entirely. But not in the way most people think. The code isn’t messier. In some ways it’s cleaner than ever. The problem is that no one wrote it.
Here’s what that looks like in practice:
A junior dev uses Cursor to build a payment retry service. It works. Tests pass. It gets merged. Three weeks later, it starts double-charging customers at exactly 3am. The dev can’t debug it because they never understood the exponential backoff logic they accepted from the AI. Now they have to ask Claude to debug code Claude wrote. It suggests a fix. Nobody knows if the fix is right. They ship it anyway.
That’s not technical debt. That’s context debt.
It’s a new form of debt, and the research literature calls it comprehension debt. A 2026 arXiv paper defines it as “the growing gap between what a development team knows about their codebase and what they actually need to understand to maintain it.”
The key word is maintain. Not build. Not ship. And unlike technical debt, which lives in the code, comprehension debt lives in people’s heads. You can’t refactor it. You can’t lint it. Sonar can’t catch it. But the tests still pass.
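To make the payment retry story concrete, here is a minimal sketch of the kind of logic that gets accepted without being read. It is not the code from that story. The names `backoff_delays`, `charge_with_retry`, and the `charge` callable are hypothetical, and the idea that the payment provider deduplicates on an idempotency key is an assumption. The comments mark the questions a maintainer needs to be able to answer without asking the AI.

```python
from typing import Callable, List, Optional


def backoff_delays(base: float, factor: float, retries: int, cap: float) -> List[float]:
    """Seconds to wait before each retry: base * factor**n, never above cap."""
    return [min(cap, base * factor ** n) for n in range(retries)]


def charge_with_retry(
    charge: Callable[[str], str],      # hypothetical payment call
    idempotency_key: str,
    delays: List[float],
    sleep: Callable[[float], None],
) -> str:
    """Try charge(); on TimeoutError wait and retry until delays run out."""
    schedule: List[Optional[float]] = [*delays, None]  # None marks the final attempt
    for delay in schedule:
        try:
            # The same key goes out on every attempt. Whether the provider
            # deduplicates on it is an assumption someone has to verify.
            return charge(idempotency_key)
        except TimeoutError:
            # A timeout does not prove the charge failed. It may have gone
            # through with only the response lost, so a blind retry can
            # charge the customer twice.
            if delay is None:
                raise
            sleep(delay)
    raise RuntimeError("unreachable")
```

With `backoff_delays(1.0, 2.0, 4, 5.0)` the waits are 1, 2, 4, and 5 seconds (the last one capped). The loop is short and would pass review at a glance. The risk sits in the two comments, and tests written against a fake `charge` will not surface it.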
The data is already brutal
We built tools that write code faster than humans can understand it. Then we measured success by how fast the code shipped.
Pydantic put a name to what’s happening: the human reward function problem. Writing code by hand was never easy, but it was full of small wins. Cracking a gnarly bug. Finally understanding why something broke. That feedback loop was what kept you sharp. LLM-assisted programming automated exactly that part.
As Pydantic put it: “The satisfying part shrank. The exhausting part grew. And there are no new rewards to fill the gap.”
This is the debt nobody is paying down. Not because people don’t care. But because we have no way to track its accumulation.
Every metric we have assumes a human understood what they shipped. Lines of code? AI inflates it. Commit frequency? AI inflates it. Test coverage? AI writes the tests too. PR throughput is up. Deployment frequency hasn’t moved. The work looks bigger. The understanding isn’t.
And when that gap compounds long enough, you get something worse than a bug. You get a system nobody can explain.
Bus factor: zero
There’s a concept called the bus factor, the number of people who need to leave before a project collapses from knowledge loss. Most teams worry about a bus factor of one or two.
AI-generated codebases can have a bus factor of zero. Not because people left, but because the knowledge was never formed in the first place. No one needed to quit. The understanding just… wasn’t acquired.
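The definition above can be sketched in a few lines, under a deliberately simplified model: each module maps to the set of people who genuinely understand it, and the project "collapses" once any module has nobody left. The function name `bus_factor` and the example data are hypothetical.

```python
from typing import Dict, Set


def bus_factor(understands: Dict[str, Set[str]]) -> int:
    """Fewest departures that leave some module with nobody who understands it.

    Orphaning a module requires all of its knowers to leave, so the answer
    is the size of the smallest knower set across modules.
    """
    return min((len(people) for people in understands.values()), default=0)


# Hypothetical team: one module was generated and merged without anyone
# forming an understanding of it, so its knower set is empty.
team = {
    "billing": {"ana", "raj"},
    "auth": {"ana"},
    "retry_service": set(),
}

print(bus_factor(team))  # 0: nobody has to leave for the knowledge to be missing
```

Drop `retry_service` from the map and the result is 1, the number most teams already worry about. The empty set is the new case: no departure was needed to get there.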
When your team can’t confidently diagnose a production incident on a system they built (same engineers, no turnover, no attrition), that’s not a people problem. That’s a business continuity problem. It just doesn’t show up on any dashboard until something breaks.
More AI is not the answer
The instinct is to solve this with more AI: better context windows, smarter retrieval, bigger memory. Scaling in every direction but understanding.
But this isn’t an information problem. The code exists. The AI can read it. Write tests for it. Explain it. The problem is that no one is reading it.
The problem surfaces at 3am when prod is down, the AI’s suggestion looks plausible, and you have nothing in your head to evaluate it against. No ground truth. No intuition about why that service behaves that way under load. That’s where systems fail. Not in the happy path.
Different posture, 25-point gap
The developers who are holding up well aren’t avoiding AI. They’re using it differently. They prompt → read the output → ask why → deliberately break the part they don’t understand → rebuild it → then merge. Slower per commit. Much faster per incident.
Anthropic’s January 2026 study of 52 software engineers (mostly junior) found the AI group averaged 50% on the comprehension quiz. The hand-coding group: 67%. Nearly two letter grades of difference, on a quiz covering concepts they’d just used minutes before.
The most important finding wasn’t the average. It was the patterns. Developers who delegated code generation to AI scored below 40%. Developers who used AI for conceptual questions, asking why, not just what, scored 65%+.
And Anthropic added this footnote themselves: “this setup is different from agentic coding products like Claude Code, we expect the impacts on skill development are likely to be more pronounced than the results here.”
The study they ran was the conservative case. Let that sink in.
We measure what gets shipped. We don’t measure what gets understood.
That gap is going to define which engineering teams are actually durable in three years, and which ones just looked productive.
The measurement layer for this doesn’t exist yet. I think that’s the thing worth zooming in on.
Have you shipped something you no longer fully understand — and does that feel normal now?