Since AI coding tools swept through the industry, two statements have circulated among developers: "Code gets written much faster" and "But the codebase seems to be getting worse." Often both come from the same person. This article examines that disconnect.
One sentence runs through this entire piece:
AI only amplifies developers to the level they can already judge - the "visible level."
This isn't an argument against using AI. It's a characteristic that multiple studies repeatedly point to. Use it unknowingly, and debt quietly accumulates. Use it knowingly, and you can selectively capture the tool's benefits. What determines this divide is the developer's own judgment standards - the learning they've built up over time. Below, I'll organize public data across three areas: (1) productivity, (2) quality & security, and (3) technical debt, then add personal observations at the end.
The most frequently cited numbers come from a randomized controlled trial (RCT) published by GitHub and MSR/Microsoft in 2022. They had 95 developers implement an HTTP server in JavaScript, comparing a group that used Copilot against one that didn't; the Copilot group finished the task roughly 55% faster.
This study's limitation lies in its experimental design. The task was writing an HTTP server from scratch in a new file, with average work times of 1-2 hours and no existing codebase constraints. Simply put, these were the most favorable conditions for AI. While this "55%" has been the most frequently cited figure over the past three years, follow-up studies at the same scale are rare, and there has been ongoing criticism of quoting the number as a general metric without noting that it is a task-specific result.
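To get a feel for why such a task favors AI, here is a minimal sketch of a from-scratch HTTP server of the kind the study asked for - written in TypeScript purely for illustration; the actual study's task spec and code are not reproduced here.

```typescript
// Illustrative only: a self-contained HTTP server in a single new file,
// the kind of short, context-free task used in the GitHub 2022 RCT.
import { createServer, IncomingMessage, ServerResponse } from "node:http";

const server = createServer((req: IncomingMessage, res: ServerResponse) => {
  if (req.method === "GET" && req.url === "/health") {
    res.writeHead(200, { "Content-Type": "application/json" });
    res.end(JSON.stringify({ status: "ok" }));
    return;
  }
  res.writeHead(404, { "Content-Type": "application/json" });
  res.end(JSON.stringify({ error: "not found" }));
});

// No surrounding codebase to respect, no edge cases inherited from old code:
// everything the model needs to know fits in the prompt.
server.listen(3000);
```

Nothing here requires judgment about existing structure, which is exactly the condition the later studies do not share.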
A follow-up enterprise environment RCT by Accenture and GitHub in 2024 used 450 developers as the experimental group and 200 as controls.
This study also has methodological points to consider. Most metrics measure how much code gets submitted - throughput indicators like "PR count" and "merge count" - but don't show how long that merged code survives long-term. Critics pointed out that increased throughput and improved long-term quality are different questions.
Zoominfo's 400-person enterprise case study published in 2025 compiled numbers from actual development environments rather than controlled labs.
What stands out in Zoominfo's case is the surprisingly low acceptance rate. 67% of suggestions get rejected - a number that's closer to real-world usage patterns than lab RCTs.
There are results pointing the other way. Uplevel Data Labs' 2024 field observation report examined objective metrics (cycle time, PR throughput, bug rates, overtime hours, etc.) for about 800 developers at their client companies.
Uplevel interpreted this as "Copilot may negatively impact code quality." The measurement period was about 3 months, with developers doing their usual daily work - closer to a long-term field observation than to controlled lab tasks. The divergence from GitHub 2022's +55% likely comes from both the task characteristics and the observation period.
METR's randomized controlled trial (RCT) published in July 2025 gave 246 issues to 16 experienced open-source maintainers, randomly allowing or blocking AI use per issue. Tools were mainly Cursor Pro + Claude 3.5/3.7 Sonnet, working on old repositories the participants normally maintained.
They felt 20% faster but were actually 19% slower. Even after completing tasks, developers still thought they had gotten faster.
METR's February 2026 update provided different numbers:
METR also warns of possible selection bias, explaining that "the same developers likely improved their AI handling skills over the year."
The AI section of the 2024 Stack Overflow Developer Survey, a self-report survey of tens of thousands of developers worldwide, produced these numbers:
A year later, the 2025 Stack Overflow Developer Survey showed these changes:
In summary, usage increases while trust decreases - a trend that continued throughout 2024-2025.
The studies above don't contradict each other. They simply show different intersections of two axes: the judgment requirements of tasks and users' judgment standards.
Task axis. Writing an HTTP server in a new file (GitHub 2022) and fixing issues in old repositories full of context (METR 2025) place very different judgment demands on the user. The former has short requirements and no surrounding code constraints, so AI can complete it almost alone. The latter requires working within existing structures and edge cases, where the user must already know where to intervene.
Skill axis. The group that benefited most in GitHub 2022 was "developers with less experience." This seems to contradict this article's thesis that you need judgment standards for AI to amplify well. But a formalized, narrow task leaves little room for judgment in the first place, so novices' judgment gaps never get exposed. In contrast, METR 2025's old-repository tasks were exactly the conditions where the presence or absence of judgment standards became a key variable, and even skilled developers recorded -19%. The two results are different sides of the same mechanism.
Measurement metrics axis. There's a structural difference between self-reports (Zoominfo, Stack Overflow) and objective measurements (METR, Uplevel). Even the GitHub 2022 study showed divergence between subjective and objective metrics.
Ultimately, it's not a single proposition of "AI makes you faster/slower," but a conditional proposition where task judgment requirements intersect with user judgment standards. When high judgment requirements combine with shallow judgment standards, the quality, security, and debt results we'll see next start to emerge in earnest.
Unlike productivity, quality and security research results consistently lean in one direction.
The academic empirical study by Pearce et al. in 2021 was the first major paper to examine Copilot's security characteristics; roughly 40% of the generated programs it evaluated were classified as vulnerable.
"Vulnerable" here means reproducing defect patterns defined in CWE. This paper was later published in IEEE S&P 2022 and Communications of the ACM. The authors particularly emphasized that Copilot follows the context of users' surrounding code. If surrounding code already has security defect patterns, Copilot inherits and recommends those patterns. In vulnerable codebases, AI amplifies those vulnerabilities.
Perry et al.'s Stanford controlled experiment (2022, arXiv:2211.03622) shifted focus. Instead of AI itself, they looked at output from "people who used AI."
Actual security levels decreased while subjective confidence increased. The two directions were opposite.
A follow-up academic empirical study published in ACM TOSEM collected 733 snippets identified as Copilot-generated from public GitHub projects and ran static analysis.
Security platform vendor Apiiro's 2025 field observation report used their proprietary Deep Code Analysis engine to scan tens of thousands of repositories and thousands of developers at Fortune 50 companies. Keep in mind these numbers are based on the vendor's own standards.
Apiiro summarized this as "problems shifting from shallow syntax errors to deep architectural flaws." Shallow bugs get caught by linters and tests, but for architectural flaws there is virtually no one but humans to catch them. The direction aligns with the GitClear and Sonar results we'll see later, though specific magnitudes like 322% and 10x depend on Apiiro's proprietary measurement standards. Still, given that defect rates across four studies fall in the 24-40% range, and given Stanford's finding that AI users tend to believe their code is more secure, it's hard to support the expectation that better models will naturally solve this.
Commit analysis tool vendor GitClear's 2025 longitudinal analysis report tracked 211 million lines of code changes from 2020 to 2024.
GitClear summarized this trend as "incoming code becoming increasingly disposable."
Google's DORA reports measure software delivery performance annually based on surveys of thousands of developers.
Key numbers from the 2024 report:
In the 2025 report, some metrics flipped while others remained:
DORA summarizes this as:
"AI doesn't fix a team; it amplifies what's already there."
DORA's "AI amplification conditions" are classic software engineering fundamentals: platform engineering, automated testing, short feedback loops, and value stream management (VSM). Teams that introduce AI without these conditions see results skewing toward worse stability.
Static analysis tool vendor Sonar's State of Code Developer Survey 2025 - a developer self-report survey - produced these numbers:
However, the "one or more negative impacts" question is comprehensive, so the 88% figure doesn't directly indicate defect severity. Still, Sonar calls this the "great toil shift." Work weight moves from the coding side to the maintenance side.
Debt Behind the AI Boom: A Large-Scale Empirical Study of AI-Generated Code in the Wild, published on arXiv in 2026, is the largest commit-level study to date.
That last number is the most practically significant: about a quarter of the issues introduced by AI remain undiscovered or unfixed over time. Recently this has been discussed as "comprehension debt." When no one on the team wrote that code directly, the cost of opening it up and understanding it rarely feels worth setting other work aside for. The proportion of refactoring decreases, duplicate code and churn increase, stability stays negatively correlated, and issues that get in once tend to stay. Deployment volume rises while maintainability falls.
Put the results from all three areas together and they seem contradictory on the surface: productivity goes up or down depending on conditions, quality consistently trends negative, and debt accumulates. But the same mechanism lies behind the divergence.
AI amplifies the judgment standards developers already have. In areas where standards exist, it multiplies speed. In areas without standards, it pours out unvalidated code as-is.
What DORA said at the team level - "AI doesn't fix, it amplifies" - applies to individuals too. Stanford's AI users believing their code was more secure, METR's skilled developers misjudging their own time, and Stack Overflow's usage and trust moving in opposite directions all converge on the same point: not seeing the boundaries of one's own judgment standards.
From this point on, this isn't empirical measurement but subjective observation. As I've been using AI coding tools daily and talking with fellow developers, I often find that the direction these data points indicate aligns closely with my own sense of things.
First, there's a feeling that product release speed and product quality move in opposite directions. AI makes development faster, shortening release cycles, but the products that ship this way often seem rough around the edges. In frontend development, customers ultimately pay this price. I've noticeably seen more interfaces with janky interactions, broken consistency, and poor accessibility. Yet the industry-wide demand remains "ship faster." It's unclear who will maintain quality in this gap.
I've repeatedly watched junior developers lose opportunities to think deeply. During code reviews I encounter PRs with clean implementations, but when I ask "why did you write it this way?" no answer comes. It's obvious they read an AI response and submitted it as-is. The pressure to keep up with fast cycles cuts into the time to design a solution yourself, get stuck, and rewrite it. The insights that come from forming hypotheses and being wrong don't accumulate.
Research from Anthropic in 2026 puts numbers on this intuition. In an experiment on learning a new library, the group allowed to use AI averaged 50% on comprehension tests, while the group that coded from scratch averaged 67%. Within the AI group, those who delegated code generation entirely scored below 40%, while those who used AI only for conceptual questions scored above 65%. The AI group completed tasks about 2 minutes faster, but the difference wasn't statistically significant. Productivity barely improved while learning decreased. Anthropic calls the split "cognitive engagement vs. cognitive offloading": the former means solving the problem with your own mind and using AI only for explanations or hints; the latter delegates problem-solving entirely to AI and just takes the results. Even with the same tool, how you hold it determines what remains in your head.
As codebases grow, the areas AI can't properly see also expand. In structures spanning hundreds of files and multiple services, AI only looks around the given files. It focuses on the feature that needs to be added but doesn't touch the debt areas right next door. Sometimes it creates code that avoids that debt, leaving the original debt untouched while building new workaround logic. This pattern shows up frequently in reviews. The modified files are clean, but duplication or strange state management in adjacent files remains untouched. Features get added but the codebase becomes slightly more tangled.
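To make that workaround pattern concrete, here is a hypothetical TypeScript sketch - the helper names and scenario are invented for illustration, not taken from any real codebase:

```typescript
// Hypothetical example of "workaround logic" left next to existing debt.
// Existing shared helper with a known quirk (the debt): it silently drops
// the currency, and several older screens already depend on that behavior.
export function formatPrice(amount: number): string {
  return amount.toFixed(2); // debt: currency is ignored
}

// AI-generated code for the new feature: instead of fixing formatPrice (or
// extending it carefully), it reimplements formatting locally. The new file
// looks clean in review, the original debt stays, and the codebase now has
// two slightly different price formatters to keep in sync.
export function renderCartTotal(amount: number, currency: string): string {
  const formatted = new Intl.NumberFormat("en-US", {
    style: "currency",
    currency,
  }).format(amount);
  return `Total: ${formatted}`;
}
```

The new function reads fine in isolation, which is exactly why the review passes while the duplication quietly grows.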
The weaknesses in code I didn't write myself are also harder to see. With code I typed, I have a sense of where things are precarious; I test those parts more thoroughly and explain them once more in review. With AI-written code, if it reads fine I tend to submit it as-is, and once linting and tests pass, "looks good" becomes the default. There's another layer to this: delegating PR review of AI-written code to the same AI model. When the author and the validator are the same model, their blind spots are the same too. The range of missed defects overlaps perfectly, yet the PR carries the signal "AI checked it, so it's fine." The tendency Stanford reported - believing your own code is more secure - now happens for author and reviewer at once. The sense that debt accumulates undetected, visible only in the aggregate numbers, aligns exactly with the 24.2% persistence rate from arXiv 2603.28592 mentioned earlier.
Taken together, the question we should be asking isn't "how fast can we generate code with AI" but "where should we redirect the time that speed has freed up." Tools getting faster doesn't make my judgment any faster.
For the observations above to be an argument rather than just a statement, the "so what should we do" part has to be clear. My answer is learning. There are three reasons.
First, the range of problems I can name is the ceiling of AI's usefulness. Throw "fix the performance issues" at an AI and it doesn't know what to touch; it spends its time browsing the codebase before giving up. Narrow it to "fix the performance issue caused by bundle bloat from dependency duplication" and it quickly identifies the root cause and fixes it. "Fix the issue showing up in Sentry" produces far less useful results than "fix the concurrency issue on this screen." That specificity comes from my ability to diagnose the problem as "dependency duplication" or "a concurrency issue." The resolution of the instruction determines the resolution of the result, and that resolution comes from the range of names I can assign. Problems I can't recognize are hard to ask about precisely and hard to validate when the results come back. Where I stop, the ceiling of AI's usefulness stops too.
Second, the final line of defense against debt is the judgment of the person reading the code. The finding that 24.2% of AI-introduced issues survive without being fixed (arXiv 2603.28592) means there are real defects that linters, tests, and CI all miss. The last mechanism that can catch them is the person reading that code at PR time. The ability to break the overconfidence Stanford found in AI users' view of their own code also comes from knowledge, and from the habits of suspicion built on top of it.
Finally, learning compounds. When METR re-ran the same experiment with the same 10 developers a year later and the result moved from -19% to +18%, METR itself interpreted this as the participants' accumulated proficiency in using AI. In other words, the change happened on the human side during a period without major changes on the model side. The opposite direction can unfold just as quickly, and because the difference is built from small daily accumulation rather than one big push, missed stretches are hard to make up in a rush.
This isn't the time to marvel at how quickly you're churning out deliverables with AI. The gap METR measured - a subjective +20% against an actual -19% - points exactly here. When the feeling of "I produced this much today" is strongest, actual code quality and my own growth may well be moving in the opposite direction.
So "where to spend the speed gained from AI" becomes the next question. The common answer these days is parallelization. Open multiple AI sessions to work on multiple tasks simultaneously. Short-term output definitely increases, but none of the three axes above grow. Instructions become rougher, validation becomes shallower, and learning gets postponed. The opposite direction is "less but deeper." Focus on one or two tasks, doubt AI-generated results once more, directly examine structure and edge cases, and when unknown areas emerge, study just those parts separately. Spending the same time, this approach provides input to all three axes. What remains for individuals long-term is also this side.
The three don't move separately. When instruction accuracy rises, validation becomes easier; when validation becomes easier, learning accelerates; when learning compounds, the ceiling opens. If even one axis crumbles, the rest gets pushed along. There are limits to waiting for organizational-level improvements. The most realistic lever individuals can grab right now is learning.
The materials I've examined so far haven't all leaned one way. Evidence that "AI is fast" and evidence that "AI makes things slower" came out together in the same year. Only quality and security consistently show negative results, while technical debt indicators have only recently started aligning in the same direction. In that sense, it might be too early to draw conclusions.
Still, one thing emerges repeatedly from the middle of this contradiction: AI doesn't substitute for my level. Tools amplify me only to the extent of what I know, and in areas I don't understand they just pile up unvalidated code. There are only three possible responses: changing tools, changing processes, or expanding one's own judgment standards. The first two require organizational movement, which is slow. The last is the lever I can grab right now.
The remaining question is one line:
Where are the boundaries of my judgment standards, and who is expanding those boundaries right now?
Only the developer themselves can answer this question. The answer is still learning.