Clever Solutions
Article

The Vibe Coding Bill Is Coming Due

Andrej Karpathy named "vibe coding" in February 2025 as a shorthand for prompt-driven development without close inspection. Eighteen months later, the public data on what that produces — security vulnerabilities, regression rates, review fatigue, CVE counts — is no longer ambiguous.

Who this post is for: if your business commissions software — from a vendor, an in-house team, a freelance developer, anyone — but you don't write the code yourself, this post tells you whether you should be worried about how that software is being built in 2026 and what to do about it. There is a vendor checklist and a self-check for non-technical readers further down.

In plain English: "vibe coding" is what happens when developers let AI write code without carefully checking what it produced (prompt, generate, glance, ship). Andrej Karpathy coined the term in February 2025 for a workflow he found delightful in prototypes, and he has been explicit in follow-ups that it is appropriate for "throwaway weekend projects" and "things you don't care about," and that production work requires a fundamentally different posture. The framing was precise. The internet, predictably, did not preserve the precision.

Eighteen months later, vibe coding is the dominant mode of AI-assisted development at a lot of companies that ship real software. By every public measure I can find, the bill is coming in.

One concession up front, because the rest of the post depends on it. Vibe coding is the right tool for prototypes, internal one-off tools, marketing pages, throwaway scripts, and exploratory work — exactly what Karpathy meant. The argument here is about what happens when the same posture gets used for production systems serving real users, customer-facing software, regulated workflows, or any code expected to live past next quarter. That is where the bill arrives.

This post is also not an argument against AI in software engineering. We are an AI-native company. Every line of code we deliver is built with frontier models in the loop. The argument is much narrower: vibe coding wasn't designed for engineering, the bill is now overdue, and the engineering practices that close the gap exist and are knowable.

The receipts

A caveat first: the studies below measure AI-generated code broadly, not vibe-coded code specifically. The two are not identical — a Cursor user who carefully reviews every change is producing AI-generated code without vibe coding it. But the causal arrow is real: when AI generation outruns human inspection, you get vibe-coded outcomes regardless of which tool produced them. The receipts measure what happens at scale.

Security. Veracode's 2025 GenAI Code Security Report found 45% of AI-generated code contains a known security vulnerability — measured across 100+ large language models tested in Java, JavaScript, Python, and C#. Two years of model improvements have not moved the number. The Cloud Security Alliance and Georgia Tech's Vibe Security Radar tracked publicly-attributed CVEs in AI-written code rising from roughly 18 cases across the back half of 2025 to 56 cases in Q1 2026 — with 35 in March 2026 alone, more than all of 2025 combined.

Quality and consistency. GitClear's longitudinal analysis of millions of commits in AI-heavy repositories shows refactoring activity has dropped from 25% of changed lines in 2021 to under 10% in 2024, while copy-pasted ("cloned") code has risen from 8.3% to 12.3% over the same period. CodeRabbit's State of AI vs. Human Code Generation report (December 2025) found AI-written changes contain 1.7× more issues than human-written changes, with 75% more misconfigurations and 2.74× more security vulnerabilities.

The ~50-point trust gap. The single most telling number: Stack Overflow's 2025 Developer Survey puts AI-tool adoption among developers at 84%, while only 33% actively trust AI output; the rest range from neutral to actively distrustful. The gap between "we use it" and "we trust it" (84 minus 33, or 51 points) is the shape of the problem in one statistic. LinearB's 2026 benchmarks across 8.1M+ PRs from 4,800 organizations measured AI-heavy teams experiencing 91% longer code-review times and AI-generated PRs waiting 4.6× longer to be picked up. Forrester's Predictions 2025 projects 75% of technology decision-makers will see their technical debt rise to moderate or high severity by 2026, driven specifically by AI-assisted development.

Figure: Industry adoption vs. developer trust (Stack Overflow 2025). 84% adoption, 33% trust, a 51-point gap. “We use it” minus “we trust it”: the entire problem in one statistic.

None of this should be read as "AI is a failure." All of it should be read as: the code that frontier models produce is shippable only inside a structure that catches what the model gets wrong. The structure is not optional, and most companies do not have one yet.

The expert consensus is structural

The clearest sign this is not a vendor-vs-critic argument is that the foundation-model companies agree. Anthropic, OpenAI, and Google DeepMind have all published research describing the structural limits of agentic coding without external constraints. The most credible practitioners arrive at the same place from different angles: Simon Willison on the necessity of the human staying actively in the loop, Charity Majors on the difference between syntax and production behavior, Steve Yegge on the review-bottleneck problem, Dan Luu on the hidden costs of generation outrunning governance.

There is no serious working engineer in 2026 claiming frontier models alone produce maintainable production software. The disagreement is entirely about what to do about it.

What governance for AI-generated code has to look like

The first instinct of most teams is to add more code review. This is exactly the wrong answer — review is precisely the bottleneck the data above shows breaking under AI volume. You cannot review your way out of a 91%-longer review-time blowout with the same number of reviewers.

The second instinct is to add lint rules and pre-commit hooks. Closer, but at the wrong moment in the workflow. Catching a violation at commit-time means the agent has already produced the bad pattern, the developer has already merged it into their working set, and unwinding it is now a refactor.

The structure that actually works has four properties. Mature AI-builder teams already do versions of all four, using Cursor rules, Claude Code hooks, AGENTS.md conventions, custom MCP servers, and established tools like Snyk, Sonar, Semgrep, and CodeQL on top. None of the individual patterns is new. The work is in packaging all four into a single delivery environment that ships them by default. At Clever Solutions, we call this environment CleverADE. The four properties any such environment must have:

  1. The agent gets corrected the instant it writes something wrong — not three weeks later in a pull-request review meeting. The cost of fixing a wrong pattern is lowest at the keystroke. (The technical version: a rules engine running at write-time in under 200 milliseconds. Cursor users do versions of this with .cursorrules. The more comprehensive the default ruleset, the less every project starts from zero — at Clever we ship a packaged ruleset spanning 20+ rule groups, covering security, ORM, layering, error handling, and project-specific patterns, on day one of every engagement.)
  2. The rules go beyond style — they encode actual behavior. "Customer credit card data never leaves this system." "Sessions expire after 30 minutes." "Every payment write is idempotent." These have to be written down once, machine-checked on every change, and travel with the code — not in a policy doc nobody reads. (The technical version: typed invariants in YAML, validated automatically on each file edit. At Clever we call these Intent documents.)
  3. Multi-file changes require a structured plan first. Anything touching more than two or three files has to start by identifying every affected file up front, with constraints per chunk that must pass before the next chunk runs. The "every pass quietly breaks something the previous pass relied on" failure mode cannot happen if every pass is gated. (At Clever we call this plan-first execution.)
  4. The rules and contracts ship in the client's repository alongside the code — not in tribal knowledge, not on the original developer's laptop, not in the vendor's wiki. When the project goes in-house or to another vendor, the standards that governed the build continue to govern maintenance.
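To make the first property concrete, here is a minimal sketch of a write-time check, assuming rules can be expressed as regular expressions over the text an agent has just written. The `Rule` and `Violation` types, the `check_edit` function, and the two sample rules are hypothetical illustrations; they are not CleverADE's ruleset, and they are not the format Cursor or any other tool uses.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """Hypothetical write-time rule: a regex that must NOT match new code."""
    rule_id: str
    pattern: re.Pattern
    message: str


@dataclass(frozen=True)
class Violation:
    rule_id: str
    line_no: int
    message: str


# Illustrative rules only (assumed for this sketch); a real ruleset is far broader.
RULES = [
    Rule(
        "sec/hardcoded-secret",
        re.compile(r"(?i)\b(api_key|password|secret)\s*=\s*['\"][^'\"]+['\"]"),
        "Credentials must come from configuration, not source code.",
    ),
    Rule(
        "orm/raw-sql-format",
        re.compile(r"execute\(\s*f['\"]"),
        "Build SQL with bound parameters, not f-strings.",
    ),
]


def check_edit(new_text: str, rules=RULES) -> list[Violation]:
    """Scan the text an agent just wrote and return every rule violation."""
    violations = []
    for line_no, line in enumerate(new_text.splitlines(), start=1):
        for rule in rules:
            if rule.pattern.search(line):
                violations.append(Violation(rule.rule_id, line_no, rule.message))
    return violations
```

The point of the design is where the function runs, not how clever the rules are: if a hook calls something like `check_edit` on every file write and feeds the violations straight back to the agent, the wrong pattern is corrected before it ever reaches a commit or a reviewer.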

In one sentence: the AI generates, the structure governs, and both are versioned together.
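Properties 2 and 3 above can be sketched together: a plan declares every affected file up front, and each chunk carries constraints (behavioral invariants) that must pass before the next chunk runs. This is a rough illustration under stated assumptions, not our production implementation. The workspace is modelled as an in-memory dictionary rather than files on disk, and `Chunk`, `Plan`, and `run_plan` are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import Callable

# Assumption for this sketch: a workspace is {path: file contents}.
Workspace = dict[str, str]


@dataclass
class Chunk:
    name: str
    files: list[str]                    # files this chunk is allowed to touch
    apply: Callable[[Workspace], None]  # performs the edit in place
    constraints: list[Callable[[Workspace], bool]] = field(default_factory=list)


@dataclass
class Plan:
    affected_files: list[str]  # declared up front, before any edit runs
    chunks: list[Chunk]


def run_plan(plan: Plan, ws: Workspace) -> list[str]:
    """Apply chunks in order and stop at the first one that fails its gate.

    Returns the names of the chunks that completed.
    """
    declared = set(plan.affected_files)
    for chunk in plan.chunks:
        undeclared = set(chunk.files) - declared
        if undeclared:
            raise ValueError(f"{chunk.name} touches undeclared files: {sorted(undeclared)}")

    done = []
    for chunk in plan.chunks:
        snapshot = dict(ws)
        chunk.apply(ws)
        changed = {p for p in ws.keys() | snapshot.keys() if ws.get(p) != snapshot.get(p)}
        in_scope = changed <= set(chunk.files)
        if not (in_scope and all(check(ws) for check in chunk.constraints)):
            ws.clear()
            ws.update(snapshot)  # roll back the failed chunk
            break
        done.append(chunk.name)
    return done
```

A constraint here is any function that inspects the workspace and returns true or false, so an invariant such as "sessions expire after 30 minutes" becomes a check that the configured timeout is at most 30. Because a failed gate rolls the chunk back and halts the plan, a later pass cannot quietly break something an earlier pass relied on.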

What this doesn't fix

To be honest about the limits, because anything that claims to "catch everything" catches nothing:

  • It does not catch business-logic bugs. If the requirement is wrong, the code that implements the wrong requirement will pass governance and ship.
  • It does not catch performance regressions that don't violate a written rule. Production-load behavior still needs observability, load testing, and human judgment.
  • It does not catch novel security classes that no rule has been written for yet. Governance protects against known patterns; new attack surfaces still require security review.
  • The governance system itself can be wrong. Rules and intents are authored by humans and rot as the codebase evolves. We treat governance maintenance as part of every engagement, not a one-time setup.

Questions to ask your software vendor this week

If you commission software from a vendor, freelancer, or in-house team rather than writing it yourself, these five questions cost nothing to ask and will tell you more in a 30-minute call than any vendor pitch deck:

1. "How much of the code in our project last quarter was AI-generated, and who reviewed it?"

  • Good answer: a real number ("roughly 60%"), with a description of the review process.
  • Bad answer: "We don't really track that," or vague reassurance about "extensive testing."

2. "What automated checks run on AI-generated code before it lands in our codebase?"

  • Good answer: names tools and patterns — write-time linting, security scanning (e.g., Snyk, Semgrep, CodeQL), behavioral contracts, automated tests, plan-first workflows.
  • Bad answer: "We have code review." Code review alone has been measured as insufficient at AI volume.

3. "When was our last security scan, and what did it find?"

  • Good answer: a recent date and a list of findings (even minor ones).
  • Bad answer: "Everything is clean," or "We can run one if you'd like."

4. "If your lead engineer on our project left tomorrow, how would the next person know the standards that govern our code?"

  • Good answer: written rules in our repo, intents/contracts checked into the codebase, a runbook.
  • Bad answer: "Our team would brief them," or "It's all in our internal wiki."

5. "Show me a recent change where your governance caught something the AI got wrong."

  • Good answer: a real example, a few sentences, possibly with a diff.
  • Bad answer: vague generalities. If they cannot name an instance, the governance is theoretical.

The Spaghetti Point self-check

Five yes/no signals that an AI-built project is heading toward what we at Clever call its "Spaghetti Point" — the moment the codebase becomes effectively unmaintainable without a structural reset:

  1. The vendor's velocity has dropped sharply in the last 4–8 weeks (the team that was shipping a feature a week is now shipping once a month).
  2. Bug-fix changes are touching files that have nothing to do with the bug being fixed.
  3. The same bugs keep coming back in slightly different forms.
  4. The vendor has started talking about a "rewrite," "refactor," or "second version" that wasn't in the original plan.
  5. No one on the team can confidently explain why a particular piece of the code makes the decisions it makes — including the people who built it.

Three or more "yes" answers and the project is at or past the Spaghetti Point. The cost of remediation rises sharply month-over-month from there.

Three ways to engage Clever

Three places to start, depending on where you are sitting:

  1. Free 30-minute conversation. Tell us where it hurts. We'll be honest about what we can and cannot fix, and whether Clever is the right team — or whether you should call someone else. No prep, no tech vocabulary required, no obligation. Book it.
  2. Paid second-opinion code audit for a project you suspect is past the Spaghetti Point. A senior engineer reviews the codebase, runs our governance scan against it, and produces a written assessment of where the project stands and what remediation would cost. If we conclude the project is fundamentally sound and you don't need us, we say so and walk away.
  3. New build with governance from day one. Focused workflows typically ship in weeks; multi-system builds in months — with the rules, contracts, and standards committed alongside the code. We'll tell you within 30 minutes of looking at it which bucket your project sits in, and scope cost and timeline on the first call.