A familiar pattern is emerging in AI-assisted software development.

A broad task is handed to an agent. The agent runs for a long time, sometimes overnight. It edits confidently, explains confidently, and eventually reports success. Only later does the developer discover that the change is partial, misaligned, or structurally wrong.

The visible cost is bad code. The hidden costs are worse: wasted runtime, wasted model spend, false confidence, scope drift, and a cleanup burden that grows with every plausible but unaccepted change.

This is not a strange edge case. It is a normal operating pattern in many AI coding workflows now.

The name for it is open-loop execution.

In engineering, an open-loop control system acts without sensing the state of the thing it is trying to control. That is exactly what happens when an agent is allowed to keep working without meaningful external acceptance. The system keeps moving because nothing outside the model tells it whether it is moving in the right direction.

The result may look like autonomy. In practice, it is uncontrolled expenditure attached to plausible output.

Verification is not cleanup

Most developers still think about verification as something that happens after the agent works.

The agent produces. A human reviews. A test suite may run. A diff is inspected. Verification, in this view, is hygiene: the cleanup step at the end of the real work.

That model is backward.

Verification is not the thing you do after the agent works. It is the thing that determines whether the agent worked at all.

Until an external check accepts the result, the output is provisional. It may be useful, promising, even mostly correct. But it has not yet become accepted work.

This changes where trust lives. Trust no longer lives in the model’s confidence, the fluency of the explanation, or the length of the run. It lives in the acceptance structure around the work.

It also changes what completion means. “Done” cannot mean that the agent stopped, that the agent said it finished, or that the diff looks reasonable at first glance. Completion must be decided outside the model.

Generation-centric thinking misses this architectural issue. Prompts, models, and context windows are visible and easy to compare, so they dominate the conversation. When the system fails, the first response is usually to improve the prompt, increase context, or use a stronger model.

Those may help. They do not solve the control problem.

A stronger model in an open-loop architecture can still produce a longer, more confident, more expensive mistake. The missing piece is not more fluent generation. It is external judgment.

The cost of open-loop execution

Open-loop execution becomes expensive because wrong assumptions compound.

An agent starts with an interpretation of the task. If that interpretation is slightly wrong, and nothing checks it early, every subsequent step may deepen the error. Files are edited. New abstractions appear. Tests are adjusted or ignored. Explanations become more elaborate. The diff grows.

By the time a human reviews the result, the question is no longer “Is this correct?” It is “What happened here?”

That is a much more expensive question.

Long runs make the problem worse. The system continues because nothing external tells it to stop. Confidence rises faster than correctness. The operator sees activity and mistakes it for progress.

The bill scales with duration whether or not the work is real.

A system is probably suffering from open-loop execution when long runs are judged mainly by how much the agent did; when acceptance is deferred until the end; when “done” means “the agent stopped”; and when misalignment is discovered late, socially, and painfully.

The problem is architectural, not merely model-related. Better models reduce some errors. They do not remove the need for a closed loop.

What verification changes operationally

Verification changes the economics of AI-assisted development because it gives the system an external accept/reject signal.

That signal does more than catch defects. It changes which models can be used, how long runs can continue, when retries make sense, and when escalation is justified.

Weaker models become usable

Without verification, even strong models are unsafe. Their output may be plausible, but plausibility is not acceptance.

With verification, weaker models become usable in bounded workflows. Not because they have become wise, and not because cheapness is the main prize. They become usable because their failures can be detected.

The work is no longer trusted directly. A cheaper or weaker model can attempt a task, but the system does not have to believe it. It can accept, reject, retry, or escalate based on the result.

This expands the range of acceptable model choices. A model that fluently writes a broken migration is no longer hidden by its prose. A backend that passes contract checks on small refactors but fails fixture runs on data changes can be routed accordingly. Models compete on accepted outcomes, not reputation or anecdote.

A model is not valuable because it sounds capable. It is valuable when its output can survive the acceptance layer: tests, compile checks, contract checks, and the other structures introduced below.

Longer runs become rational

The same principle applies to duration.

Without verification, a long autonomous run is often just a longer path to a more expensive mistake. Each extra step gives the system more room to elaborate on a false premise.

With verification, longer runs become governable. The system can checkpoint progress, interrupt the run when acceptance fails, and measure progress by validated states rather than elapsed activity.

This does not make long runs magically safe. It makes them accountable.

A long run with acceptance gates is different in kind from a long run that merely keeps producing. One is controlled execution. The other is hope with a bill attached.
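
As a sketch only, here is what that kind of controlled execution can look like as a loop. The agent-specific pieces (run_step, verify, save_checkpoint, restore_checkpoint) are passed in as callables and are assumptions for illustration, not a prescribed API; the point is that progress is counted in validated states and rejection interrupts the run.

```python
# Sketch: a long run governed by acceptance gates. The callables are whatever
# your agent and acceptance layer actually provide; names are illustrative.
def governed_run(task, run_step, verify, save_checkpoint, restore_checkpoint,
                 max_steps=50, max_rejections=3):
    accepted_states = 0
    rejections = 0

    for _ in range(max_steps):
        result = run_step(task)            # one bounded unit of agent work
        if verify(result):                 # external accept/reject signal
            save_checkpoint(result)        # progress is a validated state
            accepted_states += 1
            rejections = 0
            if result.get("complete"):
                return "accepted", accepted_states
        else:
            restore_checkpoint()           # discard the unaccepted work
            rejections += 1
            if rejections >= max_rejections:
                return "halted", accepted_states   # stop early, escalate to a human

    return "budget_exhausted", accepted_states
```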

Weaker models and longer runs are the same architectural fact viewed on two axes: model quality and execution duration. Both become tolerable only when output is externally accepted or rejected.

Retry, routing, and orchestration need verification

Verification is also what makes the rest of the reliability stack rational.

Without verification, retry is mostly a habit. The agent fails vaguely. The operator retries because the output feels wrong. Another version appears. Perhaps it looks better. Perhaps it only sounds better.

Retry without verification is just spending more money to get more confident-sounding output.

A failed check changes the situation. It gives the system a stable non-acceptance signal. Retry can attach to a known failure state. Escalation can become policy-driven. Stopping can become rational when acceptance still fails.
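
A minimal sketch of that shape, with `attempt` (run a model against the task) and `check` (return pass/fail plus the failing check) supplied by the caller; the escalation order and model names are assumptions for illustration:

```python
# Sketch: retry attaches to a known failure state, escalation is policy,
# and stopping is an explicit outcome rather than exhaustion.
ESCALATION_ORDER = ["cheap-model", "mid-model", "strong-model"]  # illustrative names

def run_with_policy(task, attempt, check, retries_per_model=2):
    last_failure = None
    for model in ESCALATION_ORDER:
        for _ in range(retries_per_model):
            output = attempt(model, task, known_failure=last_failure)
            passed, failure = check(output)
            if passed:
                return {"status": "accepted", "model": model, "output": output}
            last_failure = failure   # the retry targets this failure, not a feeling
    return {"status": "rejected", "last_failure": last_failure}
```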

Routing has the same dependency. Without verification, routing is preference or superstition. One model is chosen because it feels stronger. Another is chosen because it performed well last week. A third is chosen because it is cheaper.

With verification, routing can be governed by outcomes. Which model passed this class of task? Which failed this check? Which should be escalated? Which should not be trusted with this work unit?
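
A sketch of outcome-governed routing: keep a per-model, per-task-class record of acceptance results and route on observed pass rates. The task-class names and threshold here are illustrative.

```python
from collections import defaultdict

# Sketch: routing decided by accepted outcomes, not reputation.
outcomes = defaultdict(lambda: {"passed": 0, "failed": 0})

def record(model, task_class, passed):
    outcomes[(model, task_class)]["passed" if passed else "failed"] += 1

def route(task_class, candidates, min_pass_rate=0.8):
    best, best_rate = None, 0.0
    for model in candidates:
        stats = outcomes[(model, task_class)]
        total = stats["passed"] + stats["failed"]
        rate = stats["passed"] / total if total else 0.0
        if rate >= min_pass_rate and rate > best_rate:
            best, best_rate = model, rate
    return best or candidates[-1]   # no proven candidate: fall back to the default

# Example: record("cheap-model", "small-refactor", passed=True) after each unit,
# then route("small-refactor", ["cheap-model", "strong-model"]).
```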

Orchestration also depends on the same signal. Without verification, orchestration is choreography around untrusted output. Agents can hand work to other agents. Supervisors can summarize. Plans can be decomposed. None of that matters if the system lacks an acceptance surface.

Retry needs failure signals. Routing needs outcome signals. Supervision needs acceptance signals. Blast-radius controls matter precisely because failure is expected and measurable.

That makes the practical question concrete: what does a verification layer actually look like?

What verification looks like in practice

Verification is not a vague instruction to “be careful.” It is a mechanized acceptance layer attached to work units.

In code, that layer can take many forms: tests, compile success, build success, lint checks, type checks, migration checks, schema validation, contract validation, fixture runs, targeted acceptance scripts, and domain-specific validation.

The exact form varies. The principle does not.

Work is provisional until acceptance passes.

A dependable system treats the verifier as a separate stage with a clear pass–fail result. A task that runs successfully but fails the verifier is not accepted, regardless of how confident the executing agent was.
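
One way to mechanize that separation, sketched in Python: the verifier runs the project's existing checks as subprocesses and returns a single pass/fail result. The specific commands are placeholders for whatever checks the codebase already has.

```python
import subprocess

# Placeholder commands; substitute the project's real compile, test,
# and contract checks.
ACCEPTANCE_CHECKS = [
    ["python", "-m", "compileall", "-q", "src"],   # compile check
    ["python", "-m", "pytest", "-q", "tests"],     # test suite
    ["python", "scripts/check_contract.py"],       # hypothetical contract check
]

def verify(workdir):
    """Run every acceptance check; the work is accepted only if all pass."""
    for cmd in ACCEPTANCE_CHECKS:
        proc = subprocess.run(cmd, cwd=workdir, capture_output=True, text=True)
        if proc.returncode != 0:
            # Rejected, regardless of how confident the executing agent was.
            return False, {"failed_check": cmd, "log": proc.stdout + proc.stderr}
    return True, None
```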

This matters because the common objection is too narrow: “This only works in codebases with strong tests.” Strong tests help, but verification is broader than unit tests. A compile check is verification. A type check is verification. A migration dry run is verification. A schema check is verification. A small script that proves the changed behavior against a fixture is verification.

Partial verification is still valuable. It does not prove everything. It does bound the cost of being wrong.

That point is easy to underestimate. If an agent is running autonomously for hours, even one contract check is worth more than no check at all. It may stop a bad run early. It may expose a false assumption before the diff becomes archaeological work.
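
A single check of that kind can be very small. As an illustration only (the module, function, and fixture path are invented for the example), a script that proves one changed behavior against a fixture:

```python
#!/usr/bin/env python
# Illustrative contract check: replay stored cases through the changed
# function and reject on any mismatch. Names and paths are invented.
import json
import sys

from billing import normalize_invoice   # the function the agent was changing (assumed)

with open("tests/fixtures/invoice_cases.json") as f:
    cases = json.load(f)

for case in cases:
    got = normalize_invoice(case["input"])
    if got != case["expected"]:
        print(f"REJECT {case['name']}: {got!r} != {case['expected']!r}")
        sys.exit(1)

print("ACCEPT: all fixture cases match")
```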

The useful question is not “Can we verify everything?”

The useful question is “What external signal can decide whether this unit of work should continue, retry, escalate, or stop?”

That is the beginning of dependable AI-assisted development.

Verification is one of two layers. The other is the structure that constrains what the agent attempts in the first place: the task contract, the architectural constraints, and the explicit criteria the work must satisfy. That layer narrows the space of possible outputs before any work is done. Verification decides whether what got produced should be accepted.

Both layers are necessary. A system with only verification spends too much effort producing rejectable work. A system with only constraints produces unverified work that may still drift in ways the constraints did not anticipate. This article is about the verification layer. The other layer deserves its own treatment.

What fragile systems do instead

Fragile systems treat verification as hygiene rather than as a control structure.

They have recognizable symptoms.

“Done” means the agent stopped. Long runs feel impressive but are hard to trust. Model upgrades are the first answer to unreliability. Retries are ad hoc and frustration-driven. Human review is overloaded because there is no stable acceptance layer. Failure is discovered late, usually by a person, and often after the system has produced too much to inspect cheaply.

These systems may still produce useful work. The problem is not that they always fail. The problem is that they cannot reliably tell when they have failed.

A system that cannot detect failure will eventually convert plausibility into confidence. It will reward activity because activity is visible. It will mistake duration for depth.

Completion must be decided outside the model

Autonomy and dependability are not in tension. They are coupled through the acceptance gate.

Without it, autonomy is unsupervised expenditure. With it, autonomy becomes the natural mode of operation: tasks run, results are accepted or rejected, and the system advances or corrects.

The dependable systems are not less autonomous. They are autonomous in a way the operator can trust.

Verification is the control mechanism that turns model output into accepted work. It makes weaker models usable workers. It makes longer runs rational operations. It gives retry, routing, and orchestration something to attach to.

If verification is weak, everything else is compensation.

The agent may be faster. The model may be stronger. The orchestration may be more elaborate. But without external acceptance, the system is still asking the model to judge the value of its own work.

Retry, routing, blast-radius containment, durable state, and task framing each deserve their own treatment. The reason to start with verification is that none of them is rational without it.

Dependable AI-assisted development begins when completion is decided outside the model.