Subagent Driven Development: What the research actually shows
For almost four years now I've been shipping code with agents — from autocomplete in VS Code, to full coding agents like Claude Code, to advanced patterns like subagent-driven development and agent teams.
When I started, context windows were tiny. That constraint pushed tool builders to invent subagents — agents spawned from the main conversation to do work in clean context windows, unpolluted by the main thread, in an attempt to minimize hallucination and maximize task-completion accuracy.
After several months of primarily using subagent-driven development (over single-session development) with the Superpowers plugin, I started to qualitatively notice a significant increase in error rates combined with a meaningful drop in code coherence across different modules — even across the simplest applications. This was on Superpowers 5, which added subagent-driven development as the mandatory default on capable harnesses and introduced mitigations against recursive subagent spawning — a sign that even the framework's author was seeing coordination problems.
This is all while running code-review and /simplify as often as possible (which are also subagent-driven), mind you.
So I did some research. I was unsurprised to find significant evidence for code quality degradation in subagent-written code. The quality drop seems to be primarily driven by a lack of specific architectural and tacit context — the accumulated work and decisions from earlier in the session that the subagent simply never sees.
Put another way: subagents spawn with no prior memory, do work, and return the result. They don't have broader access to the main conversation, folders of .md context written by compound engineering skills, architectural patterns, code patterns, or anything else. They execute. Return code. That's it.
There are excellent use cases for subagents where the task doesn't have a knowledge-graph dependency on prior information. Web search, code review, file exploration — these are prime candidates. But for implementation work where every function needs to understand the shape of what came before it, the isolation that makes subagents clean is the same thing that makes them unreliable.
What follows is a summary of what I found in the literature — academic papers, production failure analyses, benchmark data, and practitioner reports — and what it means for how we're architecting agent interoperability at LORF.
The metric I care about is the rate of error-free code that maintains feature and functionality contracts. Not speed. Not completeness. Not impressiveness. Does the code do what it was supposed to do, without breaking what was already there?
#What the failure research says
The most rigorous analysis I found is a 2025 study from UC Berkeley by Cemri, Pan, and Yang. They analyzed over 1,600 multi-agent system traces across ChatDev, MetaGPT, HyperAgent, and other production systems, and categorized every failure they observed. The study has since been independently analyzed by Augment Code, Galileo AI, and Raghunandan Gupta, all arriving at similar conclusions.
The failure rates ranged from 41% to 87% depending on the task domain. That's a wide range, but even the low end is striking — these are purpose-built systems, not toy demos.
More interesting than the overall rate is where failures come from:
Percentage of total failures by category across 1,600+ MAS traces
Source: Cemri et al., 'Why Do Multi-Agent LLM Systems Fail?', UC Berkeley, 2025
The thing that stood out to me: nearly 80% of failures come from specification and coordination problems. Not model capability. Not context window limits. Not rate limiting or infrastructure. The system design itself is the primary source of error.
This matters because most people trying to improve multi-agent reliability focus on getting a smarter model or a bigger context window — which addresses the smallest failure category.
The sections below walk through the full taxonomy of where things break down, drawn from the Berkeley study and practitioner reports.
#The specific failure modes I found
I want to walk through these in some detail, because understanding how subagent systems break is more useful than just knowing that they break.
#Context loss between agents
This is the most well-documented failure mode, and I think the most important one for anyone building delegation systems.
Every subagent starts fresh. The parent agent has to manually construct the context each worker needs — the relevant parts of the codebase, the design decisions made so far, the conventions in use, the constraints from other parts of the system. Whatever the parent doesn't include, the subagent doesn't know.
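To make the boundary concrete, here's a minimal sketch of that hand-off. The names (`DelegationContext`, `build_subagent_prompt`) are hypothetical, not any framework's actual API — the point is that the subagent's entire world is whatever the parent serializes into the prompt:

```python
from dataclasses import dataclass, field

@dataclass
class DelegationContext:
    """Everything a subagent will know. Nothing else crosses the boundary."""
    task: str
    relevant_files: list[str] = field(default_factory=list)
    decisions: list[str] = field(default_factory=list)    # design decisions made so far
    conventions: list[str] = field(default_factory=list)  # naming, error handling, etc.

def build_subagent_prompt(ctx: DelegationContext) -> str:
    # The worker sees only this string. A decision the parent forgets to
    # include here simply does not exist for the subagent.
    sections = [f"TASK:\n{ctx.task}"]
    if ctx.relevant_files:
        sections.append("FILES:\n" + "\n".join(ctx.relevant_files))
    if ctx.decisions:
        sections.append("PRIOR DECISIONS:\n" + "\n".join(ctx.decisions))
    if ctx.conventions:
        sections.append("CONVENTIONS:\n" + "\n".join(ctx.conventions))
    return "\n\n".join(sections)
```

Notice the failure mode is silent: omitting `decisions` doesn't raise an error, it just produces a shorter prompt — and a worker that confidently reinvents choices the session already made.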
Anthropic's own Agent Teams documentation is honest about this limitation.
The Berkeley study broke this down into specific sub-modes: unexpected conversation resets (2.2% of all failures), proceeding with wrong assumptions instead of asking for clarification (6.8%), gradual task derailment (7.4%), and one agent withholding information that another agent needs (0.85%).
These numbers look small individually, but they compound. The researchers documented cases where a correct solution proposed in an early round was lost during summarization before being passed to the next agent — what Fan et al. describe as a snowball effect, where a small context gap causes irreversible downstream errors.
A single session doesn't have this problem. It has full access to its entire conversation history. Every decision, every constraint, every edge case lives in the same context window.
#Specification ambiguity
The parent agent has to decompose work into task descriptions that are precise enough to prevent scope creep yet complete enough to be actionable on their own. In practice, this is genuinely hard.
The Superpowers framework's spec reviewer is designed around the assumption that subagents will regularly go off-script. The reviewer's instructions explicitly say not to trust the implementer's report, to read the actual code line by line, and to check for both missing requirements and extra work that wasn't requested.
That level of adversarial verification exists because the problem is frequent. Steve Kinney documents several of these patterns in his sub-agent anti-patterns guide: inconsistent activation (agents ignore their brief unless you name them explicitly), state loss on rejection (declining a draft spawns a fresh agent that loses all prior context), and non-deterministic variance (the same task with the same agents yields different results run to run).
When a subagent has partial context and room for interpretation, it fills gaps with its own assumptions. The framework handles this through review gates — but each review is itself a subagent with limited context, so the review process has the same failure modes it's trying to catch.
#The executability gap
This one surprised me. A comprehensive empirical evaluation across seven agent frameworks found that multi-agent systems and single-agent systems have an interesting tradeoff: multi-agent produces more complete code (it addresses more of the stated requirements), but single-agent produces more executable code (it actually runs).
Normalized scores across evaluation dimensions (higher is better)
Source: Synthesized from empirical evaluations across multiple studies (2025–2026). Illustrative.
The researchers attributed this to inter-agent hallucinations (Zhang et al., 2025) — when multiple agents interact, the increased token volume can exceed context limits, causing information loss. And inherent hallucination rates compound across agents: each agent introduces a small error probability, and those probabilities multiply across the chain.
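The compounding effect is just probability arithmetic. A back-of-envelope model (assuming independent, identical per-agent error rates — a simplification, not a figure from the cited studies) shows how fast it bites:

```python
def chain_success_probability(per_agent_error: float, n_agents: int) -> float:
    """If each agent independently introduces an error with probability p,
    the chance the whole chain stays error-free is (1 - p) ** n."""
    return (1.0 - per_agent_error) ** n_agents

# A modest 5% per-agent error rate compounds quickly:
for n in (1, 3, 5):
    print(n, round(chain_success_probability(0.05, n), 3))
# 1 0.95
# 3 0.857
# 5 0.774
```

Five agents at 95% individual reliability leave you with roughly a three-in-four chance of an error-free chain — before any coordination failures are counted.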
This is a meaningful finding for the metric I care about. If the question is "does it do what it's supposed to do," multi-agent systems generate code that looks right on paper but fails when you run it. They're better at checking boxes on a requirements list, worse at producing something that actually works.
#Token economics
The cost dimension is worth looking at honestly, because it's not just a budget concern — it's a signal about how much redundant work the system is doing.
A 2025 analysis found that multi-agent systems consume 4–220x more input tokens than their single-agent counterparts. Even in an ideal scenario with perfect context reuse (which never happens), MAS still requires 2–12x more tokens.
Relative token usage (single agent = baseline 10K tokens). Growth is superlinear due to context reconstruction.
Source: Aggregated from practitioner reports and MAS efficiency studies.
The reason the growth is superlinear rather than linear: context reconstruction. When agent B needs to understand what agent A did, it doesn't just receive a summary — it needs enough context to reason about the work. That context includes the original task description, the relevant codebase sections, and the intermediate results, all re-encoded into a new context window. As the number of agents grows, the amount of redundant context encoding grows faster. Augment Code's analysis of multi-agent coding systems reports that multi-file tasks drop to around 19% accuracy compared to roughly 87% for single-function tasks — largely because bigger tasks exceed the agent's effective working set.
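A toy cost model makes the superlinearity visible. The numbers and the linear-handoff assumption are illustrative, not drawn from the cited studies:

```python
def total_tokens(n_agents: int, base_context: int = 10_000,
                 handoff_overhead: int = 4_000) -> int:
    """Toy model: every agent re-encodes the base context plus the
    accumulated handoff material from all agents before it."""
    return sum(base_context + i * handoff_overhead for i in range(n_agents))

# One agent pays the base cost once; four agents pay it four times,
# plus a quadratically growing pile of re-encoded handoffs.
print(total_tokens(1))  # 10000
print(total_tokens(4))  # 64000
```

The handoff term grows with the square of the agent count (it's an arithmetic series), which is why adding a fourth agent costs more than adding the second did.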
Practitioners report that a 4-agent parallel session on a large codebase can consume 300K–500K tokens in under 30 minutes, which hits Claude Pro rate limits mid-session.
#Where multi-agent does outperform
I want to be careful here, because the research isn't one-sided. There are specific patterns where multi-agent setups produce measurably better results.
The clearest evidence comes from AgentCoder, which separates code generation, test generation, and test execution into three agents. On HumanEval, the multi-agent setup achieved 79.9% pass@1 compared to 71.3% for a single agent handling both generation and testing. The gap is even larger for test accuracy — 87.8% vs 61.0%.
pass@1 and test accuracy — independent test generation eliminates self-testing bias
Source: Huang et al., 'AgentCoder: Multi-Agent Code Generation with Iterative Testing and Optimisation', 2024
The reason this works is interesting: when a single agent generates both code and tests, its tests are biased by the same assumptions that shaped the code. If the code misses an edge case, the tests miss it too — they're products of the same reasoning context. Separating the test writer from the code writer breaks that bias.
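The structure is simple to sketch. This is not AgentCoder's actual API — `coder` and `tester` stand in for separate model calls — but it captures the one design point that matters: the tester sees the spec, never the code:

```python
from typing import Callable

def agentcoder_round(task: str,
                     coder: Callable[[str], str],
                     tester: Callable[[str], str]) -> dict[str, str]:
    # Coder and tester receive the same spec but run in separate
    # contexts. Because the tester never sees the generated code, its
    # tests can't inherit the coder's blind spots.
    code = coder(f"Implement this spec:\n{task}")
    tests = tester(f"Write tests for this spec (code not shown):\n{task}")
    return {"code": code, "tests": tests}
```

Run the returned tests against the returned code in a sandbox and you get an unbiased pass/fail signal — the third agent's job in the AgentCoder pipeline.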
Beyond test generation, multi-agent setups also perform well in a few other specific scenarios. Parallel code review — where separate agents evaluate security, performance, and test coverage independently — avoids the tendency for a single reviewer to anchor on whatever it notices first. Both Addy Osmani and the Builder.io team identify this as the strongest practical use case for agent teams. Competing hypotheses for debugging, where multiple agents pursue different theories simultaneously, can surface root causes faster than depth-first investigation. And genuinely file-independent work (frontend components, backend endpoints, and database migrations that don't share files) benefits from parallelism without paying the coordination tax.
The common thread across all of these: the agents don't need to coordinate on shared state. The moment they do — the moment agent B needs to know what agent A decided — the coordination overhead starts eating the gains.
#The compilation study
One paper shifted my thinking on this more than the others. A January 2026 study asked a simple question: what happens if you take a working multi-agent pipeline and "compile" it into a single agent with the same capabilities?
The researchers encoded each agent's behavior as a "skill" — a structured prompt that a single agent could invoke sequentially. They then compared the compiled single-agent version against the original multi-agent system across multiple benchmarks.
The compiled single-agent system matched or slightly exceeded the multi-agent system's accuracy — within −2.0% to +4.0% across all benchmarks, with an average improvement of +0.7%. Token consumption dropped by 53.7% on average.
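The mechanics of "compilation" can be sketched in a few lines. The skill names and templates here are invented, and `llm` stands in for a model call — the point is that one context accumulates across every skill, so nothing is summarized away at a handoff boundary:

```python
# Each former agent becomes a "skill": a prompt template that a single
# agent invokes in sequence over one shared, growing context.
SKILLS = [
    ("plan",      "Break this task into steps:\n\n{context}"),
    ("implement", "Write code following the plan above:\n\n{context}"),
    ("review",    "Review the code against the original task:\n\n{context}"),
]

def run_compiled(task: str, llm) -> str:
    context = f"[task]\n{task}"
    for name, template in SKILLS:
        # Every skill sees the full accumulated history -- there is no
        # handoff, so there is no handoff loss.
        output = llm(template.format(context=context))
        context += f"\n\n[{name}]\n{output}"
    return context
```

Compare this to the multi-agent version, where each stage would re-encode its own copy of the task and a summary of upstream work — the 53.7% token saving falls out of deleting exactly that duplication.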
This is worth sitting with. The multi-agent overhead — the coordination, the context reconstruction, the specification work — doesn't buy you accuracy. It buys you parallelism. And parallelism only helps when tasks are genuinely independent.
#What this means for how we're architecting agent interoperability at LORF
This research has direct implications for our delegation architecture. The v0.5 delegation bus — where front-desk delegates tasks to researcher — introduces exactly the failure modes documented in these studies. Context loss at the delegation boundary. Specification ambiguity in task objects. Verification gaps when the receiving agent operates with partial information.
We've been designing around some of this already, which is reassuring:
The shared memory layer (Recall integration with role scoping) means agents read from the same brain rather than reconstructing context from scratch. Delegation-aware cost tracking makes the token overhead of coordination visible rather than hidden. And human approval gates catch high-stakes errors before they compound through the delegation chain.
But the v2.0 vision of context-efficient delegation — where native agents pass references to shared memory rather than copying context — is where this research really validates our direction. The compilation paper's core finding is that shared context eliminates the multi-agent accuracy penalty. That's exactly what native memory integration does.
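A minimal sketch of what reference-passing could look like — this is an assumption about the v2.0 design, not the actual LORF API:

```python
class SharedMemory:
    """Toy shared store: agents exchange keys, not copies of context."""
    def __init__(self) -> None:
        self._store: dict[str, str] = {}

    def put(self, key: str, value: str) -> str:
        self._store[key] = value
        return key  # only the reference travels across the delegation bus

    def get(self, key: str) -> str:
        return self._store[key]

def build_task_prompt(memory: SharedMemory, task: str, refs: list[str]) -> str:
    # The delegating agent sends keys; the receiving agent resolves them
    # against the same shared brain. Both sides reason over identical
    # state instead of a lossy, manually assembled copy.
    context = "\n".join(memory.get(k) for k in refs)
    return f"TASK: {task}\n\nSHARED CONTEXT:\n{context}"
```

The delegation payload stays small and, more importantly, stays *current*: if the parent updates a decision after delegating, the worker resolving the reference sees the updated value, not a stale snapshot.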
The principle I'm taking from this: delegate when you can share state cheaply. Keep it in one head when you can't. And build the infrastructure that makes state-sharing cheap.
#Conclusion
So: is a singular parent Claude Code session more effective than agent teams or subagent-driven development for coding tasks?
Measured by the rate of error-free code that maintains feature and functionality contracts — yes, for the majority of coding tasks. The research is fairly consistent on this.
The Berkeley study found that once a single agent was given final authority over a task that consistently failed under multi-agent coordination, the task succeeded. That's not a finding about model capability — it's a finding about the cost of coordination.
And that cost isn't shrinking. Fan et al. found that the multi-agent advantage diminishes as LLMs grow more capable — context windows expand, models get better at using long contexts, and the accuracy gap narrows. But the coordination overhead stays roughly constant. Anthropic's own Agent Teams documentation draws a similar line: for sequential tasks, same-file edits, or work with many dependencies, a single session is more effective. Their 2026 Agentic Coding Trends Report notes that isolated context windows per specialist still require shared artifact stores and coordination protocols to avoid drift. Morph LLM's comparative testing of 15 coding agents reinforces this: the same model scored 17 problems apart in different agent frameworks. Scaffolding matters more than model choice.
For us at LORF, this means our delegation bus needs to be designed with the understanding that delegation has a real cost in reliability. The architecture should make it easy to share state (reducing the biggest failure category) and make the cost of delegation visible (so operators can make informed decisions about when to delegate and when to keep work in a single agent). That's what we're building toward.
#References
- Cemri, M., Pan, M.Z., Yang, S. — "Why Do Multi-Agent LLM Systems Fail?", UC Berkeley (2025)
- Huang, D. et al. — "AgentCoder: Multi-Agent Code Generation with Iterative Testing and Optimisation" (2024)
- Vincent, J. — Superpowers: Subagent-Driven Development skill (2025)
- Anthropic — "Orchestrate teams of Claude Code sessions" (2026)
- Kinney, S. — "Common Sub-Agent Anti-Patterns and Pitfalls", Developing with AI Tools (2026)
- Augment Code — "Why Multi-Agent LLM Systems Fail (and How to Fix Them)" (2025)
- Galileo AI — "Why Do Multi-Agent LLM Systems Fail" (2025)
- Wave Access — "Why multi-agent systems fail: three causes and how to fix them" (2026)
- Augment Code — "How to Build a Multi-Agent AI System for Code Development" (2026)
- Fan, C. et al. — "Single-agent or Multi-agent Systems? Why Not Both?" (2025)
- Zhuge, M. et al. — "When Single-Agent with Skills Replace Multi-Agent Systems and When They Fail" (2026)
- Zhang, Q. et al. — "A Comprehensive Empirical Evaluation of Agent Frameworks on Code-centric Software Engineering Tasks" (2025)
- Gupta, R. — "Why Multi-Agent Systems Often Fail in Practice" (2025)
- Osmani, A. — "Claude Code Swarms" (2026)
- Builder.io — "You Probably Don't Need Claude Agent Teams" (2026)
- Vincent, J. — "Superpowers 5" (2026)
- VS Code — "Subagents in Visual Studio Code" (2026)
- Anthropic — "2026 Agentic Coding Trends Report" (2026)
- Swarmia — "Five levels of AI coding agent autonomy" (2026)
- Morph LLM — "We Tested 15 AI Coding Agents (2026)" (2026)