# 12 Papers on Agent Runtimes: What Worked, What Didn't
Building an agent runtime in 2026 is a problem with too many published opinions and too few measured ones. Before we committed to design choices in our own kernel — a small TypeScript runtime called Auggy (augment-1) — we wanted to know which claims in the recent literature were grounded in numbers we could trace.
We read twelve sources: nine arXiv-listed research papers, two industry publications, and a structured adversarial review of one widely-deployed open-source framework. This is what survived contact with the evidence — and what didn't.
The strongest empirical findings cluster around a single observation: every part of an agent you treat as static — tool catalogs, context windows, memory APIs, bootstrap files — costs measurable accuracy, measurable tokens, or both. The clearest failures came from frameworks that built specialized infrastructure for problems that turned out to have simpler structural answers.
A caveat up front: we have not validated any of this against our own production traffic. The kernel is pre-v0. The findings here are external evidence we used to make design choices, not results we re-measured. Where a finding shaped a decision in our own kernel, we say so.
The audience we had in mind is anyone in the same chair: building or maintaining a runtime, picking a memory architecture, deciding how big a tool catalog gets, choosing whether to ship contracts or just hope. Every load-bearing claim is traceable to a primary source listed at the end.
## What worked: empirical findings that held up
### Context is a memory hierarchy, not a single tier
Mason's Missing Memory Hierarchy paper [3] measured 21.8% of production input tokens as structural waste across 857 sessions and 4.45 million tokens; his Pichay proxy cut context consumption by up to 93% in live deployment. AgentRM [1] reported its Context Lifecycle Manager retained 100% of key information at 95% quality, against 65.1% retention at 87% quality for existing approaches. AIOS [7] made context management an explicit kernel primitive in 2024 and reported 2.1× end-to-end speedup.
Mason puts the framing bluntly: context limits, attention degradation, cost scaling, and lost cross-session state are virtual memory problems wearing different clothes.
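The shared shape of these results can be sketched as a two-tier context store with demand paging. This is a minimal illustration in the kernel's own language, not Pichay's or AgentRM's design; the class name, budget accounting, and stub format are all our own assumptions.

```typescript
// Hypothetical two-tier context store: hot entries occupy the prompt budget,
// cold ones are evicted to a backing store and replaced by a short stub the
// model can page back in on demand.
type Entry = { id: string; text: string; lastUsed: number };

class TieredContext {
  private hot = new Map<string, Entry>();
  private cold = new Map<string, Entry>(); // backing store ("disk" tier)
  private tick = 0;

  constructor(private hotBudgetChars: number) {}

  add(id: string, text: string): void {
    this.hot.set(id, { id, text, lastUsed: ++this.tick });
    this.evictIfOverBudget();
  }

  // Demand paging: touching a cold entry promotes it back to the hot tier.
  read(id: string): string | undefined {
    const paged = this.cold.get(id);
    if (paged) {
      this.cold.delete(id);
      this.hot.set(id, paged);
    }
    const e = this.hot.get(id);
    if (e) e.lastUsed = ++this.tick;
    this.evictIfOverBudget();
    return e?.text;
  }

  // What reaches the prompt: full text for hot entries, one-line stubs for
  // cold ones so the model knows what it can page back in.
  renderPrompt(): string {
    const hot = [...this.hot.values()].map((e) => e.text);
    const stubs = [...this.cold.keys()].map((id) => `[paged out: ${id}]`);
    return [...hot, ...stubs].join("\n");
  }

  private evictIfOverBudget(): void {
    let size = [...this.hot.values()].reduce((n, e) => n + e.text.length, 0);
    while (size > this.hotBudgetChars && this.hot.size > 1) {
      // Evict the least-recently-used entry to the cold tier.
      const lru = [...this.hot.values()].sort((a, b) => a.lastUsed - b.lastUsed)[0];
      this.hot.delete(lru.id);
      this.cold.set(lru.id, lru);
      size -= lru.text.length;
    }
  }
}
```

The eviction policy here is plain LRU for brevity; any of the papers' smarter policies would slot into `evictIfOverBudget` without changing the tier structure.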
### Small tool catalogs are dramatically better than big ones
ALARA [2] measured tool-invocation accuracy across 22 models, 115 tasks, and 2,530 executions. Accuracy fell from approximately 95% to approximately 25% as catalog size grew from one to eight tools — a curve that emerges within the range most frameworks ship by default. The paper is also explicit that prompt-based restriction ("do not use X unless…") produces "a suggestion the model may or may not follow"; structural omission is the only enforcement that meaningfully matters.
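A minimal sketch of what structural omission looks like in practice, assuming a hypothetical catalog and a tag-based relevance filter (neither is from the ALARA paper): the request handed to the model contains only the selected subset, so a disallowed tool cannot be invoked because it was never offered.

```typescript
// Structural tool restriction: the model only ever sees the tools relevant
// to the current turn. "Don't use X" is enforced by omission, not by a
// prompt instruction the model may or may not follow.
type Tool = { name: string; description: string; tags: string[] };

const catalog: Tool[] = [
  { name: "read_file", description: "Read a file", tags: ["fs"] },
  { name: "write_file", description: "Write a file", tags: ["fs"] },
  { name: "http_get", description: "Fetch a URL", tags: ["net"] },
  { name: "run_shell", description: "Run a shell command", tags: ["exec"] },
];

// Pick at most `limit` tools whose tags match the task; everything else is
// structurally absent from the request.
function selectTools(taskTags: string[], limit = 3): Tool[] {
  return catalog
    .filter((t) => t.tags.some((tag) => taskTags.includes(tag)))
    .slice(0, limit);
}

// The provider request carries only the selected subset.
function buildRequest(prompt: string, taskTags: string[]) {
  return { prompt, tools: selectTools(taskTags) };
}
```

For example, `buildRequest("summarize the config", ["fs"])` exposes only the two filesystem tools; `run_shell` is not merely discouraged, it does not exist for that turn.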
### Tool-call granularity is the right resource boundary
AgentCgroup [9] ran an empirical characterization across 144 SWE-rebench tasks: OS-level execution accounts for 56–74% of end-to-end task latency — the majority of wall-clock time is infrastructure, not LLM inference — and memory spikes are tool-call-driven, with up to 15.4× peak-to-average ratios.
The practical implication is a layer mismatch. Production agents typically run inside containers with memory caps set on the container, but the container holds the whole agent for its lifetime while the burst happens during one specific tool call lasting milliseconds. Set the cap high enough to handle peaks and you over-provision the rest of the time; set it at average and the agent dies every time a burst hits. Container limits face an impossible choice because they're enforcing at the wrong layer — the cap can't tell the difference between an agent doing normal work and an agent inside a tool call that needs 10× the memory for 200ms.
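A sketch of enforcement at the tool-call boundary, under stated assumptions: the tool names, budget numbers, and wrapper API are ours, and only the time budget is actually enforced here; a real kernel would scope a cgroup memory cap to the call the way AgentCgroup does.

```typescript
// Per-tool-call resource budgets: each tool declares its own limits, and the
// runtime scopes enforcement to the duration of that one call rather than
// capping the whole container for the agent's lifetime.
type Budget = { maxMs: number; maxMemMb: number };

const defaultBudget: Budget = { maxMs: 1_000, maxMemMb: 256 };
// A bursty tool gets a large cap for milliseconds, not forever.
const budgets: Record<string, Budget> = {
  build_project: { maxMs: 60_000, maxMemMb: 4_096 },
};

async function runToolCall<T>(name: string, fn: () => Promise<T>): Promise<T> {
  const budget = budgets[name] ?? defaultBudget;
  // (budget.maxMemMb would be applied here via a cgroup scoped to this call.)
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(
      () => reject(new Error(`${name} exceeded ${budget.maxMs}ms budget`)),
      budget.maxMs,
    );
  });
  try {
    return await Promise.race([fn(), timeout]);
  } finally {
    clearTimeout(timer);
  }
}
```

The point of the shape is that the cap travels with the call: the same agent can run one tool under a 4 GB burst allowance and the next under 256 MB, which a container-lifetime limit cannot express.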
### Working-memory compression should be the model's decision
As an agent works through a multi-step task, history piles up: action 1, observation 1, action 2, observation 2, and so on. After enough turns, most of the prompt is old action-observation pairs, most of which are no longer relevant to the current step.
HiAgent [4] demonstrated a 2× success rate and a 3.8-step reduction across five long-horizon tasks by changing who decides what to drop. The model is prompted to declare a subgoal before generating actions ("first, find the right config file"); when the subgoal completes, the LLM itself replaces the action-observation pairs from that subgoal with a brief summary of what was learned, freeing context for the next subgoal.
Compare this against the usual approach: sliding-window compression ("keep the last N turns") or periodic summarization ("every M turns, fold everything older into a summary"). Both work mechanically; neither has any idea whether the thing being dropped is load-bearing for the next action.
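A toy rendering of the subgoal-scoped alternative; the class and method names are ours, and the summary here is supplied by the caller where HiAgent has the LLM write it.

```typescript
// HiAgent-style working memory: turns are grouped under the subgoal the
// model declared. When the model reports a subgoal done, its summary
// replaces the raw action-observation pairs for that subgoal.
type Turn = { action: string; observation: string };

class WorkingMemory {
  private history: Array<{ subgoal: string; turns: Turn[]; summary?: string }> = [];

  declareSubgoal(subgoal: string): void {
    this.history.push({ subgoal, turns: [] });
  }

  record(turn: Turn): void {
    const current = this.history[this.history.length - 1];
    if (!current) throw new Error("declare a subgoal first");
    current.turns.push(turn);
  }

  // Model-decided compression: raw turns are dropped; only the model's
  // summary of what was learned survives.
  completeSubgoal(summary: string): void {
    const current = this.history[this.history.length - 1];
    if (!current) throw new Error("declare a subgoal first");
    current.summary = summary;
    current.turns = [];
  }

  renderPrompt(): string {
    return this.history
      .map((g) =>
        g.summary
          ? `[done] ${g.subgoal}: ${g.summary}`
          : [`[active] ${g.subgoal}`, ...g.turns.map((t) => `${t.action} -> ${t.observation}`)].join("\n"),
      )
      .join("\n");
  }
}
```

The contrast with a sliding window is visible in `completeSubgoal`: the decision of what to drop is made at a semantic boundary the model chose, not at an arbitrary turn count.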
### Behavioral contracts catch real violations cheaply
Agent Behavioral Contracts [8] ran an evaluation across 200 scenarios, 7 models, 6 vendors, and 1,980 sessions. The numbers are striking: 5.2–6.8 violations caught per session, with enforcement overhead under 10ms.
The paper also proves a Drift Bounds Theorem: agents wander from intended behavior over long sessions — violating constraints, inventing tools that don't exist, gradually ceasing to follow instructions — and without correction this drift accumulates without bound. The theorem says that if your runtime can detect-and-correct violations at rate γ faster than the agent drifts at rate α, expected drift stays mathematically bounded, settling at a steady-state error of α/γ.
The same shape as a thermostat: room temperature drifts at rate α (heat loss through walls), the heater corrects at rate γ; if γ > α the room settles near setpoint with steady-state error α/γ, if γ < α the room cools forever. Crucially, this is a theorem, not a measurement: empirical results tell you "it worked on the cases we tried"; a theorem tells you "it will work whenever the precondition (γ > α) holds." That's a much stronger thing to ship behind.
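A toy discrete-time simulation of that shape (not the paper's model): drift grows by α per step, and the correction loop removes a γ fraction of the accumulated drift each step.

```typescript
// Toy drift dynamics. With gamma > 0 the drift settles below alpha/gamma;
// with gamma = 0 (no correction loop) it grows linearly forever. The step
// model and constants are illustrative, not from the paper.
function simulateDrift(alpha: number, gamma: number, steps: number): number {
  let drift = 0;
  for (let i = 0; i < steps; i++) {
    drift += alpha;         // the agent drifts a little every step
    drift -= gamma * drift; // the runtime corrects a fraction of what accumulated
  }
  return drift;
}
```

Running it with α = 0.05 and γ = 0.5 settles near a small constant under the α/γ bound, while the same α with γ = 0 climbs without limit, which is the thermostat picture in numbers.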
### Kernel minimalism is a position multiple independent teams have moved toward
OpenAI's Agents SDK [11] ships three primitives — Agent, Handoff, Guardrail — with the explicit design constraint: "enough features to be worth using, but few enough primitives to make it quick to learn." AIOS [7] converged on five (scheduling, context, memory, storage, access control). Letta's March 2026 pivot [10] is a related signal on a different axis: they deprecated their tool-rules system "to avoid inhibiting frontier capabilities," explicitly removing kernel-level behavioral constraints in favor of trusting the model.
### Static RAG is not enough for long-term agentic memory
Standard retrieval-augmented generation works by embedding everything (documents, past conversations, observations) into a vector space and pulling back the entries most semantically similar to the current query. That works well when the relevant information shares vocabulary with what the model is asking about — but long-term memory isn't a flat collection of similar items, it's a network of references.
SYNAPSE [5] named the failure mode "Contextual Tunneling": the embedding can't traverse across conceptually-distant but causally-connected pieces of memory.
Imagine an agent that learned in turn 5 that you're working with a contractor named Sam, then in turn 50 you ask "what did Sam recommend?" The relevant earlier observation might be "the contractor said the roof needs work" — embedded nowhere near "Sam recommend" in vector space, but the right answer. Vector similarity alone cannot follow the chain Sam → contractor → recommendation.
SYNAPSE proposes a hybrid: geometric embeddings (good at semantic similarity) combined with spreading activation over a memory graph — when you query "Sam," activation propagates from the Sam node to its neighbors (contractor role, recommendations made, related interactions), surfacing connected information even when it isn't semantically close to the query string. The hybrid outperforms either approach alone on LoCoMo's temporal and multi-hop reasoning tasks.
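A toy version of the spreading-activation half, using the Sam example above; the graph, decay factor, and hop limit are illustrative, not SYNAPSE's actual mechanism (which also includes geometric embeddings and lateral inhibition).

```typescript
// Spreading activation over a memory graph: activating "Sam" propagates
// energy through links to surface the contractor's recommendation, even
// though it shares no vocabulary with the query string.
type Graph = Map<string, string[]>;

function spreadActivation(
  graph: Graph,
  seed: string,
  decay = 0.5,
  hops = 3,
): Map<string, number> {
  const activation = new Map<string, number>([[seed, 1]]);
  let frontier = [seed];
  for (let hop = 0; hop < hops; hop++) {
    const next: string[] = [];
    for (const node of frontier) {
      const energy = (activation.get(node) ?? 0) * decay;
      for (const neighbor of graph.get(node) ?? []) {
        if (!activation.has(neighbor)) {
          activation.set(neighbor, energy);
          next.push(neighbor);
        }
      }
    }
    frontier = next;
  }
  return activation;
}

// The chain a pure vector lookup cannot follow: Sam -> contractor -> roof.
const memory: Graph = new Map([
  ["Sam", ["contractor role"]],
  ["contractor role", ["roof needs work"]],
  ["roof needs work", []],
]);
```

Querying the seed "Sam" activates "roof needs work" at decay², purely through graph structure; the hybrid system would then blend this score with embedding similarity.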
## What didn't: failures with traceable causes
### Specialized memory APIs lost to a filesystem
Letta — formerly known as MemGPT, with an initial product centered on specialized server-side memory tools (core_memory_replace, archival memory, recall memory) — published a research result in August 2025 showing their plain-filesystem implementation scored 74.0% on LoCoMo, beating their own specialized memory libraries head-to-head on the same public benchmark.
In March 2026 they shipped Our Next Phase [10], deprecating the specialized APIs in favor of generalized computer-use tools operating on git-backed files.
This is the cleanest published case of a stateful-agent-runtime company concluding that one of its primary differentiators was the wrong abstraction.
### Hardcoded bootstrap files don't scale beyond personal-assistant use cases
OpenClaw, a widely-deployed open-source agent framework, auto-loads exactly seven named bootstrap files per session [12]:
| File | Loaded for | Purpose |
|---|---|---|
| SOUL.md | Every session | Agent identity and personality |
| AGENTS.md | Every session, including sub-agents | Behavioral rules |
| USER.md | Every session | Information about the user |
| IDENTITY.md | Every session | Agent self-identity |
| TOOLS.md | Every session, including sub-agents | Environment-specific tool notes |
| HEARTBEAT.md | Every session | Periodic check-in instructions |
| BOOTSTRAP.md | One-time only | First-run ritual, deleted after |
Per-file cap: 20,000 characters (publicly documented in Issue #54623, with silent truncation past the limit). Total cap: 150,000 characters. There is no additionalDirectories configuration, no glob loading, no custom file lists. Sub-agents receive only AGENTS.md and TOOLS.md — five of the seven files are excluded from delegated work.
The design is internally consistent for a personal-assistant agent and works well within that scope. The agent can still read other files at runtime via tools — that isn't the constraint. The constraint is what auto-loads into every session's context. The seven slots cover what a personal assistant should always have on hand: who I am, who I serve, what I can do. The always-on-hand set is wider for other shapes of agent — product documentation and incident history for customer support, project notes and paper citations for research, a people directory and meeting context for organizational work. None of it fits the seven named slots, and there's no additionalDirectories knob to add more. An agent can fs_read these materials per turn, but that converts ambient knowledge into a tool-call cost the model has to remember to pay every turn. The bootstrap layer was designed around one shape of agent and has no expansion path for any other shape.
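For contrast, here is one hypothetical shape an expansion path could take; none of this exists in OpenClaw. Fixed named slots plus operator-configured directories, under the same documented caps, with truncation surfaced instead of silent.

```typescript
import * as fs from "node:fs";
import * as path from "node:path";

// Caps taken from the publicly documented limits; the loader API itself is
// our invention, sketched to show what an additionalDirectories knob could
// look like.
const PER_FILE_CAP = 20_000;
const TOTAL_CAP = 150_000;

type Loaded = { file: string; text: string; truncated: boolean };

function loadBootstrap(namedFiles: string[], additionalDirectories: string[]): Loaded[] {
  const candidates = [
    ...namedFiles,
    ...additionalDirectories.flatMap((dir) =>
      fs.readdirSync(dir).filter((f) => f.endsWith(".md")).map((f) => path.join(dir, f)),
    ),
  ];
  const out: Loaded[] = [];
  let total = 0;
  for (const file of candidates) {
    if (!fs.existsSync(file) || total >= TOTAL_CAP) continue;
    const raw = fs.readFileSync(file, "utf8");
    const cap = Math.min(PER_FILE_CAP, TOTAL_CAP - total);
    const text = raw.slice(0, cap);
    total += text.length;
    // Truncation is reported, not silent.
    out.push({ file, text, truncated: raw.length > text.length });
  }
  return out;
}
```

The point is not this particular API but that the caps and the slot list are orthogonal: keeping the documented limits while opening the candidate set would let a support or research agent carry its own ambient material.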
### Unrestricted tool execution as a structural gap
OpenClaw Issue #12565, opened February 2026, documents "Agent Runtime: Unrestricted Tool Execution Leading to Privilege Escalation," labeled with both bug and security, CVSS 4.5, CWE-862 (Missing Authorization). The mitigation in the public ecosystem is a third-party governance plugin written by a community contributor.
Whatever the intent, the structural absence of authorization at the runtime layer is a fixable gap.
### Model-version regressions accumulate when there's no place to handle them
Three stacked memory-flush bugs are documented in OpenClaw's interaction with Claude Sonnet 4.6's 1M-token context window, each traced to a specific GitHub issue. A community-submitted fix PR was reportedly rejected as spam. Every framework has bugs — that isn't the point. The point is the compounding gap: a runtime with no dedicated handling for model-version regressions, plus a maintainer process that closed out the available community fix. Together these produce silent breakage on the most-deployed long-context model.
### Eval ecosystems can fragment to the point of uselessness
OpenClaw's main repository ships no internal eval harness. The community-built ecosystem consists of three projects with no shared methodology:
| Project | Status | Coverage |
|---|---|---|
| Claw-Eval | ~340 stars, active | End-state Pass^3 grading, 139 tasks |
| openclaw-memory-bench | ~1 star, single developer | Memory retrieval IR metrics only |
| FinClaw skill-creator | Embedded in Chinese-language fork | Meta-eval inside skill creation |
The pattern to avoid: evals as an afterthought, delegated to the ecosystem, with no standard the kernel can rely on for regression detection.
## Where the research disagrees with itself
The corpus is not internally consistent, and the disagreements are themselves informative.
Episodic memory: missing primitive or emergent effect of good files? Pink et al. [6] argue episodic memory is the missing capability for long-term LLM agents and enumerate five formal properties any episodic memory system must support. Letta's filesystem result suggests files-on-disk plus naming conventions may already cover most of the practical gap. Both can be true: the position paper is a research roadmap; the production result is a practitioner workaround that captures most of the value. The unsettled question is how much gap files-as-memory leaves and which kinds of agents notice.
Tool catalogs: smaller is better empirically; frontier models keep shipping more. ALARA's evidence on catalog-size accuracy is unambiguous in the small-N regime. Frontier model releases keep increasing the number of tools an agent can hold simultaneously, and practitioners keep shipping agents with twenty-plus tools because the alternatives require extra infrastructure. The gap between what the evidence shows and what the field does is wide and is not closing.
Kernel-level scheduling: AIOS and AgentRM say yes; OpenAI Agents SDK is deliberately kernel-less. The OS-style papers argue scheduling, resource isolation, and context management belong inside a runtime kernel. OpenAI shipped a kernel-less SDK with three primitives and trusts the host language for orchestration. Both communities are coherent within their own assumptions; the assumptions differ. The choice is consequential and is not something the literature settles.
## A note on evals
A separate survey of the eval landscape (compiled April 2026) traced the practitioner consensus from Anthropic's published guidance, shipping engineers, and recent academic sources. The principles recur with high agreement:
- Grade outcomes, not transcripts.
- Use Pass^k rather than Pass@1 to eliminate one-shot luck.
- Build a starting suite of 20–50 tasks derived from real failures, rather than 500 synthetic ones.
- Use deterministic graders where possible, and binary, human-calibrated LLM-as-judge where not.
- Keep capability evals separate from regression evals — they answer different questions.
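The Pass^k principle above is small enough to sketch; the grading function below is ours, and the per-task trial results would come from a real harness that executes each task end-to-end and grades the end state.

```typescript
// Pass^k grading: a task counts as passed only if all k independent trials
// pass, which filters out one-shot luck that Pass@1 rewards.
function passK(trialResults: boolean[][]): number {
  // trialResults[task][trial] — each task must pass every trial.
  const passed = trialResults.filter((trials) => trials.every(Boolean)).length;
  return passed / trialResults.length;
}
```

A task that passes two of three trials scores 2/3 under a per-trial average but 0 under Pass^3, which is exactly the flaky behavior a regression suite wants to surface rather than smooth over.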
The fragmented OpenClaw ecosystem is the cautionary case for what happens when none of this lives at the runtime layer. A runtime that ships eval affordances from the first commit is doing operational work an external community project cannot.
## Takeaway
If we had to compress this corpus into a few constraints a runtime designer could carry around:
- Treat context as a memory hierarchy with eviction. The single-tier design is not free.
- Keep tool catalogs small, structurally enforced, and dynamic per turn.
- Don't ship specialized memory APIs in the kernel; let the provider contract stay abstract.
- Push behavioral rules into structure — types, capabilities, validation — not prose.
- Ship the fewest primitives you can defend, and resist the urge to add a primitive for every recurring pattern.
- Build eval affordances into the runtime from the first commit.
- Don't hardcode the shape of an agent's identity layer; the shape that fits a personal assistant doesn't fit anything else.
## References
1. She, J. — AgentRM: An OS-Inspired Resource Manager for LLM Agent Systems, arXiv:2603.13110 (2026). MLFQ scheduling; Context Lifecycle Manager; 40,000+ GitHub issues analyzed across six frameworks.
2. Agostino, C.J. & D'Souza, N. — ALARA for Agents: Least-Privilege Context Engineering Through Portable Composable Multi-Agent Teams, arXiv:2603.20380 (2026). The 95% → 25% accuracy curve at 1 → 8 tools across 22 models and 2,530 task executions.
3. Mason, T. — The Missing Memory Hierarchy: Demand Paging for LLM Context Windows, arXiv:2603.09023 (2026). 21.8% structural waste in production; Pichay demand-paging proxy.
4. Hu, M. et al. — HiAgent: Hierarchical Working Memory Management for Solving Long-Horizon Agent Tasks, arXiv:2408.09559 / ACL 2025. Subgoal-based working-memory compression decided by the LLM.
5. Jiang, H. et al. — SYNAPSE: Empowering LLM Agents with Episodic-Semantic Memory via Spreading Activation, arXiv:2601.02744 (2026). Spreading-activation memory with lateral inhibition; the Contextual Tunneling problem.
6. Pink, M. et al. — Position: Episodic Memory is the Missing Piece for Long-Term LLM Agents, arXiv:2502.06975 (2025). Five formal properties of episodic memory.
7. Mei, K. et al. — AIOS: LLM Agent Operating System, arXiv:2403.16971 (2024). The original LLM-OS framing; five kernel primitives.
8. Bhardwaj, V.P. — Agent Behavioral Contracts: Formal Specification and Runtime Enforcement for Reliable Autonomous AI Agents, arXiv:2602.22302 (2026). Drift Bounds Theorem; 5.2–6.8 violations per session caught at under 10ms enforcement overhead.
9. Zheng, Y. et al. — AgentCgroup: Understanding and Controlling OS Resources of AI Agents, arXiv:2602.09345 (2026). Tool-call granularity; OS-level execution as 56–74% of end-to-end latency.
10. Letta — Our Next Phase, Letta blog (March 2026). Deprecation of specialized memory APIs in favor of git-backed files; references Letta's August 2025 LoCoMo result (74.0% on a plain filesystem).
11. OpenAI — New tools for building agents, OpenAI blog (March 2025). Three primitives — Agent, Handoff, Guardrail — with the explicit "very small set" design commitment.
12. OpenClaw adversarial review — A structured review compiled from public GitHub issues (Issues #54623, #12565, #52899) and operator-side documentation of bootstrap-file constraints, memory-flush bugs in long-context configurations, and the fragmented eval ecosystem (Claw-Eval, openclaw-memory-bench, FinClaw).
A separate eval-landscape survey (April 2026) draws on Anthropic's published eval guidance, multiple practitioner blog posts, and the structure of Claw-Eval, openclaw-memory-bench, and the FinClaw skill-creator loop. Full per-paper provenance, including verification status and direct quotes, is in the augment-1 repo at docs/research/research-provenance.md.