12 Papers on Agent Runtimes: What Worked, What Didn't

Michael Hofweller · Apr 15, 2026 · 18 min read
Tags: ai-agents, agent-runtimes, agent-frameworks, memory-systems, tool-use, research-synthesis

Before we committed to design choices in our own kernel — a small TypeScript runtime called Auggy (augment-1, v0.1.1) — we wanted to separate what the literature can actually demonstrate from what it only asserts.

We read twelve sources: nine arXiv-listed research papers, two industry publications, and a structured adversarial review of one widely-deployed open-source framework. This is what survived contact with the evidence — and what didn't.

  • 95% → 25% (accuracy collapse): tool-call accuracy as the catalog grows from 1 → 8 tools (ALARA)
  • 21.8% (token waste): production input tokens that are structural waste (Mason)
  • 74.0% (simpler system wins): LoCoMo score where a plain filesystem beat specialized memory APIs (Letta)

The strongest empirical findings cluster around a single observation: every part of an agent you treat as static — tool catalogs, context windows, memory APIs, bootstrap files — costs measurable accuracy, measurable tokens, or both. The clearest failures came from frameworks that built specialized infrastructure for problems that turned out to have simpler structural answers.

A caveat up front: we have not validated any of this against our own production traffic. The kernel is at v0.1.1 and not yet carrying production load. The findings here are external evidence we used to make design choices, not results we re-measured. Where a finding shaped a decision in our own kernel, we say so.


#What worked: empirical findings that held up

#Context is a memory hierarchy, not a single tier

Mason's Missing Memory Hierarchy paper [3] measured a corpus of 857 production sessions totaling 4.45 million input tokens and found 21.8% of those were structural waste — content carried forward turn after turn that wasn't doing any work. His Pichay proxy cut context consumption by up to 93% in live deployment. AgentRM [1] reported its Context Lifecycle Manager retained 100% of key information at 95% quality, against 65.1% retention at 87% quality for existing approaches. AIOS [7] made context management an explicit kernel primitive in 2024 and reported 2.1× end-to-end speedup.

Mason puts the framing bluntly: context limits, attention degradation, cost scaling, and lost cross-session state are virtual memory problems wearing different clothes.

Most systems implement context as a single tier: system + tool defs always-on, everything else appended as chronological history that slides — newest kept, oldest dropped. This is simple, and it is wrong in a very specific way: it assumes age is a proxy for importance. It isn't. The original user goal is often written in the first few turns and is also the most important piece of information in the whole session — in a sliding window, it's systematically the first thing to go.

A hierarchy labels content at write time by importance, then evicts lowest tier first:

  • Tier 0 — always resident. System, tools, original user goal, declared preferences.
  • Tier 1 — hot working set. Current subgoal, last few turns, memories retrieved for this turn.
  • Tier 2 — warm. Older session turns; first to go when budget tightens.
  • Tier 3 — cold. Earlier history paged out to a SQLite table or vector store; re-fetched only on demand, not in the prompt budget.

Imagine turn 50 of a session — 19k of in-window content competing for a 16k budget (tier 3 sits in cold storage, outside this exercise):

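| Tier | Size | Contents |
| --- | --- | --- |
| Tier 0 | 5.5k | sys + tools + original user goal |
| Tier 1 | 5.5k | current subgoal, last 3 turns, memories |
| Tier 2 | 8k | older session turns (turns 2–25) |

| Policy | Tier 0 | Tier 1 | Tier 2 |
| --- | --- | --- | --- |
| Sliding window | Cut | Keep | Keep |
| Hierarchy | Keep | Keep | Cut |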

The agent under a sliding window keeps working but has no memory of what was originally asked. Single-tier pays for whatever fits, every turn, forever, and most of it is not load-bearing (Mason's 21.8%). A hierarchy pays for tier 0 linearly, turns tier 1 over with the task, and pays for the expensive re-fetch from tier 3 only when it matters.
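A minimal sketch of the hierarchy policy, applied to the turn-50 numbers. The `ContextItem` shape and the `fitToBudget` function are hypothetical names for illustration, not Auggy's or Pichay's API, and treating each tier as a single block is a simplification.

```typescript
// Hypothetical types: tier numbers follow the list above.
type Tier = 0 | 1 | 2;

interface ContextItem {
  id: string;
  tier: Tier;     // assigned at write time, by importance
  tokens: number; // token count for this item
  turn: number;   // turn the item was written on
}

interface FitResult {
  kept: ContextItem[];
  evicted: ContextItem[]; // candidates for paging out to tier 3 (cold storage)
}

// Evict from the least important tier first; within a tier, oldest first.
// Tier 0 is never evicted: if tier 0 alone exceeds the budget, the caller
// has a configuration problem that eviction cannot solve.
function fitToBudget(items: ContextItem[], budget: number): FitResult {
  const evictionOrder = items
    .filter((item) => item.tier !== 0)
    .sort((a, b) => b.tier - a.tier || a.turn - b.turn);

  let total = items.reduce((sum, item) => sum + item.tokens, 0);
  const evicted: ContextItem[] = [];
  for (const candidate of evictionOrder) {
    if (total <= budget) break;
    evicted.push(candidate);
    total -= candidate.tokens;
  }
  const kept = items.filter((item) => evicted.indexOf(item) === -1);
  return { kept, evicted };
}

// The turn-50 example: 19k of content against a 16k budget.
const turn50: ContextItem[] = [
  { id: "sys+tools+goal", tier: 0, tokens: 5500, turn: 1 },
  { id: "older-turns", tier: 2, tokens: 8000, turn: 2 },
  { id: "working-set", tier: 1, tokens: 5500, turn: 47 },
];
const fitted = fitToBudget(turn50, 16000);
// fitted.kept holds tier 0 and tier 1 (11k tokens); fitted.evicted holds the tier 2 block.
```

A sliding window would sort by `turn` alone, which is exactly how the original goal ends up first in line for eviction.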

#Small tool catalogs are dramatically better than big ones

ALARA [2] measured tool-invocation accuracy across 22 models, 115 tasks, and 2,530 executions. Accuracy falls from ~95% to ~25% as the catalog grows from one to eight tools — a collapse that happens squarely within the range most frameworks ship by default.

This is not just a scaling issue; it's a structural one.

A tool catalog lives inside the prompt. Every additional tool:

  • consumes tokens
  • competes for attention
  • introduces ambiguity in selection

The model is forced to do two things at once: decide what tool is relevant and execute the reasoning for the task. As the catalog grows, these interfere.

Prompt-based restriction ("only use X if…") does not solve this. ALARA is explicit: it produces "a suggestion the model may or may not follow." The model is still exposed to all tools; nothing enforces the constraint.

The only mechanism that consistently works is structural omission: removing irrelevant tools entirely from the decision space.

This reframes the problem. The question is not "how many tools should we show?" It is "how do we decide which tools exist for this turn?"
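One way to picture structural omission is a per-turn filter that runs before the prompt is assembled. This is a sketch under assumptions: the `capabilities` tags, the `TurnContext` shape, and the selection rule are hypothetical, and ALARA's own mechanism may differ.

```typescript
// Hypothetical shapes; the capability tags and the selection rule are illustrative.
interface ToolDef {
  name: string;
  description: string;
  capabilities: string[]; // e.g. "fs", "web", "calendar"
}

interface TurnContext {
  neededCapabilities: string[]; // decided by the runtime before the model is called
  maxTools: number;             // hard cap on catalog size for this turn
}

// Return only the tools that exist for this turn. Anything not returned is
// never serialized into the prompt, so the model cannot select it.
function toolsForTurn(catalog: ToolDef[], ctx: TurnContext): ToolDef[] {
  const relevant = catalog.filter((tool) =>
    tool.capabilities.some((cap) => ctx.neededCapabilities.indexOf(cap) !== -1)
  );
  return relevant.slice(0, ctx.maxTools);
}

const catalog: ToolDef[] = [
  { name: "fs_read", description: "Read a file", capabilities: ["fs"] },
  { name: "fs_write", description: "Write a file", capabilities: ["fs"] },
  { name: "web_search", description: "Search the web", capabilities: ["web"] },
  { name: "calendar_add", description: "Create an event", capabilities: ["calendar"] },
];

const visible = toolsForTurn(catalog, { neededCapabilities: ["fs"], maxTools: 3 });
// visible contains fs_read and fs_write; web_search and calendar_add never reach the prompt.
```

The hard part this sketch skips is deciding `neededCapabilities` in the first place, which is the non-trivial infrastructure.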

#Tool-call granularity is the right resource boundary

If you've run agents in production, you already know this pattern — most of your wall-clock time is infrastructure, and memory usage looks nothing like steady-state. AgentCgroup [9] put numbers on it: 56–74% of end-to-end latency in agent workloads is OS-level execution, not LLM inference, and memory usage is highly non-uniform — tool calls create bursts with peak-to-average ratios up to 15.4×.

  • 56–74%: share of end-to-end latency that is OS-level execution, not inference
  • 15.4×: peak-to-average memory ratio inside tool calls

Most production systems enforce limits at the container level. This assumes resource usage is roughly stable over time. It isn't.

The reality is spiky:

  • the agent idles or performs light reasoning most of the time
  • a single tool call may require an order of magnitude more memory for milliseconds

Container limits cannot express this. If you provision for the peak, you waste resources most of the time. If you provision for the average, the agent crashes during bursts.

This is not a tuning problem. It's a layer mismatch.

The unit that bursts is the tool call. The unit being constrained is the container.
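To make the boundary concrete, here is a sketch of accounting at tool-call granularity: each call reserves its declared peak for its own duration and releases it on return. `ToolCallPool` and `ToolBudget` are hypothetical names, the numbers are made up, and real enforcement would sit in the OS rather than in application code.

```typescript
// Hypothetical admission controller. This only models the accounting;
// it does not limit memory by itself.
interface ToolBudget {
  tool: string;
  peakMemoryMb: number; // declared worst case for one call of this tool
}

class ToolCallPool {
  private capacityMb: number;
  private reservedMb = 0;

  constructor(capacityMb: number) {
    this.capacityMb = capacityMb;
  }

  // Reserve the declared peak for the duration of one call.
  // Returns false instead of over-committing when the pool is full.
  tryReserve(budget: ToolBudget): boolean {
    if (this.reservedMb + budget.peakMemoryMb > this.capacityMb) return false;
    this.reservedMb += budget.peakMemoryMb;
    return true;
  }

  // Release as soon as the call returns, so an idle agent holds nothing.
  release(budget: ToolBudget): void {
    this.reservedMb = Math.max(0, this.reservedMb - budget.peakMemoryMb);
  }

  inUseMb(): number {
    return this.reservedMb;
  }
}

const pool = new ToolCallPool(1024);
const build: ToolBudget = { tool: "run_tests", peakMemoryMb: 768 };
const grep: ToolBudget = { tool: "grep", peakMemoryMb: 512 };

const buildAdmitted = pool.tryReserve(build); // true: 768 of 1024 MB reserved
const grepAdmitted = pool.tryReserve(grep);   // false: would exceed capacity during the burst
pool.release(build);
const grepRetry = pool.tryReserve(grep);      // true: the burst is over
```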

#Working-memory compression should be the model's decision

Long-horizon tasks produce long histories: action, observation, action, observation. After enough steps, most of the prompt is stale interaction.

Frameworks handle this mechanically:

  • sliding windows drop the oldest turns
  • periodic summarization compresses everything beyond a threshold

Both approaches are blind. They operate on position, not meaning.

HiAgent [4] changes the control point. The model declares a subgoal ("find the correct config file"), executes against it, and when the subgoal completes, replaces the entire interaction trace with a summary of what was learned.

This matters because the model has information the framework does not:

  • what the current task is
  • what information is still relevant
  • what has been fully consumed

The reported result is a twofold increase in success rate and an average of 3.8 fewer steps on long-horizon tasks.
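The control flow can be sketched as a working memory that keeps the full trace only for the active subgoal. The `WorkingMemory` class below is a hypothetical data model rather than HiAgent's implementation, and the summary string is assumed to be written by the model when it declares the subgoal complete.

```typescript
// Hypothetical data model; summaries are assumed to come from the model itself.
interface Step {
  action: string;
  observation: string;
}

interface Subgoal {
  goal: string;
  steps: Step[];    // full trace while the subgoal is active
  summary?: string; // set on completion; replaces the trace in the prompt
}

class WorkingMemory {
  private subgoals: Subgoal[] = [];

  startSubgoal(goal: string): void {
    this.subgoals.push({ goal, steps: [] });
  }

  record(step: Step): void {
    const current = this.subgoals[this.subgoals.length - 1];
    if (!current || current.summary !== undefined) {
      throw new Error("no active subgoal");
    }
    current.steps.push(step);
  }

  // Called when the model declares the subgoal done: drop the trace, keep the summary.
  completeSubgoal(summary: string): void {
    const current = this.subgoals[this.subgoals.length - 1];
    if (!current) throw new Error("no active subgoal");
    current.summary = summary;
    current.steps = [];
  }

  // Finished subgoals render as one line each; only the active one renders in full.
  render(): string[] {
    const lines: string[] = [];
    for (const sg of this.subgoals) {
      if (sg.summary !== undefined) {
        lines.push("[done] " + sg.goal + ": " + sg.summary);
      } else {
        lines.push("[active] " + sg.goal);
        for (const step of sg.steps) {
          lines.push("  " + step.action + " -> " + step.observation);
        }
      }
    }
    return lines;
  }
}

const memory = new WorkingMemory();
memory.startSubgoal("find the correct config file");
memory.record({ action: "ls ./config", observation: "app.yaml, app.dev.yaml" });
memory.record({ action: "cat ./config/app.yaml", observation: "env: production" });
memory.completeSubgoal("production settings live in ./config/app.yaml");
memory.startSubgoal("update the timeout setting");
const rendered = memory.render();
// rendered has two lines: one summary for the finished subgoal, one header for the active one.
```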

#Behavioral contracts catch real violations cheaply

Agent Behavioral Contracts [8] demonstrates something most practitioners have observed informally: agents drift.

Over long sessions, they:

  • stop following instructions
  • call tools incorrectly
  • violate constraints

Without intervention, this drift accumulates.

The paper formalizes both the detection mechanism and the dynamics. Contracts define explicit constraints (e.g., maxToolCallsPerTurn, neverExpose). The runtime enforces them at execution time.
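A runtime-side check might look like the sketch below. Only the field names `maxToolCallsPerTurn` and `neverExpose` come from the text; the `Action`, `TurnState`, and `Verdict` shapes and the `enforce` function are assumptions made for illustration, not the paper's specification language.

```typescript
// Hypothetical contract shape built around the two constraint names above.
interface Contract {
  maxToolCallsPerTurn: number;
  neverExpose: string[]; // substrings that must not appear in any output
}

interface TurnState {
  toolCallsThisTurn: number;
}

type Action =
  | { kind: "tool_call"; tool: string }
  | { kind: "output"; text: string };

interface Verdict {
  allowed: boolean;
  violation?: string;
}

// Check one action against the contract before it executes.
function enforce(contract: Contract, state: TurnState, action: Action): Verdict {
  if (action.kind === "tool_call") {
    if (state.toolCallsThisTurn >= contract.maxToolCallsPerTurn) {
      return { allowed: false, violation: "maxToolCallsPerTurn" };
    }
    state.toolCallsThisTurn += 1;
    return { allowed: true };
  }
  for (const secret of contract.neverExpose) {
    if (action.text.indexOf(secret) !== -1) {
      return { allowed: false, violation: "neverExpose" };
    }
  }
  return { allowed: true };
}

const contract: Contract = { maxToolCallsPerTurn: 2, neverExpose: ["sk-live-"] };
const state: TurnState = { toolCallsThisTurn: 0 };

const first = enforce(contract, state, { kind: "tool_call", tool: "fs_read" });
const second = enforce(contract, state, { kind: "tool_call", tool: "fs_read" });
const third = enforce(contract, state, { kind: "tool_call", tool: "fs_read" });   // blocked
const leak = enforce(contract, state, { kind: "output", text: "key: sk-live-123" }); // blocked
```

The point of the sketch is placement: the check sits between the model's decision and its execution, so a violation is caught regardless of what the prompt said.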

Empirically, across 200 scenarios, 7 models, 6 vendors, and 1,980 sessions:

  • 5.2–6.8: soft violations per session caught vs uncontracted baselines (p < 0.0001)
  • <10 ms: runtime overhead per action, negligible vs inference latency
  • 88–100%: hard-constraint compliance under contracts

The deeper result is the Drift Bounds Theorem. Without correction, drift accumulates without bound. The theorem says that if your runtime can detect-and-correct violations at rate γ faster than the agent drifts at rate α, expected drift is mathematically bounded:

Drift Bounds Theorem (Bhardwaj 2026):

  • α = natural drift rate (agent wanders away from intended behavior)
  • γ = recovery rate (runtime detects and corrects violations)
  • Precondition: γ > α
  • Bound: D* = α / γ in expectation
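For intuition about where a bound of the form α / γ can come from, consider a toy model in which every step adds α of drift and the runtime removes a fraction γ of whatever has accumulated. This update rule is our assumption for illustration, not the dynamics the paper proves the theorem over, and the rates are made-up numbers.

```typescript
// Assumed dynamics (not taken from the paper): each step adds drift alpha and
// the runtime removes a fraction gamma of the accumulated drift.
function simulateDrift(alpha: number, gamma: number, steps: number): number {
  let drift = 0;
  for (let t = 0; t < steps; t++) {
    drift = drift + alpha - gamma * drift;
  }
  return drift;
}

const alpha = 0.05; // illustrative drift rate
const gamma = 0.25; // illustrative recovery rate, gamma > alpha
const bound = alpha / gamma;                      // D* = 0.2
const settled = simulateDrift(alpha, gamma, 200); // approaches D* from below
const uncorrected = simulateDrift(alpha, 0, 200); // no correction: grows with the step count
```

In this toy model the fixed point satisfies D = D + α - γD, which gives D = α / γ. With γ = 0 there is no fixed point and drift grows by α every step.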

#Kernel minimalism is a position multiple independent teams have moved toward

OpenAI's Agents SDK [11] ships three primitives — Agent, Handoff, Guardrail — with the explicit design constraint: "enough features to be worth using, but few enough primitives to make it quick to learn." AIOS [7] converged on five (scheduling, context, memory, storage, access control). Letta's March 2026 pivot [10] is a related signal on a different axis: they deprecated their tool-rules system "to avoid inhibiting frontier capabilities," explicitly removing kernel-level behavioral constraints in favor of trusting the model.

These are not identical approaches, but they converge on a shared instinct: don't over-encode structure in the kernel.

The failure mode is easy to miss. Every primitive you introduce:

  • assumes a stable abstraction
  • shapes how developers build systems
  • constrains what the model can do

If the abstraction is wrong — or just premature — it becomes a permanent limitation.

#Static RAG is not enough for long-term agentic memory

Standard retrieval-augmented generation embeds everything — documents, past conversations, observations — into a vector space and pulls back entries most semantically similar to the current query.

That works when the relevant information shares vocabulary with the query. Long-term memory isn't shaped that way. It's a network of references.

SYNAPSE [5] named the failure mode: Contextual Tunneling. The embedding can't traverse across conceptually-distant but causally-connected pieces of memory.

Imagine an agent that learned in turn 5 you're working with a contractor named Sam. In turn 50 you ask "what did Sam recommend?" The relevant earlier observation might be "the contractor said the roof needs work" — embedded nowhere near "Sam recommend" in vector space, but the right answer.

Vector similarity alone can't follow the chain:

Sam → contractor → recommendation

SYNAPSE's proposal is a hybrid:

  • geometric embeddings handle semantic similarity
  • spreading activation over a memory graph follows causal connections — query "Sam" and activation propagates from the Sam node to its neighbors (contractor role, recommendations made, related interactions)

The hybrid surfaces connected information even when it isn't semantically close to the query string, and outperforms either approach alone on LoCoMo's temporal and multi-hop reasoning tasks.
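The Sam example can be traced with a small spreading-activation sketch. The graph, edge weights, decay factor, and the 50/50 blend in `hybridScore` are all hypothetical; SYNAPSE's actual algorithm includes pieces omitted here, such as lateral inhibition.

```typescript
// Hypothetical memory graph: node ids and edge weights are made up for illustration.
type Graph = Record<string, Record<string, number>>; // node -> neighbor -> edge weight

function spreadActivation(
  graph: Graph,
  seeds: string[],
  decay: number,
  hops: number
): Record<string, number> {
  const activation: Record<string, number> = {};
  let frontier: Record<string, number> = {};
  for (const seed of seeds) {
    activation[seed] = 1;
    frontier[seed] = 1;
  }
  for (let hop = 0; hop < hops; hop++) {
    const next: Record<string, number> = {};
    for (const node of Object.keys(frontier)) {
      const neighbors = graph[node] || {};
      for (const neighbor of Object.keys(neighbors)) {
        const passed = frontier[node] * neighbors[neighbor] * decay;
        // Keep only the strongest path to each node.
        if (passed > (activation[neighbor] || 0)) {
          activation[neighbor] = passed;
          next[neighbor] = passed;
        }
      }
    }
    frontier = next;
  }
  return activation;
}

// Blend vector similarity with graph activation; the equal weighting is arbitrary.
function hybridScore(similarity: number, activation: number): number {
  return 0.5 * similarity + 0.5 * activation;
}

const graph: Graph = {
  "Sam": { "contractor": 1.0 },
  "contractor": { "roof needs work": 0.8 },
};
const activation = spreadActivation(graph, ["Sam"], 0.9, 2);

// Assumed similarity scores: the roof observation is far from the query in
// vector space (0.1), a lexically closer but unconnected memory scores 0.3.
const roofScore = hybridScore(0.1, activation["roof needs work"] || 0);
const unrelatedScore = hybridScore(0.3, 0);
// roofScore (about 0.37) outranks unrelatedScore (0.15) once activation is included.
```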


#What didn't: failures with traceable causes

#Specialized memory APIs lost to a filesystem

Specialized memory APIs tend to lose to general-purpose abstractions — files, shell commands, git — because the assumptions the API encodes about memory behavior turn out to be wrong in practice. Letta's trajectory is the cleanest case in the corpus: they built the APIs, measured them against a filesystem, and replaced them.

Formerly known as MemGPT, Letta's initial product centered on specialized server-side memory tools — core_memory_replace, archival memory, recall memory. In August 2025 they published a result showing their plain-filesystem implementation scored 74.0% on LoCoMo, beating their own specialized memory libraries head-to-head on the same public benchmark.

In March 2026 they shipped Our Next Phase [10], deprecating the specialized APIs in favor of generalized computer-use tools operating on git-backed files.

The key shift is not storage — it's interface. Files expose a general-purpose abstraction. Memory APIs encode assumptions about how memory should behave.

Those assumptions turned out to be wrong often enough to matter.

#Hardcoded bootstrap files don't scale beyond personal-assistant use cases

Bootstrap layers hardcoded to a single agent shape don't survive scope expansion. OpenClaw is the case study — a widely-deployed open-source agent framework that auto-loads exactly seven named bootstrap files per session [12]:

| File | Loaded for | Purpose |
| --- | --- | --- |
| SOUL.md | Every session | Agent identity and personality |
| AGENTS.md | Every session, including sub-agents | Behavioral rules |
| USER.md | Every session | Information about the user |
| BOOTSTRAP.md | One-time only | First-run ritual, deleted after |

Plus IDENTITY.md, TOOLS.md (one of the two files sub-agents also receive), and HEARTBEAT.md, all auto-loaded every session.

The caps are rigid:

  • per-file limit: 20,000 characters (Issue #54623, silent truncation past the limit)
  • total limit: 150,000 characters
  • no additionalDirectories configuration, no glob loading, no custom file lists
  • sub-agents receive only AGENTS.md and TOOLS.md — five of the seven files are excluded from delegated work
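To show how the two character caps interact, and what a non-silent version of the truncation could report, here is a hypothetical loader. It is not OpenClaw's code; only the two limits are taken from the list above, and `applyCaps` and the `truncatedChars` field are names invented for this sketch.

```typescript
// Hypothetical loader. The two limits are the ones quoted above.
const PER_FILE_LIMIT = 20000;
const TOTAL_LIMIT = 150000;

interface BootstrapFile {
  name: string;
  content: string;
}

interface LoadedFile {
  name: string;
  content: string;
  truncatedChars: number; // 0 when the file fit; otherwise how much was dropped
}

function applyCaps(files: BootstrapFile[]): LoadedFile[] {
  let remaining = TOTAL_LIMIT;
  return files.map((file) => {
    const allowed = Math.min(PER_FILE_LIMIT, remaining);
    const content = file.content.slice(0, allowed);
    remaining -= content.length;
    return {
      name: file.name,
      content,
      truncatedChars: file.content.length - content.length,
    };
  });
}

// Helper for the example: a string of n filler characters.
function filler(n: number): string {
  return new Array(n + 1).join("x");
}

const loaded = applyCaps([
  { name: "SOUL.md", content: filler(500) },
  { name: "AGENTS.md", content: filler(26000) },
]);
// loaded[1].truncatedChars is 6000: the overflow is reported instead of dropped silently.
```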

The design is internally consistent for a personal-assistant agent and works well within that scope. The agent can still read other files at runtime via tools. The constraint is what auto-loads into every session's context.

The seven slots cover what a personal assistant should always have on hand: who I am, who I serve, what I can do.

The always-on-hand set is wider for other shapes of agent:

  • customer support: product documentation, incident history
  • research: project notes, paper citations
  • organizational: people directory, meeting context

None of it fits the seven named slots, and there's no additionalDirectories knob to add more.

An agent can fs_read these materials per-turn, but that converts ambient knowledge into conditional knowledge — something the model has to remember to fetch, every turn, at a tool-call cost.

The bootstrap layer was designed around one shape of agent and has no expansion path for any other.

#Unrestricted tool execution as a structural gap

Every system that executes actions needs an authorization layer. OpenClaw doesn't have one.

Issue #12565, opened February 2026, documents "Agent Runtime: Unrestricted Tool Execution Leading to Privilege Escalation," labeled with both bug and security, CVSS 4.5, CWE-862 (Missing Authorization). The mitigation in the public ecosystem is a third-party governance plugin written by a community contributor.

What matters is not the bug. It's the absence of a layer to prevent it — authorization is a first-class concern in any system that executes actions, and its absence at the runtime layer is structural, not incidental.
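What a runtime-level authorization layer means in practice can be sketched as a default-deny check that every tool execution must pass through. The `Policy`, `Principal`, and `ToolCall` shapes and the role names are hypothetical; this is neither OpenClaw's design nor the third-party plugin's.

```typescript
// Hypothetical policy model; roles and tool names are illustrative.
interface ToolCall {
  tool: string;
  args: Record<string, string>;
}

interface Principal {
  id: string;
  roles: string[];
}

// tool name -> roles allowed to invoke it. Tools missing from the map are denied.
type Policy = Record<string, string[]>;

function authorize(policy: Policy, principal: Principal, call: ToolCall): boolean {
  if (!Object.prototype.hasOwnProperty.call(policy, call.tool)) return false; // default deny
  const allowedRoles = policy[call.tool];
  return principal.roles.some((role) => allowedRoles.indexOf(role) !== -1);
}

// Every execution goes through the check; there is no path that skips it.
function execute<T>(
  policy: Policy,
  principal: Principal,
  call: ToolCall,
  run: (call: ToolCall) => T
): T {
  if (!authorize(policy, principal, call)) {
    throw new Error("unauthorized: " + principal.id + " may not call " + call.tool);
  }
  return run(call);
}

const policy: Policy = { fs_read: ["assistant", "admin"], shell_exec: ["admin"] };
const subAgent: Principal = { id: "sub-agent-1", roles: ["assistant"] };

const canRead = authorize(policy, subAgent, { tool: "fs_read", args: { path: "notes.md" } });  // true
const canShell = authorize(policy, subAgent, { tool: "shell_exec", args: { cmd: "whoami" } }); // false
const canUnknown = authorize(policy, subAgent, { tool: "db_drop", args: {} });                 // false
```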

#Model-version regressions accumulate when there's no place to handle them

A runtime with no mechanism to absorb model-version regressions surfaces them directly in production. OpenClaw's interaction with Claude Sonnet 4.6's 1M-token context window shows this concretely — three stacked memory-flush bugs, each traced to a specific GitHub issue, plus a community-submitted fix PR reportedly rejected as spam.

Models change. Behavior regresses. Without a runtime layer to detect and compensate, the breakage surfaces on the most-deployed long-context model — silent, recurring, and owned by no one.

#Eval ecosystems can fragment to the point of uselessness

If you delegate evals to the ecosystem, this is what you end up with. OpenClaw's main repository ships no internal eval harness; the community-built ecosystem consists of three projects with no shared methodology:

| Project | Status | Coverage |
| --- | --- | --- |
| Claw-Eval | ~340 stars, active | End-state Pass^3 grading, 139 tasks |
| openclaw-memory-bench | ~1 star, single developer | Memory retrieval IR metrics only |
| FinClaw skill-creator | Embedded in Chinese-language fork | Meta-eval inside skill creation |

#Where the research disagrees with itself

The disagreements are not noise — they define the open questions.

Episodic memory is a clear example. Pink et al. [6] argue it is a missing primitive with formal requirements. Letta's filesystem result [10] suggests a simpler mechanism captures most of the value. This is not a contradiction so much as a mismatch in scope: one is a research ideal, the other a production approximation.

Tool catalogs show a sharper divide. ALARA's evidence [2] is unambiguous: smaller is better. In practice, catalogs keep growing because the infrastructure required to make them dynamic is non-trivial. This is a gap between what works and what ships.

Kernel design is the deepest disagreement. OS-style systems — AIOS [7], AgentRM [1] — argue for explicit scheduling and control. Minimal SDKs like OpenAI's Agents SDK [11] push that responsibility outward. Both positions are internally consistent. The tradeoff is real and unresolved.


#A note on evals

A separate survey of the eval landscape (compiled April 2026) traced the practitioner consensus from Anthropic's published guidance, shipping engineers, and recent academic sources. The principles recur with high agreement:

  • Grade outcomes, not transcripts.
  • Use Pass^k rather than Pass@1 to eliminate one-shot luck.
  • Build a starting suite of 20–50 tasks derived from real failures, rather than 500 synthetic ones.
  • Use deterministic graders where possible, and binary, human-calibrated LLM-as-judge where not.
  • Keep capability evals separate from regression evals — they answer different questions.
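The difference between the two metrics is easy to see in code. In this sketch Pass@1 is taken as the share of tasks whose first trial passed and Pass^k as the share of tasks that passed all k trials; the data layout and function names are assumptions, and individual harnesses may define or estimate these metrics differently.

```typescript
// results[task][trial] is true when that trial's end state passed its grader.
type TrialResults = boolean[][];

// Pass@1 as used here: the share of tasks whose first trial passed.
function passAt1(results: TrialResults): number {
  const passed = results.filter((trials) => trials[0] === true).length;
  return passed / results.length;
}

// Pass^k: the share of tasks that passed on all of their first k trials.
function passHatK(results: TrialResults, k: number): number {
  const passed = results.filter((trials) => {
    if (trials.length < k) throw new Error("need at least k trials per task");
    return trials.slice(0, k).every((ok) => ok);
  }).length;
  return passed / results.length;
}

const results: TrialResults = [
  [true, true, true],    // reliable
  [true, false, true],   // flaky: counts for Pass@1, not for Pass^3
  [false, false, false], // consistently failing
  [true, true, true],    // reliable
];
const p1 = passAt1(results);     // 0.75
const p3 = passHatK(results, 3); // 0.5
```

The flaky task is the one that separates the two numbers, which is the one-shot luck the guidance is trying to remove.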

The fragmented OpenClaw ecosystem is the cautionary case for what happens when none of this lives at the runtime layer. A runtime that ships eval affordances from the first commit is doing operational work an external community project cannot.


#Takeaway

Compressed into constraints you could bet a runtime on:

  • Single-tier context will cost you twice — once on tokens you pay to carry forward uselessly (Mason's 21.8% structural waste is a floor, not a ceiling), and again when the window fills and the user's original goal ages out as "oldest."
  • A static tool catalog will be wrong for most turns. ALARA's 95%→25% accuracy collapse at eight tools is the tax, and "only use X if…" doesn't enforce anything.
  • Specialized memory APIs encode assumptions that break in production. Letta already ran that experiment and replaced theirs with a filesystem — if you ship one, budget for the same pivot.
  • Behavioral rules in prose are suggestions. Drift Bounds (Bhardwaj 2026) gives you a stability guarantee only when correction outruns drift — and correction is a runtime property, not a prompt.
  • Every kernel primitive is a two-year commitment. If you can't defend it at that horizon, it's a future deprecation surface.
  • Evals outside the runtime fragment. OpenClaw is the published version — three community projects, no shared methodology, no regression signal you can trust.
  • A bootstrap layer hardcoded to one agent shape won't fit another. Every non-matching agent pays per-turn tool-call costs for ambient knowledge that should have auto-loaded.

#References

  1. She, J. AgentRM: An OS-Inspired Resource Manager for LLM Agent Systems, arXiv:2603.13110 (2026). MLFQ scheduling; Context Lifecycle Manager; 40,000+ GitHub issues analyzed across six frameworks.
  2. Agostino, C.J. & D'Souza, N. ALARA for Agents: Least-Privilege Context Engineering Through Portable Composable Multi-Agent Teams, arXiv:2603.20380 (2026). The 95% → 25% accuracy curve at 1 → 8 tools across 22 models and 2,530 task executions.
  3. Mason, T. The Missing Memory Hierarchy: Demand Paging for LLM Context Windows, arXiv:2603.09023 (2026). 21.8% structural waste in production; Pichay demand-paging proxy.
  4. Hu, M. et al. HiAgent: Hierarchical Working Memory Management for Solving Long-Horizon Agent Tasks, arXiv:2408.09559 / ACL 2025. Subgoal-based working-memory compression decided by the LLM.
  5. Jiang, H. et al. SYNAPSE: Empowering LLM Agents with Episodic-Semantic Memory via Spreading Activation, arXiv:2601.02744 (2026). Spreading-activation memory with lateral inhibition; the Contextual Tunneling problem.
  6. Pink, M. et al. Position: Episodic Memory is the Missing Piece for Long-Term LLM Agents, arXiv:2502.06975 (2025). Five formal properties of episodic memory.
  7. Mei, K. et al. AIOS: LLM Agent Operating System, arXiv:2403.16971 (2024). The original LLM-OS framing; five kernel primitives.
  8. Bhardwaj, V.P. Agent Behavioral Contracts: Formal Specification and Runtime Enforcement for Reliable Autonomous AI Agents, arXiv:2602.22302 (2026). Drift Bounds Theorem; 5.2–6.8 violations per session caught at under 10 ms enforcement overhead.
  9. Zheng, Y. et al. AgentCgroup: Understanding and Controlling OS Resources of AI Agents, arXiv:2602.09345 (2026). Tool-call granularity; OS-level execution as 56–74% of end-to-end latency.
  10. Letta. Our Next Phase, Letta blog (March 2026). Deprecation of specialized memory APIs in favor of git-backed files; references Letta's August 2025 LoCoMo result (74.0% on a plain filesystem).
  11. OpenAI. New tools for building agents, OpenAI blog (March 2025). Three primitives (Agent, Handoff, Guardrail) with the explicit "very small set" design commitment.
  12. OpenClaw adversarial review. A structured review compiled from public GitHub issues (Issues #54623, #12565, #52899) and operator-side documentation of bootstrap-file constraints, memory-flush bugs in long-context configurations, and the fragmented eval ecosystem (Claw-Eval, openclaw-memory-bench, FinClaw).

A separate eval-landscape survey (April 2026) draws on Anthropic's published eval guidance, multiple practitioner blog posts, and the structure of Claw-Eval, openclaw-memory-bench, and the FinClaw skill-creator loop. Full per-paper provenance, including verification status and direct quotes, is in the augment-1 repo at docs/research/research-provenance.md.