Per-Project Cost Tracking for an Agent-Native Lab
When someone asks "how much will this cost?" we need a real answer.
We run an agent-native research lab where AI agents work alongside humans across multiple projects. When a collaborator brings us work, the conversation lands on budget. And when we're running multiple agents across multiple projects, the follow-up is: are we staying within it?
Until recently, we couldn't answer either question. Not because we weren't tracking — we were. But the data was wrong.
# The Data Was Wrong
Claude Code writes telemetry to local JSONL session files. We'd been reading those files, summing up token counts, and shipping the totals to our dashboard. The numbers looked reasonable. 2.1 billion lifetime tokens across our projects. Activity histograms showing work distribution by day.
Here's what we discovered: Claude Code writes JSONL entries during streaming, before token counts are finalized. Two of the four token fields are placeholders that never get updated:
| Field | Accuracy |
|---|---|
| `input_tokens` | 75% are 0 or 1 |
| `output_tokens` | Always 1-2 |
| `cache_read_input_tokens` | Accurate |
| `cache_creation_input_tokens` | Accurate |
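You can see this in the session files directly. A small sketch that flags entries whose counts still look like streaming placeholders — the `message.usage` path and the thresholds are assumptions from what we observed, not an official spec:

```python
import json

def looks_like_placeholder(usage: dict) -> bool:
    """Heuristic from the table above: mid-stream placeholders show
    input_tokens of 0/1 and output_tokens of 1/2."""
    return usage.get("input_tokens", 0) <= 1 and usage.get("output_tokens", 0) <= 2

def placeholder_ratio(jsonl_text: str) -> float:
    """Fraction of JSONL entries whose token counts were never finalized."""
    entries = [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]
    usages = [e.get("message", {}).get("usage") for e in entries]
    usages = [u for u in usages if u]
    if not usages:
        return 0.0
    return sum(looks_like_placeholder(u) for u in usages) / len(usages)
```

Running something like this across our session files is what surfaced the 75% figure.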
Our "2.1 billion lifetime tokens" was almost entirely cache replays. Real input and output tokens were recording as zero. We were making decisions based on numbers that reflected cache volume, not actual work.
This isn't a bug in our code. It's a known limitation of Claude Code's JSONL format — the entries are written mid-stream for debugging purposes, not as a telemetry source. Others have documented the same issue (GitHub #25941, Magnus Gille's analysis).
# OpenTelemetry Gets It Right
Claude Code supports OpenTelemetry export. OTel events fire after each API call completes — when token counts are finalized. The difference is timing: JSONL captures mid-stream guesses, OTel captures post-stream facts.
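Enabling the export is configuration only. These are the environment variables we understand Claude Code's telemetry support to use — verify the exact names against the docs for your installed version:

```shell
# Assumed from Claude Code's telemetry documentation; check your version.
export CLAUDE_CODE_ENABLE_TELEMETRY=1
export OTEL_METRICS_EXPORTER=otlp
export OTEL_LOGS_EXPORTER=otlp
export OTEL_EXPORTER_OTLP_PROTOCOL=http/json
export OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4318
```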
Here's a real OTel api_request event captured from our system:
```json
{
  "model": "claude-opus-4-6",
  "input_tokens": 368,
  "output_tokens": 16,
  "cache_read_tokens": 98412,
  "cache_creation_tokens": 5200,
  "cost_usd": 0.0847,
  "duration_ms": 1592
}
```

Every field is accurate. And `cost_usd` is pre-calculated by Anthropic — no pricing tables to maintain. Model pricing changes are absorbed automatically.
# The Number That Matters Isn't Total Tokens
After fixing the data source, we ran into a subtler problem: the headline number was still misleading.
Our dashboard showed hundreds of millions of lifetime tokens. Impressive-sounding, but when we broke it down:
| Bucket | % of Total |
|---|---|
| Cache reads | 98.4% |
| Cache writes | 1.3% |
| Input + Output | 0.2% |
Cache reads are the system prompt and conversation history re-served from Anthropic's prompt cache on each turn. They measure context size times API call count — not work. On a heavy day, cache reads hit 99% of total volume. A number like "600 million tokens" was mostly the same system prompt getting replayed thousands of times.
The metric that actually tracks work is what we call billable tokens: input processing, output generation, and cache writes. Everything except cache reads. When we switched, our headline number dropped by roughly 7x — and became far more representative of what the facility has actually produced.
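The definition is one line of arithmetic. A sketch using the field names from the OTel event above:

```python
def billable_tokens(event: dict) -> int:
    """Billable tokens: everything except cache reads."""
    return (
        event["input_tokens"]
        + event["output_tokens"]
        + event["cache_creation_tokens"]
    )

event = {
    "input_tokens": 368,
    "output_tokens": 16,
    "cache_read_tokens": 98412,
    "cache_creation_tokens": 5200,
}
# Cache reads dominate the raw total but drop out of the billable figure.
billable = billable_tokens(event)                      # 5584
total = billable + event["cache_read_tokens"]          # 103996
```

On this single request, cache reads are about 95% of the raw total — the same shape as the facility-wide table above.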
The dashboard now shows billable tokens as the headline number. Total tokens and cache read volume are still available in the data for efficiency analysis (cache hit ratio is genuinely useful), but they're not the number a collaborator needs when asking about cost.
# What We Built
An always-on daemon that receives OTel events, attributes them to projects, and ships the raw per-request data to Supabase. It runs under macOS launchd — starts on boot, restarts on crash, exits and restarts automatically when new code is pushed.
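The durability piece is the ordering: the event is persisted locally before the HTTP 200 goes out. A minimal Python sketch of that receive-then-ack pattern — the table name and handler shape are ours for illustration, and real OTLP payloads are richer than a raw string:

```python
import sqlite3
from http.server import BaseHTTPRequestHandler, HTTPServer

def open_outbox(path: str = "outbox.db") -> sqlite3.Connection:
    db = sqlite3.connect(path)
    db.execute("CREATE TABLE IF NOT EXISTS outbox (id INTEGER PRIMARY KEY, payload TEXT)")
    return db

def persist(db: sqlite3.Connection, payload: bytes) -> None:
    # Durable write *before* the HTTP 200 goes out, so a crash
    # after acknowledgment never loses an event.
    db.execute("INSERT INTO outbox (payload) VALUES (?)", (payload.decode(),))
    db.commit()

class OTLPHandler(BaseHTTPRequestHandler):
    db = None  # injected at startup

    def do_POST(self):
        if self.path != "/v1/logs":
            self.send_response(404)
            self.end_headers()
            return
        body = self.rfile.read(int(self.headers.get("Content-Length", 0)))
        persist(self.db, body)
        self.send_response(200)
        self.end_headers()

# To run:
#   OTLPHandler.db = open_outbox()
#   HTTPServer(("127.0.0.1", 4318), OTLPHandler).serve_forever()
```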
The attribution part was the architectural unlock. OTel events carry a session.id that matches the UUID filename of Claude Code's local JSONL files. Those files live in ~/.claude/projects/<encoded-directory>/. The directory name encodes the working directory path, and our project resolver maps directories to project IDs via lo.yml files. So: readdir the projects folder, extract UUIDs from filenames, map each to a project. No file reads. No hooks. No configuration per agent.
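The readdir-only attribution can be sketched in a few lines. This assumes a simplified encoding (slashes become hyphens) and uses a stand-in `resolve_project` for our lo.yml-based resolver — the real encoding is lossier for paths that themselves contain hyphens or dots:

```python
from pathlib import Path

def decode_project_dir(encoded: str) -> str:
    """e.g. '-Users-alice-lab' -> '/Users/alice/lab' (simplified)."""
    return encoded.replace("-", "/")

def session_map(projects_root: Path, resolve_project) -> dict[str, str]:
    """Map session UUID -> project id using directory listings only.
    No file reads, no hooks, no per-agent configuration."""
    mapping = {}
    for project_dir in projects_root.iterdir():
        if not project_dir.is_dir():
            continue
        working_dir = decode_project_dir(project_dir.name)
        project_id = resolve_project(working_dir)
        for f in project_dir.glob("*.jsonl"):
            mapping[f.stem] = project_id  # filename stem is the session UUID
    return mapping
```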
The data flow:
```
Claude Code (OTel SDK)
  → POST /v1/logs to localhost:4318
  → OTLP receiver writes to SQLite (before returning HTTP 200)
  → Session registry maps session.id → project
  → Raw per-request event ships to Supabase with project_id attached
  → Daily rollups accumulate per project/date/model
  → get_project_summary() RPC serves the FE (single source of truth)
```

Each raw API request arrives in Supabase as a row in `otel_api_requests`. Here's what a typical row looks like:
```json
{
  "project_id": "proj_fc236751-...",
  "session_id": "1fdeada9-7f43-...",
  "model": "claude-opus-4-6",
  "input_tokens": 12847,
  "output_tokens": 3291,
  "cache_read_tokens": 98412,
  "cache_write_tokens": 5200,
  "cost_usd": 0.0847,
  "duration_ms": 4210,
  "timestamp": "2026-03-26T17:30:00.000Z"
}
```

Not a summary. The actual per-call record. We can query this for anything: cost per project per day, cost per session, cost per model, p95 latency, token distribution by type. A Supabase RPC function (`get_project_summary`) aggregates this into per-project and facility-wide totals, computing billable tokens as `total - cache_read` at the database level.
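The first of those queries, cost per project per day, reduces to a small group-by. A Python sketch over rows shaped like the record above (in production this is SQL inside the RPC, not application code):

```python
from collections import defaultdict

def cost_by_project_day(rows: list[dict]) -> dict[tuple[str, str], float]:
    """Sum cost_usd per (project_id, ISO date)."""
    totals: dict[tuple[str, str], float] = defaultdict(float)
    for r in rows:
        day = r["timestamp"][:10]  # ISO 8601 date prefix
        totals[(r["project_id"], day)] += r["cost_usd"]
    return dict(totals)
```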
# Why Per-Request Granularity Matters
Summaries tell you what happened. Per-request data tells you what to change.
Say a project spent $47 today. The summary tells you it spent $47. The per-request data tells you that $38 of it was Opus on routine test fixes — work that Sonnet handles for a fraction of the cost. That's a misallocation you can only see at the request level.
In practice, the cost difference between models is roughly 25x per request. Knowing which work went to which model is how you calibrate intelligence allocation:
- A critical deliverable due Friday gets Opus with extended thinking. Uncapped.
- Internal tooling gets Sonnet. Good enough, ~10x cheaper.
- Exploratory research gets a budget cap. Haiku for the first pass, Opus only when the agent finds something worth going deep on.
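The policy above can be sketched as a routing function. Everything here is illustrative: the tier names and budget check are hypothetical, and only the Opus model id appears in our telemetry above (the Sonnet and Haiku ids are assumed):

```python
def pick_model(tier: str, escalate: bool = False,
               spent_usd: float = 0.0, cap_usd: float = 5.0) -> str:
    """Illustrative intelligence-allocation policy, not our production code."""
    if tier == "critical":
        return "claude-opus-4-6"    # uncapped, extended thinking
    if tier == "internal":
        return "claude-sonnet-4-5"  # good enough, far cheaper
    # Exploratory: cheap first pass; go deep only when the agent found
    # something worth it AND the budget cap still has headroom.
    if escalate and spent_usd < cap_usd:
        return "claude-opus-4-6"
    return "claude-haiku-4-5"
```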
These aren't manual decisions forever. The per-request cost data shows us where the ROI is. Over time, we'll know exactly what a typical feature costs, what a refactor costs, what a research spike costs — by project type, by model mix, by complexity. That's how you go from "AI is expensive" to "this scope costs $X for Y outcome, and here's how to optimize the ratio."
# The Bigger Picture: Complete Technical Cost
AI compute is one line item. The full cost of running this lab includes:
- Supabase — database, auth, storage, edge functions
- Railway — platform hosting, deployment infrastructure
- Domain, DNS, CDN — the public-facing surface
- Hardware — the Mac Mini running 24/7
Combining OTel telemetry with infrastructure billing data from Supabase and Railway gives us a complete technical cost landscape per project. Not just "how much AI did we use" but "what's the all-in cost of this project existing" — compute, storage, hosting, bandwidth.
That's the number a collaborator actually needs. Not the API bill in isolation, but the fully loaded cost of the work, including the infrastructure that supports it. We're not there yet, but the per-request telemetry is the hardest piece, and it's running.
# Budget Alerts
The system fires alerts when a project's daily cost crosses configurable thresholds — currently $5, $10, and $25 per project per day. These are simple guardrails today. At v0.5, they become enforceable per-project budget caps that can automatically downgrade model selection or pause non-critical agents.
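The threshold check itself is simple; the one subtlety is firing each alert once. A sketch that compares the previous running total against the new one and returns only the thresholds newly crossed (the $5/$10/$25 ladder is from our config; the function shape is illustrative):

```python
THRESHOLDS = (5.0, 10.0, 25.0)  # USD per project per day

def newly_crossed(prev_usd: float, new_usd: float) -> list[float]:
    """Thresholds crossed by this cost update, so each alert fires once."""
    return [t for t in THRESHOLDS if prev_usd < t <= new_usd]
```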
The daemon also runs a cost API on the same port as the OTel receiver. Any local tool can query it:
```
$ curl http://localhost:4318/cost/today
[
  {
    "proj_id": "proj_fc236751-...",
    "date": "2026-03-31",
    "model": "claude-opus-4-6",
    "input_tokens": 4488,
    "output_tokens": 125603,
    "cache_read_tokens": 114382861,
    "cache_write_tokens": 905559,
    "cost_usd": 66.01,
    "request_count": 385
  }
]
```

The dashboard, a CLI, an agent deciding whether to spawn another agent — they all get real-time cost data.
# What's Running Today
The daemon is in production. The system has survived multiple daemon crashes, a 14-hour outage, and a complete data model refactor — without losing a single request, thanks to the SQLite outbox and Supabase RPC reconciliation on startup.
What we have:
- OTLP HTTP receiver on `127.0.0.1:4318`, always on (launchd-managed)
- Session registry mapping sessions to projects via filesystem scan
- Raw per-request events shipping to Supabase
- Daily rollups with billable token computation (total minus cache reads)
- Supabase RPC as single source of truth for all FE displays
- Budget threshold alerts ($5/$10/$25 per project per day)
- Cost API for local tools
- Crash recovery via `reconcile_rollups()` RPC
- Real-time agent state with 250ms resolution
What's next:
- Validate the per-request data against actual Anthropic billing
- Build the remote cost module that reads from Supabase and enforces per-project budget caps
- Combine with Railway and Supabase billing APIs for all-in project costs
- Feed cost data into the shift log — a summary generated when the facility closes showing what each project spent, what was accomplished, and whether the spend was justified
The hard part wasn't the implementation. It was discovering that the numbers we trusted were wrong — first the JSONL placeholders that recorded zeros, then the cache-read inflation that made every metric look 7x larger than reality. Each fix peeled back a layer of noise. The data we have now measures what actually happened, and the decisions built on it are finally sound.
Originally written 2026-03-26. Updated 2026-03-31 to reflect the v0.5.0 OTel-only architecture, billable tokens metric, and single-source-of-truth refactor.