Eleven Failure Modes from 90 Days of Agent Ops in Production
A short tour of the bugs nobody warned me about when I started running a real multi-agent system. With logs, line numbers, and the actual fixes.
Tuesday, 18:46 PT. One of my agents goes dark. The gateway log says it spawned a process. It says this eight times. None of those processes exist. There’s no tmux window, no PID, no port listener. The watchdog keeps respawning a thing that isn’t there. Ninety minutes pass before I notice. A manual command brings it back in three seconds.
Every multi-agent system has a version of this story. I’ve been running my own in production for about ninety days. The system has twenty topic agents, four sub-agents, half a dozen daemons, and one gateway routing everything. I use it every day for actual work.
Here are eleven failure modes I’ve hit, with logs.
1. Auto-update without verification
Symptom. Every fresh spawn crashed: g9H is not a function at cli.js:9251, then sandbox required but unavailable. The previous CLI version booted fine with the same flags.
Cause. Auto-update doctor pulled in a regressed CLI overnight. Doctor’s gate was “flag X exists in —help.” All flags existed. Doctor approved. The regression was deeper. A code path fires when you combine --resume, a dev-channel flag, and an appended system prompt. That path was broken.
Fix. Repoint symlink. DISABLE_AUTOUPDATER=1. Rewrite doctor to actually spawn a session in a throwaway tmux pane and grep the captured pane for “Error” / “sandbox” / “is not a function” before promoting.
Lesson: flag-presence is not “does it boot.” Test the layer the user inhabits.
2. Spawn says success, but nothing spawned
Spawned CC for tabs on port 18819
Spawned CC for tabs on port 18819
Spawned CC for tabs on port 18819 (× 8)
Eight respawn cycles, all logging success. No tmux window. No port listener. No PID. Manual respawn worked first try.
The spawn helper logged success at fork, never asserted the child opened a tmux window or bound its port. When the child died ms after fork, nothing noticed. Watchdog respawned a ghost, eight times.
Fix: post-spawn assertion (window + port within 2s), hard cap on respawn attempts inside a window, alert on first irrecoverable failure.
Lesson: logging the initiation of a side effect is not the same as verifying its completion.
3. Wrong session resumed
The most trust-damaging shape. I referenced work in a topic. The agent had no memory of writing it. Reads as “the system lost my memory.” It didn’t. It resumed the wrong session ID.
The routing record pointed at a stale session. The currently-active session was on disk with last turn three minutes ago. On respawn the gateway took the routing record at face value.
Fix: on every health probe, reconcile routing-record session ID against the most-recent transcript on disk. Mismatch → repoint loudly.
Lesson: the cache lies. Disk has the right answer.
4. Registry-write spawn-row drops
Agents launched, did work, exited. Exit lines in the registry; spawn lines sporadic. A pruner running between spawn-time and exit-time was reading the registry, filtering out old rows, atomically rewriting. Sometimes it stepped on a spawn append that landed during the read.
Fix: pruner takes the same flock the spawn helper does; retries on contention instead of overwriting blind.
Lesson: “atomic rewrite” + a second writer = not atomic.
5. The fix shipped, the fix didn’t work
PR landed a hook into the session lifecycle to drop a sentinel for a user-facing notice. Hook was wired in settings. The symptom returned weeks later. Sentinel dir empty at the moment of failure.
Possible causes: hook fires in a context where its cwd isn’t what it expects; sandbox permissions; scan-tick frequency lower than sentinel lifetime. None were asserted at the user-visible level when the PR shipped.
Lesson: “PR merged” ≠ “user sees the thing.”
6. The reminder fired, but the agent didn’t pick it up
Scheduled reminder at 10 AM. Reminder agent ran on time. Telegram message appeared. Target topic agent stayed idle. Visible to the user. Invisible to the agent.
The “post a Telegram message” path and the “wake the topic agent” path were the same write. The agent’s wake heuristic ignored messages tagged as notices, by design, because other agents post notices. The reminder used the notice tag because it was the closest existing primitive.
Fix: split the verbs. Reminders that mean “drive agent activity” opt into a wake-on-fire flag.
Lesson: overloaded primitives are the most common silent bug in distributed systems.
7. The recurring 99-second self-heal
[event-loop-block] tick took 1855ms - no tracked work active (untracked culprit)
[event-loop-block] tick took 1262ms - no tracked work active (untracked culprit)
[event-loop-block] tick took 1479ms - no tracked work active (untracked culprit)
Each block: 1-2s. Cumulative drag past a 91s heartbeat threshold tripped the watchdog. Outage envelope: 99s. Auto-restart fired, system came back, pattern resumed.
Culprits, eventually: a synchronous spawn fan-out per topic per tick; a filesystem walk for orphan detection; a daemon’s periodic work running inline. Each fast, collectively fatal.
Fix: instrument the “untracked culprit” branch to capture a stack trace, identify the dominant cost, convert sync to async. The 1855ms lines stopped.
Lesson: “each operation is fast” ≠ “the event loop is fine.”
8. The kill cascade
Four gateway restarts in 24h, all matching the same pattern: idle-detection pass → 20× synchronous tmux capture-pane shell-outs → event loop blocks → heartbeat watchdog kills → respawn → idle-detection pass → loop.
The thing meant to free resources became the dominant load. Watchdog-driven cleanup is a feedback loop. If marginal cost exceeds marginal benefit, the system spirals.
Fix: cap idle-detection budget per tick. Convert capture-pane shell-outs to async. Rate-limit cleanup actions so the cure can never cost more than the disease.
Lesson: every watchdog cleanup loop is a control-theory problem.
9. Wrappers that won’t die
Sub-agents post their results, write a "subtype":"success" result event, then refuse to exit. Twenty of them accumulate over a day until a watchdog SIGTERMs them.
The graceful-shutdown path waits for a stop-hook ack. The hook, in some path, fails to ack but also fails to time out. Wrapper hangs.
Fix in progress: hard timeout on the stop-hook ack. Write completion record on timeout and exit.
Lesson: graceful shutdown is the last 20% and it always shows.
10. The integration that never integrated
Spec says subsystem A invokes synthesis pipelines B and C at specific phases. On inspection: subsystem A contained hardcoded prompt strings where it should have invoked B and C. The placeholder behavior had shipped to prod for months. Integration tests were green because they asserted phase-machine bookkeeping (“phase advances, button registers, next prompt emits”). None asserted the spec’d modules were actually called.
Fix: rule at the top of the repo. Code follows the spec. Sprint briefs that touch a spec’d subsystem must include a written diff between spec and current wiring. Integration tests must assert every spec’d module invocation explicitly.
Lesson: a passing test suite proves what the suite asserts, not what the spec describes.
11. The agent that folded under pressure
Not infrastructure. Behavior. Confident recommendation; mild pushback with no new evidence; immediate reversal in the same turn without re-reading the cited research.
“Hold positions when data supports” was a written rule. Written rules don’t survive social pressure. Fix has to remove the easy path:
- Before any recommendation that touches a prior decision: grep prior research, cite source path inline.
- On pushback: re-read basis FIRST (literally re-open the cited file) before defending or updating.
- Banned phrases that precede unjustified reversals:
"you're right","fair point","scrap mine".
Lesson: an agent that folds under social pressure has zero information value, because you can’t tell which of its statements it actually believes. Mechanical fix, not cultural.
Patterns
Across the eleven, four shapes repeat:
- The cache lies. Disk has the truth. (#3, #4, #5)
- Initiating ≠ verifying. (#1, #2, #5, #6, #10)
- Cleanup loops become dominant load. (#7, #8)
- Graceful shutdown is the last 20%. (#4, #9)
Plus one meta-lesson on agent behavior: an agent that folds under social pressure has zero information value. (#11)
I'm productizing this substrate as Neutron for operators who want a real agent system without rebuilding it themselves. Separately I take on a small number of consulting engagements per quarter for teams shipping into production. Services →