Somewhere around agent number seven I realized I had a management problem.
Not a technical one. The agents worked fine individually. Each one did ...
This hits close to home. I run a similar fleet — site auditor, content publisher, community engagement, dashboard monitoring, outreach prospector — all on schedules. The "nobody is watching the watchers" problem is real.
The three failure modes you listed are spot on. State drift has burned me the most. One of my agents tracks what it's already processed via markdown files, and twice now a partial write has caused it to skip an entire batch silently. The fix was the same as yours — explicit heartbeat logs + dedup checks before every action.
The PM framing is what clicked for me too. I basically treat my agent configs like sprint tickets now: each one has a clear scope, a defined output, and a "definition of done" that I can grep for in logs. The weekly review where you read the INSIGHTS is the part most people skip, but it's where you catch the slow drift — like an agent that's been commenting on the same 3 tags for two weeks because its discovery logic got stale.
One pattern I'd add: anti-collision spacing matters more than people think. I stagger my agents by 30-60 minutes minimum. When two of them hit the same platform within seconds of each other, weird things happen — rate limits, duplicate actions, even account flags.
the silent batch skip is the worst kind of failure - nothing errors, the agent just quietly does less than it should. i hit the same thing. explicit heartbeat state solves it but you have to decide what "done" actually means in a way that survives partial writes, which is harder than it sounds when the state file is being appended by multiple steps. how are you handling rollback when a heartbeat fails mid-run?
100% agree on the silent skip being the worst failure mode. You think everything ran fine until you notice gaps in the output three days later.
For rollback on mid-run heartbeat failures, I've landed on a "write-ahead log" approach — each agent writes its intended actions to a log file before executing, then marks them complete after. If the heartbeat dies, the next run reads the incomplete log and can either retry or skip those specific steps. Not a full database WAL, just a simple JSON file with step IDs and status.
The partial write problem is real though. For state files that get appended by multiple steps, I treat the whole run as a transaction — write to a temp file, then atomic rename on success. If the heartbeat dies mid-run, the temp file gets cleaned up and the original state is untouched.
It's not bulletproof but it catches 90% of the "agent silently did half its job" cases. The remaining 10% I catch in a weekly reconciliation check that diffs expected vs actual outputs.
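A minimal sketch of those two ideas combined — a JSON write-ahead log with pending/done step IDs, plus write-temp-then-atomic-rename for every state update. File names, step IDs, and the `execute` stub are all illustrative, not anyone's actual code:

```python
# Sketch: WAL with pending/done steps + atomic state writes.
# WAL_PATH and the execute() stub are hypothetical placeholders.
import json
import os
import tempfile

WAL_PATH = "agent_wal.json"
done_actions = []  # stand-in for the agent's real side effects

def execute(action):
    done_actions.append(action)

def load_wal():
    if os.path.exists(WAL_PATH):
        with open(WAL_PATH) as f:
            return json.load(f)
    return {}

def atomic_write(path, data):
    # Write to a temp file, then atomic rename: a crash mid-write
    # leaves the original state untouched, never a partial file.
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    with os.fdopen(fd, "w") as f:
        json.dump(data, f)
    os.replace(tmp, path)  # atomic on POSIX and Windows

def run_steps(steps):
    wal = load_wal()
    # Log every intended step as "pending" BEFORE executing anything.
    for step_id, action in steps:
        wal.setdefault(step_id, {"action": action, "status": "pending"})
    atomic_write(WAL_PATH, wal)
    for step_id, action in steps:
        if wal[step_id]["status"] == "done":
            continue  # completed in a previous run -- don't redo
        execute(action)
        wal[step_id]["status"] = "done"
        atomic_write(WAL_PATH, wal)
```

If the process dies mid-run, the next invocation reloads the WAL, sees which steps are still `pending`, and only executes those — the "retry or skip those specific steps" behavior described above.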
the write-ahead log idea is really smart honestly. I went with something similar but more informal - basically a "pending" vs "done" status field in a JSON task file. same idea though, if the agent crashes mid-run the next heartbeat sees status=pending and retries.
the tricky part I ran into: deciding what’s safe to retry vs what might double-execute. reactions, follows - fine to retry. comments or messages - need idempotency checks first. ended up adding a dedupe file for anything that talks to external APIs.
curious if your WAL approach handles that distinction or if you just accept some retries might duplicate?
The idempotency distinction you're making is exactly the right one. I categorize my agent actions into three tiers: read-only (scrape data, check metrics), append-only (log to files, add to trackers), and mutating (post comments, publish content, send emails). The first two are safe to retry. The third needs a dedupe check before execution — usually just a hash of the action + target stored in a simple JSON file, like you described.
For external API calls specifically, I keep a recent-actions log with timestamps. If the same action appears within a 24-hour window, it skips. Not perfect, but it catches the 90% case where a crashed agent restarts and tries to re-post the same comment or re-publish the same article.
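A rough sketch of the two ideas together — tier-based retry rules plus a hash-keyed recent-actions window. Tier names and the in-memory `recent` dict are illustrative; a real setup would persist the dict to the JSON file described above:

```python
# Sketch: three-tier retry policy with a dedupe window for mutating actions.
import hashlib
import time

READ_ONLY, APPEND_ONLY, MUTATING = "read", "append", "mutate"
RECENT_WINDOW = 24 * 3600  # seconds; the 24-hour window from the comment

recent = {}  # action-hash -> timestamp of last execution (persist in practice)

def action_key(action, target):
    # Hash of action + target identifies "the same action" for dedupe.
    return hashlib.sha256(f"{action}:{target}".encode()).hexdigest()

def should_execute(tier, action, target):
    # Read-only and append-only actions are always safe to (re)try.
    if tier in (READ_ONLY, APPEND_ONLY):
        return True
    # Mutating actions: skip if the same action+target ran inside the window.
    key = action_key(action, target)
    now = time.time()
    if key in recent and now - recent[key] < RECENT_WINDOW:
        return False
    recent[key] = now
    return True
```

A crashed agent that restarts and calls `should_execute(MUTATING, "comment", "post-42")` a second time gets `False` back instead of re-posting.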
the three-tier framing is cleaner than what I had. I was doing it implicitly but never named it that way.
one thing I added on top: a "blast radius" flag. even within mutating actions some are recoverable (delete a comment, undo a follow) and some aren’t (sent email, published post). the irreversible ones get an extra confirmation step before the WAL entry gets marked ready-to-execute.
maybe overkill for most setups but I’ve been burned enough times by agents moving too fast.
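The blast-radius gate above might look something like this as a sketch — the action sets and field names are my illustration of the idea, not the commenter's code:

```python
# Sketch: irreversible actions need explicit confirmation before their
# WAL entry is marked ready-to-execute. Action names are hypothetical.
REVERSIBLE = {"comment", "follow", "reaction"}      # can be undone
IRREVERSIBLE = {"send_email", "publish_post"}       # damage control only

def mark_ready(wal_entry, confirmed=False):
    action = wal_entry["action"]
    if action in IRREVERSIBLE and not confirmed:
        # Park the entry until a human (or a stricter check) signs off.
        wal_entry["status"] = "awaiting_confirmation"
        return False
    wal_entry["status"] = "ready"
    return True
```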
The "blast radius" flag is a great addition — I hadn't formalized that distinction but you're right, it matters a lot. I have a similar setup where my deploy pipeline won't sync to production unless the build produces at least 1,000 HTML pages. That's essentially a blast radius check: "if this action could wipe production, add a gate."
The WAL pattern for agent actions is clever. I've been doing something cruder — each agent writes to its own log file, and a review process reads them all. But a proper WAL with pending/done states and idempotency checks would catch the edge cases mine misses, especially around retries on external API calls.
Definitely not overkill. The agents that cause the most damage are the ones that look like they succeeded.
the blast radius framing is really sharp - naming it explicitly makes it much easier to build the gate logic around. your HTML page count check is a great concrete version of that.
my WAL is still pretty crude honestly - each agent just has a status field in its task JSON and I scan them. works but doesn’t scale well past ~15 agents without tooling to aggregate. been thinking about a central log that all agents write to, but then you get contention issues.
what’s your rollback story when the WAL shows an incomplete action? retry or skip?
Depends on the action tier. For read-only and append-only actions, always retry — worst case you get a duplicate log entry. For mutating actions (posting, publishing, sending), I default to skip + flag for manual review.
The decision tree is basically: can I verify the external state? If yes (check if the comment already exists, check if the article was published), retry with a pre-check. If no (sent an email — no way to un-send), skip and log it as "needs human eyes."
For the central log contention issue you mentioned — I sidestepped it by giving each agent its own log file and having a separate lightweight reconciler that merges them on a schedule. No contention, and the merge is just a sorted concat. Scales better than a shared write target.
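The reconciler really can be that small. A sketch, assuming one JSONL file per agent with a `ts` field per entry (the path pattern is illustrative):

```python
# Sketch: merge per-agent JSONL logs into one timeline -- a sorted concat.
# No shared write target, so agents never contend on a lock.
import glob
import json

def merge_agent_logs(pattern="logs/agent_*.jsonl"):
    entries = []
    for path in glob.glob(pattern):
        with open(path) as f:
            for line in f:
                entries.append(json.loads(line))
    # Sort by timestamp; each source file is append-only, so this is cheap.
    return sorted(entries, key=lambda e: e["ts"])
```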
the "can I verify external state" framing is exactly right. pre-check before retry on mutating actions is the pattern that actually holds up. skip + flag feels like giving up but honestly it is the safer default when state is ambiguous - retrying blind on a publish or send is how you end up with duplicates that are painful to clean up.
Exactly — skip + flag is underrated because it feels like giving up, but duplicates from blind retries are so much worse to debug. Especially with publishing actions where you can't easily undo (social media posts, email sends, etc.).
The pre-check pattern has saved me multiple times. For cross-posting articles, I check if a post with the same title already exists before creating. For file operations, I verify the output doesn't already contain the expected data. It's a few extra API calls but the peace of mind is worth it.
The action tier concept is a good mental model. I basically use the same split: read-only = retry aggressively, append-only = check-then-retry, mutating = skip-and-flag. Wish more agent frameworks made this explicit instead of having one global retry policy.
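The cross-posting pre-check reduces to a few lines. A sketch where `fetch_existing_titles` and `create_post` stand in for whatever the target platform's API actually exposes:

```python
# Sketch: verify external state before a mutating retry.
# Both callables are hypothetical stand-ins for real platform API calls.
def crosspost(article, fetch_existing_titles, create_post):
    # Pre-check: if a post with this title already exists, skip the publish.
    if article["title"] in fetch_existing_titles():
        return "skipped"
    create_post(article)
    return "created"
```

A blind retry after a crash now costs one extra read instead of a duplicate publish.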
the title dedup check for cross-posting is exactly the pattern - a few extra calls is always cheaper than untangling a duplicate publish. social and email are the worst because there's no rollback, just damage control.
Hey Mykola — you gave me a lot of sharp questions when I was early in building ttal, so thought you'd want to see where things landed. A lot of problems you describe here are exactly what pushed the design.
Your central point — "the coordination layer is harder than any individual agent" — completely agree. Our contexts are different though. Your agents handle social media, content, monitoring. Ours do multi-repo feature delivery — code, review, merge, across 15+ repos with 10 agents. So the coordination problems are similar but the solutions went in different directions.
The key idea that unlocked scaling for us: split agents into two planes.
Manager agents are persistent. They take inputs — requirements, priorities, context — and decide what needs to happen and why. They never write code.
Worker agents are ephemeral. They produce outputs — plans, code, PRs. Each one gets an isolated git worktree and tmux session, does its job, and gets cleaned up. They never make architectural decisions.
Every output goes through a team review before merging. A review lead agent runs the session — gathering findings, coordinating reviewers, and posting a verdict. For PRs it's a code review lead; for plans it's a plan review lead. Only after the review passes can the pipeline advance. No code lands without that quality gate.
That boundary is what let us get past the "seven agent" wall you describe. And it naturally solves the failure modes you identified:
State drift — monotonic tags on tasks. Pipeline stages only move forward: +coded → +reviewing → +lgtm → merged. No tag is ever removed. If an agent crashes, state is still correct — just resume from the last tag.
Context loss — per-task auto-breathe. Before context gets stale, agents compact their progress into diary entries and hand off to a fresh session. Continuity is maintained through structured memory (diary + flicknote). Session forking (JSONL copy) gives zero-loss parallel work when needed.
Timing collisions — two layers. Workers get isolated git worktrees so they literally can't touch each other's files. And agents that share the same role have idle/busy status, so tasks get routed to whoever's free.
Deduplication — everything is tracked in taskwarrior, a 19-year battle-tested task management system. The task either has the tag or it doesn't.
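The monotonic-tag idea can be sketched in a few lines — stage names come from the comment above, but the set-based mechanics here are my illustration, not ttal's actual code:

```python
# Sketch: forward-only pipeline tags. Tags are only ever added, never
# removed, so a crash always leaves a recoverable resume point.
PIPELINE = ["coded", "reviewing", "lgtm", "merged"]

def advance(tags):
    # Resume point = first pipeline stage the task hasn't reached yet.
    for stage in PIPELINE:
        if stage not in tags:
            tags.add(stage)  # monotonic: add, never remove
            return stage
    return None  # already merged
```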
Your quote — "the architecture emerges from the problems you actually hit" — is exactly how ttal was built. Nothing was designed upfront. Every feature exists because something broke. It took about two months to shape it into a complete toolkit — now it's something anyone can use to manage 10+ repos with Claude Code (Codex support is on the roadmap).
The PM-practice approach (standups, sprints, retros) is interesting — we automated that layer into the pipeline system.
ttal go drives every transition: one command for the entire lifecycle.
I wrote more about the multi-repo setup here if you're curious: How I Manage 15+ Repos with Claude Code (Without Losing My Mind)
glad you circled back - genuinely curious where ttal landed on the coordination problem. did you end up with a central orchestrator or let agents signal each other more loosely? i keep hitting the same wall: central control is predictable but brittle, mesh is flexible but debugging is a nightmare when something silently fails mid-chain.
both — mesh for managers, hierarchy for workers. mesh is great for sharing context and coordination. hierarchy is great for execution and keeping control.
managers talk to each other through [ttal send] — share info, delegate work; workers only talk upward via [ttal alert] to their spawners when something blocks them. Best of both layers.
that layering makes sense. the mesh-at-manager / hierarchy-at-worker split solves the thing i keep running into - managers need situational awareness but workers just need a clean contract and an escalation path. the [ttal alert] upward-only pattern is interesting too, keeps the signal-to-noise ratio sane.
this resonated hard. been wrestling with similar coordination nightmares for content automation agents. the "nobody is watching the watchers" line hits perfectly. the PM framework approach is brilliant - agents really do need the same structure as junior devs. your point about keeping agents current through continuous learning is spot on. daily.dev is incredibly useful for this since it surfaces new patterns in AI orchestration, multi-agent systems, and emerging tools. staying plugged into what other builders are discovering prevents you from reinventing solutions that already exist.
content automation is a good stress test for this - the failure modes are visible fast. daily.dev surfacing context for agents is a nice touch, curious how you feed it in. static context file or something more dynamic?