Why SSE for AI agents keeps breaking at 2am
Every team building AI agent UIs writes their own SSE client. And every team hits the same four bugs.
I know because we shipped 36 agent tools at Praxiom before we sat down and wrote a real protocol instead of patching the same streaming code for the fifteenth time. This is a post-mortem on the four bugs. At the end I'll show you what we extracted.
**The setup**
You're building a chat-style UI backed by an LLM agent. The agent calls tools, thinks for a few seconds, maybe runs multiple turns. You want the frontend to stream tokens in real-time, show "running web search..." while a tool is active, and display a progress bar for longer operations.
SSE seems like the obvious choice. It's simple. You've used it before. You write the server in an afternoon.
Then you go to production.
**Bug #1: The chunk boundary**
Here's the hand-rolled SSE parser most people write:
```javascript
for await (const chunk of stream) {
  const text = decoder.decode(chunk);
  const lines = text.split('\n');
  for (const line of lines) {
    if (line.startsWith('event: ')) {
      currentEvent = line.slice(7);
    } else if (line.startsWith('data: ')) {
      dispatch(currentEvent, JSON.parse(line.slice(6)));
      currentEvent = ''; // reset
    }
  }
}
```
This works in local dev. The `event:` and `data:` lines arrive in the same chunk because there's no network latency.

In production, under load, with a real network, a proxy, or nginx in the path — they don't.
Chunk 1 arrives: `event: token\n`
Chunk 2 arrives: `data: {"text":"Hello"}\n\n`
Your parser resets `currentEvent` after chunk 1. When chunk 2 arrives, `currentEvent` is `""`. The event is dropped silently. Your tokens disappear in production but never in staging.

The fix: `currentEvent` must survive across `reader.read()` calls. It's not a per-chunk variable — it's a per-stream variable. Reset it only after the `data:` line is dispatched, not at any chunk boundary.
```javascript
// Outside the chunk loop — survives across reads
let currentEventType = '';

for await (const chunk of stream) {
  // splitLines(): decode the chunk and split into lines (elided here)
  for (const line of splitLines(chunk)) {
    if (line.startsWith('event: ')) {
      currentEventType = line.slice(7);
    } else if (line.startsWith('data: ') && currentEventType) {
      dispatch(currentEventType, JSON.parse(line.slice(6)));
      currentEventType = ''; // reset HERE, not at chunk boundary
    }
  }
}
```
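The fix above still assumes each chunk contains whole lines, but on a real network a single line can also split across two reads. Here is a fuller sketch that buffers partial lines too, and decodes with `{ stream: true }` so multi-byte characters survive chunk boundaries (`createSseParser` and `dispatch` are illustrative names, not the library's API):

```typescript
// Sketch of a parser where BOTH the event type and any partial line
// survive across reads. `dispatch` is a placeholder callback.
function createSseParser(dispatch: (event: string, data: unknown) => void) {
  const decoder = new TextDecoder();
  let lineBuffer = '';   // partial line carried between chunks
  let currentEvent = ''; // event type carried between chunks

  return (chunk: Uint8Array): void => {
    // {stream: true} keeps multi-byte UTF-8 sequences intact across chunks
    lineBuffer += decoder.decode(chunk, { stream: true });
    const lines = lineBuffer.split('\n');
    lineBuffer = lines.pop() ?? ''; // last element may be an incomplete line
    for (const line of lines) {
      if (line.startsWith('event: ')) {
        currentEvent = line.slice(7).trim();
      } else if (line.startsWith('data: ')) {
        // per the SSE spec, the default event type is "message"
        dispatch(currentEvent || 'message', JSON.parse(line.slice(6)));
        currentEvent = ''; // reset only after dispatch
      }
    }
  };
}
```

Feed it raw `Uint8Array` chunks straight from `reader.read()`; the two carried-over buffers are the entire fix.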
**Bug #2: 30 React renders per second**
Claude 3.5 Sonnet emits roughly 25–35 tokens per second. Without any batching, each token event directly updates state:
```javascript
onToken: (e) => setText(prev => prev + e.text)
```
That's 30 setState calls per second. React batches some of these in concurrent mode, but not reliably under high frequency. What you get is visible jank — the text renders choppy, other UI elements freeze, and on slower devices the whole component tree starts missing frames.
The fix isn't complicated. Accumulate tokens into a buffer and flush on an interval:
```javascript
let buffer = '';
let lastFlush = Date.now();
const INTERVAL_MS = 50;

onToken: (e) => {
  buffer += e.text;
  const now = Date.now();
  if (now - lastFlush >= INTERVAL_MS) {
    setText(prev => prev + buffer);
    buffer = '';
    lastFlush = now;
  }
}

// On stream end, flush remainder
onDone: () => {
  if (buffer) setText(prev => prev + buffer);
}
```
50ms gives you at most 20 renders per second — smooth to the eye, at a fraction of the CPU cost. The only subtlety: make sure you flush the remainder on stream end, or the last few tokens never appear.
**Bug #3: The loading state that never resolves**
Your server looks like this:
```python
async def stream_agent(request):
    async def generate():
        async for event in agent.run():
            yield emitter.token(event.text)
        yield emitter.done()  # <-- this line
    # media_type matters: proxies may buffer non-SSE content types
    return StreamingResponse(generate(), media_type="text/event-stream")
```
That done event is what tells the frontend to set isStreaming = false. But what happens when the server crashes mid-stream? An unhandled exception in your agent loop. A memory error. An upstream API timeout that your error handling missed.
The done event is never emitted. The SSE connection closes. Your frontend detects the closure... and does nothing, because "connection closed" and "stream finished" look the same from the client side.
The spinner keeps spinning. The user stares at it. Eventually they reload.
The fix: synthesize a done event client-side when the connection closes without one:
```javascript
// After the read loop exits normally or via error
if (!receivedDone) {
  callbacks.onDone?.({ synthetic: true });
  setState(prev => ({ ...prev, isStreaming: false, isDone: true }));
}
```
The UI recovers cleanly. You log the synthetic done event server-side as a signal that something went wrong upstream.
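Wired into a read loop, the tracking looks roughly like this. This is an illustrative sketch, not the library's exact API; `parse` stands in for the cross-chunk parser from Bug #1:

```typescript
// Illustrative: track whether a real `done` event arrived; if the read
// loop exits without one, synthesize it so isStreaming can resolve.
async function consumeStream(
  body: ReadableStream<Uint8Array>,
  parse: (chunk: Uint8Array) => Array<{ type: string; data: unknown }>,
  onDone: (info: { synthetic: boolean }) => void,
  onEvent: (type: string, data: unknown) => void
): Promise<void> {
  let receivedDone = false;
  const reader = body.getReader();
  try {
    while (true) {
      const { done, value } = await reader.read();
      // normal close AND many crash paths both land here
      if (done || !value) break;
      for (const ev of parse(value)) {
        if (ev.type === 'done') {
          receivedDone = true;
          onDone({ synthetic: false });
        } else {
          onEvent(ev.type, ev.data);
        }
      }
    }
  } finally {
    // Exited the loop without a real done event: synthesize one so the
    // UI never hangs on a spinner
    if (!receivedDone) onDone({ synthetic: true });
  }
}
```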
**Bug #4: Retry logic that makes things worse**
The standard reconnect implementation retries on any connection failure. But there are two very different kinds of failures:
**HTTP errors (4xx/5xx):** The request reached your server. The server said no — bad auth token, rate limit, malformed request body, the endpoint changed. Retrying the exact same request will get the exact same error. You're just hammering your own server.

**Network drops:** The TCP connection closed mid-stream. The client never got a response, or got a partial one. This should retry — it's likely transient (the user's wifi dropped, a proxy timeout, a load balancer cycle).
Most hand-rolled retry logic doesn't distinguish between them:
```javascript
// ❌ Wrong — retries on 403, hammers server, wastes tokens
catch (error) {
  setTimeout(retry, 1000);
}
```
The correct split:
```javascript
const response = await fetch(endpoint, options);

if (!response.ok) {
  // HTTP error — throw immediately, no retry
  throw new HttpError(response.status, await response.text());
}

// Past this point: we have a 200 and are reading the stream.
// Any failure here is a network drop → retry with backoff.
try {
  await readStream(response.body);
} catch (networkError) {
  if (attempt < MAX_RETRIES) {
    await sleep(Math.pow(2, attempt) * 1000); // 1s, 2s, 4s
    return retry(attempt + 1);
  }
  throw networkError; // out of retries — surface it
}
```
HTTP errors surface immediately to the user. Network drops retry silently up to 3 times. Your error handling for a 403 Forbidden is fundamentally different from your handling for a dropped connection.
**The same five events, every time**
After shipping 36 agent tools at Praxiom, we noticed something. Every tool needed to emit:
- Tokens accumulating into the response text
- Tool calls and their status (running → done / error)
- Thinking blocks (for extended thinking models)
- Progress for multi-step pipelines
- A clean end signal with metadata
And every frontend needed to consume them with the same state shape: `text`, `isStreaming`, `activeTools`, `progress`, `error`, `isDone`.
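Written down as a type, that shared shape looks roughly like this (field names taken from the article's hook example; the library's actual exported types may differ):

```typescript
// Sketch of the state shape every frontend converged on. Field names
// come from the article; this is not the library's exported type.
interface AgentStreamState {
  text: string;                // accumulated token text
  isStreaming: boolean;        // true from first event until done
  activeTools: string[];       // names of tools currently running
  progress: { percentage: number; message: string } | null;
  error: { message: string; status?: number } | null;
  isDone: boolean;             // set by a real OR synthetic done event
}

// Reset applied when a new stream starts
const initialState: AgentStreamState = {
  text: '',
  isStreaming: false,
  activeTools: [],
  progress: null,
  error: null,
  isDone: false,
};
```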
We were rediscovering the same edge cases on every new tool. The token batching tweak happened three separate times before someone documented it. The chunk boundary bug was fixed in four different files.
So we extracted it.
**agent-stream**
A typed SSE event protocol for AI agents. Nine event types. Python emitter. React hook. JSON Schema spec.
```bash
pip install agent-event-stream
npm install @agent-stream/react
```
Python — emit from any async generator:
```python
from agent_stream import AgentStreamEmitter
from agent_stream.fastapi import agent_stream_response

emitter = AgentStreamEmitter()

async def run_agent(message: str):
    async for chunk in anthropic_client.stream(message):
        yield emitter.token(chunk.text)

    yield emitter.tool_use("web_search", tool_id, "searching...")
    # ... run tool ...
    yield emitter.tool_result("web_search", tool_id, "found 3 results", duration_ms=850)

    yield emitter.done(num_turns=2, tool_count=1, duration_ms=3200)

@app.post("/chat")
async def chat(req: ChatRequest):
    return agent_stream_response(run_agent(req.message))
```
React — full state from one hook:
```tsx
const { text, isStreaming, activeTools, progress, error, isDone, startStream } =
  useAgentStream();

return (
  <div>
    <p>{text}{isStreaming && <Cursor />}</p>
    {activeTools.map(tool => <ToolBadge key={tool} name={tool} />)}
    {progress && <ProgressBar value={progress.percentage} label={progress.message} />}
    <button onClick={() => startStream('/chat', { message })}>
      Send
    </button>
  </div>
);
```
All four bugs above are handled in the library. Cross-chunk parsing is correct by construction. Token batching is on by default (50ms). Synthetic done fires when the server drops the connection. Retry logic distinguishes HTTP errors from network drops.
The JSON Schema spec (spec/events.schema.json) means you can implement the protocol in any language. It's not a React-only thing — we have a FastAPI server and the client is a plain TypeScript class that works in any framework.
**What's next**
We're building more of these extracts out of Praxiom's infrastructure — the parts that turn out to be the same across every AI product. agent-stream is the first.
If you're hitting these bugs, or if you've hit others we haven't documented — open an issue. The hard-won production details are the most valuable thing we can contribute.
→ github.com/abhichat85/agent-stream
Extracted from Praxiom - www.praxiomai.xyz
**Top comments**
Bug #1 (chunk boundary) is one of those things that bites you exactly once in production before you learn to never trust chunk alignment again. I run about a dozen AI agent tasks daily on a large static site pipeline, and the "loading state that never resolves" (Bug #3) was probably the most painful to debug — you end up adding synthetic timeouts everywhere as a safety net before you realize the real fix is synthesizing the done event client-side.
Curious about your token batching approach — do you find 50ms is the sweet spot across different models? I noticed faster models (like Claude 3.5 Haiku) can push even higher token rates, and I wonder if adaptive batching based on token velocity would be worth the complexity.
On 50ms — honest answer: it's a pragmatic default, not a derived number. 50ms at 30 tokens/sec gives you ~1.5 tokens per flush, which renders smoothly without the UI feeling laggy. But you're right that Haiku changes the math — at 60-80 tokens/sec you're batching 3-4 tokens per flush which still feels fine, but at the tail end of a burst you might get 8-10 tokens landing at once, which can feel slightly chunky on the first render.
Adaptive batching is interesting. The naive version would be: measure the inter-token interval over a sliding window of 5 tokens, then set `flush_interval = clamp(inter_token_ms * 2, 16, 100)`. Fast model → short interval → more frequent renders. Slow model → longer interval → less thrash. Probably 20 lines of code.
The complexity I'd worry about isn't the implementation — it's the debugging. Static interval means predictable render timing, which makes profiling straightforward. Adaptive interval means your render frequency changes mid-stream, which makes "why did the UI stutter at exactly this point" harder to answer.
For a static site pipeline where you're not rendering in a browser, though, the batching probably doesn't matter at all — you care about final output, not intermediate render frames. Curious what your consumption looks like on that side — are you piping tokens into a file writer or something more structured?
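For anyone who wants to try it, the adaptive interval can be sketched in roughly those promised 20 lines (illustrative only; agent-stream ships the static 50ms default):

```typescript
// Sketch of adaptive flush-interval selection: average the last few
// inter-token gaps, then clamp(avg_gap * 2, 16, 100). Illustrative,
// not part of the library; the static 50ms default is what ships.
function createAdaptiveInterval(windowSize = 5) {
  const gaps: number[] = [];
  let lastTokenAt: number | null = null;

  // Call once per token with the current timestamp (ms); returns the
  // flush interval to use for the next batch.
  return (now: number): number => {
    if (lastTokenAt !== null) {
      gaps.push(now - lastTokenAt);
      if (gaps.length > windowSize) gaps.shift(); // sliding window
    }
    lastTokenAt = now;
    if (gaps.length === 0) return 50; // no data yet: static default
    const avgGap = gaps.reduce((a, b) => a + b, 0) / gaps.length;
    // flush_interval = clamp(inter_token_ms * 2, 16, 100)
    return Math.min(100, Math.max(16, avgGap * 2));
  };
}
```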
That adaptive batching formula is clever — `clamp(inter_token_ms * 2, 16, 100)` is a nice pragmatic approach. You're right that the debugging complexity is the real cost though. When something stutters in production and your flush interval is a moving target, you're basically adding a second variable to every rendering investigation.

For my use case (static site generation pipeline), you nailed it — I'm consuming the full output as a batch, so streaming behavior is mostly irrelevant. The SSE pain points I hit are more around connection lifecycle: zombie connections after network blips, and the retry behavior when the server closes gracefully vs. crashes. Those are the 2am bugs that inspired the original article.
Appreciate the detailed math on token batching. That's the kind of concrete analysis that actually helps people make the static vs adaptive decision for their specific context.
Thanks man. Go ahead and play around with the repo. Everything's open source.
Great post-mortem. The chunk boundary bug is especially insidious because it's literally impossible to reproduce locally — one of those "works on my machine" bugs that only manifests under real network conditions.
We see a similar pattern when scoring APIs for autonomous agent use: the reliability problems that matter most are the ones that only surface in production under load, not the ones visible in getting-started tutorials. The SSE reconnection gap you describe (Bug #3) is a perfect example — an agent that misses a tool-result event because the SSE stream dropped and reconnected doesn't know it missed anything. It just hangs.
Curious whether you considered WebSockets for the bidirectional case (agent sending tool calls while streaming responses), or if SSE + structured events was sufficient for most of your 36 tools.
The "agent doesn't know it missed anything" framing is exactly right and it's subtler than most people realize. The client reconnects, the stream resumes, but the state machine on both sides has diverged — the server is mid-tool-result, the client just started fresh. Without a session ID and a sequence number on every event, you have no way to detect the gap. We don't solve that in agent-stream today. It's on the list, but honestly it requires the server to buffer recent events, which is a bigger infrastructure commitment than most teams want for v0.1.
On WebSockets — we considered it seriously. The honest reason we didn't go there: in practice, tool calls in our 36 tools are all server-side. The agent decides to call web_search, the server executes it, the server emits tool_use/tool_result back downstream. The client never initiates mid-stream. So the "bidirectional" requirement we actually had was: client sends initial request, server streams back a long multi-turn response. HTTP POST + SSE covers that cleanly.
Where WebSockets would win: human-in-the-loop cases where the client needs to inject input mid-stream — answering a clarifying question, approving a tool call before execution, or cancelling a specific branch of a parallel agent. We have a stub for an input_required event type that would signal this, but the client-side response path isn't there yet. That's the point where SSE starts to feel like you're fighting the protocol.
Your reliability framing resonates — the failure mode that actually matters in production isn't "API returned 500", it's "agent completed with wrong state and nobody noticed". The done-without-result case is probably the most common one we see. Curious what your scoring approach looks like for that — are you diff-ing expected vs actual event sequences, or something more semantic?
The AN Score approach is more semantic than sequence-diffing, though the underlying concern is the same. For streaming/SSE surfaces specifically, we look at: (1) whether error responses are structurally differentiated from partial completions (a done-without-result is a different failure class than a mid-stream 500), (2) whether the provider surfaces recovery metadata in the stream (event IDs, sequence numbers, checkpoint tokens), and (3) whether reconnect behavior is documented and consistent. Most providers score poorly on (2) because it requires server-side buffering, as you noted. The done-without-result failure mode you described maps to a specific execution dimension deduction in our scoring — APIs that can complete a request without signaling outcome get penalized because agents cannot distinguish success from silent failure. Sequence diffing would be cleaner but requires a reference implementation to diff against, which we don't have for most providers. The semantic approach is: given the documented behavior, can an agent detect and recover from this failure class without human intervention?
SSE connection drops at 2am are almost always load balancer idle timeouts — the AI agent keeps processing but the connection dies silently. The fix that worked for us: aggressive client-side heartbeat pings every 15s + reconnect-with-resume logic keyed on server-sent event IDs. Most SSE tutorials skip both.
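A minimal version of that client-side watchdog, as a sketch (the 15s heartbeat cadence comes from the comment above; the 45s idle window is an assumed multiple of it, not a library default):

```typescript
// Illustrative idle watchdog: if no event (heartbeats included) arrives
// within idleMs, fire onTimeout so the client can abort and reconnect,
// ideally sending a Last-Event-ID header to resume. The 45s default is
// an assumption: roughly 3x a 15s heartbeat cadence.
function createIdleWatchdog(onTimeout: () => void, idleMs = 45_000) {
  let timer: ReturnType<typeof setTimeout> | null = null;
  const reset = () => {
    if (timer !== null) clearTimeout(timer);
    timer = setTimeout(onTimeout, idleMs);
  };
  const stop = () => {
    if (timer !== null) clearTimeout(timer);
    timer = null;
  };
  return { reset, stop };
}
```

Call `reset()` on every received event; if the load balancer silently kills the connection, `onTimeout` fires instead of the stream hanging forever.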
Bug #4 resonates a lot — the HTTP vs network error distinction is easy to overlook until your own server starts getting hammered.
We ran into a similar pattern when building streaming for AI agent monitoring. The retry-on-any-failure approach works fine in dev but becomes a real problem under load when 4xx responses start cascading. Adding exponential backoff helped, but distinguishing error types first was the real fix.
The synthetic "done" event on connection close (Bug #3) is a nice pattern too — we ended up doing something similar with a timeout-based fallback when the stream stalled.
Good write-up.
The cascading 4xx pattern is one of those failures that only becomes visible at scale — in dev you might get one 429 and it retries successfully, but under real load the retry itself contributes to the overload and you end up in a feedback loop. The HTTP/network distinction is the gate that breaks it.
On the timeout-based fallback vs synthetic done — they're solving the same problem but with different failure modes. A timeout fires late by design, so your UI hangs for the full timeout window before recovering. The synthetic done fires immediately on stream close, so the UI snaps back the moment the connection drops. The tricky part is that reader.read() resolves with {done: true} on normal close and on crash — you only know to emit the synthetic event if you reach that condition without having seen a real done event. Once you see that the condition is "exited read loop without done", the implementation becomes obvious and the timeout disappears entirely.
Curious about the monitoring angle — when you say AI agent monitoring, are you watching the stream events themselves (token rates, tool call latency, turn counts) or the downstream effects (agent outputs, user signals, business metrics)? Asking because the .jsonl recording format we just shipped to agent-stream is essentially a structured trace of every stream session with millisecond timing, and I keep wondering whether that's the seed of something more systematic for agent observability. The format is greppable and diffable today, but it's one schema step away from being queryable across sessions.