
Is resumable LLM streaming hard? No, it's just annoying.

Felipe Mautner

Dec 15, 2025


The state-of-the-art in LLM streaming is surprisingly bad.

Say you go to Google’s Gemini UI, start a conversation, and then switch to a new chat. When you come back to your original chat, the stream will be completely dead until you refresh the page. Claude does a little better, but not by much: if you start a conversation and then refresh mid-stream, it will send you back to the home page! With both providers, the only way to continue your conversation is to wait and guess until you think the stream is done, then refresh the page.

We wanted to do better at Stardrift. Actually – we had to. Our tasks run for minutes at a time: our AI travel agent makes dozens of tool calls and searches, then does more data enrichment mid-stream as it answers user queries. We need to show the user exactly what is happening as they wait, and have it still work if they tab away and then come back.

We looked around for a good guide or library, and were surprised to find… nothing. So here’s our guide to implementing good, resumable LLM streams, along with a story.

The success criteria, or: what is a 'resumable stream'?

We build a chat app. Here were our criteria:

  • Refreshing a chat tab mid-stream should not interrupt the stream
  • Switching between chats mid-stream should not interrupt the stream
  • Navigating to other sections of your app mid-stream should not interrupt the stream
  • Momentarily dropping internet connection should not interrupt the stream

We also wanted to ensure that at most one stream was active per chat.

These criteria seemed like the bare minimum. Streaming text responses are the backbone of a good UX in any chat-based application. Without them your app will feel sluggish, and if they're handled poorly they’ll feel sloppy. You might have incredible value, but if you lose your users to a layer of poor stream handling, they’ll never know it.

How did we get there?

How do you design an agent-chat application to be resilient to interruptions and network failures? The solution is pretty obvious, and a surprising amount of work. We kept hacking our way around it, so it took us a few months to get there.

Step 1: Our MVP - no resumable streams

In our first version of Stardrift, we didn’t bother with resumable streams. This is probably the right approach for 90% of LLM chat apps.

We run a Next.js frontend on Vercel with a FastAPI backend that runs on Modal. We use Vercel’s AI SDK protocol to communicate between them over server-sent events, or SSE. (There is no native Python library, so we wrote our own.)

To start streams, we’d simply hit a FastAPI backend endpoint, which would query the LLM and stream back the response. We would also simultaneously store responses in the database. If you refreshed the page, we lost the frontend’s connection to the backend, so we’d just show you whatever was in the database at the time.

Diagram 1
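
In case it's useful, here's a stripped-down sketch of what that first version looked like. The stream_llm_response and save_chunk helpers are hypothetical stand-ins for the LLM call and the database write, and the real endpoint speaks the AI SDK's SSE protocol rather than raw JSON:

import json

from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()

@app.post("/chats/{chat_id}/messages")
async def send_message(chat_id: str, body: dict):
    async def event_stream():
        # stream from the LLM, persisting each chunk as we relay it
        async for chunk in stream_llm_response(chat_id, body["prompt"]):
            await save_chunk(chat_id, chunk)
            yield f"data: {json.dumps(chunk)}\n\n"
        yield "data: [DONE]\n\n"

    # if the client disconnects mid-stream, this generator is cancelled and the
    # response simply stops; hence the non-resumable behavior
    return StreamingResponse(event_stream(), media_type="text/event-stream")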

This wasn't ideal, but we all have to start somewhere!

Step 2: Streamstraight to the rescue

Quickly, our lack of resumption became an issue. When we did demos, our users would complain that the app didn’t work.

Our app did work! But it ran for a long time, so our users would tab away to something else and then come back to an interrupted chat window. The conversation appeared if they refreshed, but only our most committed users tried that.

We knew what the ‘real’ fix was, which was to rip the LLM streaming logic out of the FastAPI server and into its own process. But that brought communication and orchestration overhead, and we wanted to avoid adding complexity to our system for as long as possible.

Happily, it turned out one of our friends had written a nice solution for this. Streamstraight was a plug-and-play solution that integrated into our FastAPI backend with just a few lines of code and gave us resumable streams! It also gave us a nice frontend utility to hook into.

To use it, our backend wrote the stream chunks to Streamstraight, which acted as a fallback channel and continued the stream over websockets if the client connection was interrupted:

Diagram 2

Everything was great, and this worked well for several months.

Step 3: We accidentally did an infra rewrite anyway

At some point, we hit a different issue: we needed to create a quick Stardrift demo. To do this, we built out cached demo cards, which you can see at the bottom of our home page. (These were inspired by a similar demo on the Manus website.) When you click on one of these, it shows you a sped-up, cached conversation.

We use a few pre-defined prompts for these, but we regenerate these conversations multiple times a day, so that price and availability information stays fresh.

Effectively, we needed a way to kick off a ‘trip planning request’ without user input. The easiest way to build this was to tear our conversation code out of the FastAPI server and put it in its own worker process, which we could trigger whenever we wanted. Since we’re built on Modal, orchestrating this was very easy; we just put the streaming logic in its own function and then called modal.spawn, which quickly spins up a tiny container for each request. Easy, stateless and scalable!
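
Roughly, the Modal side looks like the sketch below, where run_agent_loop is a hypothetical name for our agent code and the app name is illustrative:

import modal

app = modal.App("agent-worker")  # illustrative app name

@app.function(timeout=60 * 15)  # agent runs can take several minutes
def run_agent(chat_id: str, message_id: str, prompt: str):
    # run the full agent loop; chunks are written out as they are produced
    run_agent_loop(chat_id, message_id, prompt)

# From the FastAPI backend, kick off a run without blocking the request:
#   run_agent.spawn(chat_id, message_id, prompt)  # returns a handle immediately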

The missing link was getting the agent’s stream from the worker nodes back to the frontend. We did this using Redis streams. The worker dumps chunks in real time to a Redis stream which our backend then subscribes to on demand, relaying them to the frontend client as SSE. We kept using Streamstraight in our codebase for our resumable streams and even launched with this architecture.
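
A rough sketch of that relay with redis-py, using an illustrative key name and chunk format. For brevity this version just stops after a read timeout; how we actually detect completion is covered in Step 4:

import json

import redis

r = redis.Redis()

# Worker side: append each chunk to the chat's Redis stream as it is produced.
def publish_chunk(chat_id: str, chunk: dict):
    r.xadd(f"stream:{chat_id}", {"data": json.dumps(chunk)})

# Backend side: read from the start of the stream and relay it as SSE frames.
def relay_chunks(chat_id: str):
    last_id = "0"
    while True:
        entries = r.xread({f"stream:{chat_id}": last_id}, block=5000)
        if not entries:
            break  # no new chunks within the timeout; assume the stream ended
        for _, messages in entries:
            for entry_id, fields in messages:
                last_id = entry_id
                yield f"data: {fields[b'data'].decode()}\n\n"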

Diagram 3

That said, if you look at this and notice something’s up, you’re not alone. It turns out we accidentally built most of what we needed for in-house streaming resumption: the Redis streams were buffering the agent’s response independently of the SSE connection between frontend and backend. All that was left was a way to pick it up again from the front end if the connection dropped mid-stream…

Step 4: True end-to-end stream resumption

The final migration was non-trivial. One of the reasons streaming is difficult is because it isn’t just a backend systems problem; it also requires decent React knowledge.

We use Vercel’s AI SDK’s useChat hook (v5) to manage our chat interface. Under the hood it defines a transport class responsible for handling streams from the browser’s networking layer to the chat component. It's a pretty nice abstraction - here's what it exposes:

const {
  messages,
  setMessages,
  sendMessage,
  status,
  stop,
  error,
  regenerate,
} = useChat({
  initialMessages,
  resume: true,
  transport: new StardriftTransport(),
  id: chatId,
});

To support stream resumption, we implemented our own transport class, which lets us define a method called reconnectToStream. This is triggered any time the useChat component re-mounts onto an existing chat (for instance, after a page refresh):

import {
  DefaultChatTransport,
  type ChatTransport,
  type UIMessage,
  type UIMessageChunk,
} from "ai";

export class StardriftTransport<UI_MESSAGE extends UIMessage>
  extends DefaultChatTransport<UI_MESSAGE>
  implements ChatTransport<UI_MESSAGE>
{
  async reconnectToStream(options: {
    chatId: string;
  }): Promise<ReadableStream<UIMessageChunk> | null> {
    // reconnect to stream logic here
  }
}

This seemed great! We could simply override reconnectToStream so that it made a request to our backend for the existing message’s stream and returned the SSE stream!

The caveats

This initial naive approach didn’t work, though, because of two issues:

  1. The useChat hook doesn’t give you message_ids for a response until its stream is complete. So we have to track what messages are currently being streamed.
  2. reconnectToStream is called every time a chat component re-mounts. This means that even when the stream has completed and is no longer active, the transport will still try reconnecting to a live stream. So we need to track when a message’s stream is done.

The solution

So what could we do to deal with this?

We could create some frontend state to track this information. But this would add clutter and weight to an already complex frontend.

We could keep this state in our backend server, but then we would lose the nice stateless properties that allowed us to scale our backend so easily in the first place.

We ended up with a tried-and-true solution: shoving the complexity into Redis. We created a dead-simple Redis key-value store that maps each chat_id to its active, ephemeral message_id.

When the request for an agent’s response first comes in, we write chat_id → message_id to a Redis cache and kick off our agent loop.

Step 4-1

As the chunks come in, we write them in real time to a Redis stream keyed by a combination of the chat_id and message_id, which the backend subscribes to and relays to the frontend as SSE. (This lets us separate different assistant responses in each chat.)

Step 4-2

On every new message, we update the active message_id for the given chat_id. When the response is complete, we delete the chat_id → message_id mapping.
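
The worker-side bookkeeping for this is tiny. Here's a minimal sketch with redis-py, using illustrative key names (active:{chat_id} for the mapping, stream:{chat_id}:{message_id} for the chunks):

import redis

r = redis.Redis()

def mark_active(chat_id: str, message_id: str):
    # called when a new assistant response starts; on every new message this
    # simply overwrites the previous active message_id for the chat
    r.set(f"active:{chat_id}", message_id)

def mark_complete(chat_id: str, message_id: str):
    # deleting the mapping is what signals "this chat has no live stream"
    r.delete(f"active:{chat_id}")
    # keep the chunk stream around briefly in case a client is mid-replay
    r.expire(f"stream:{chat_id}:{message_id}", 3600)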

Do we need a separate key-value store for the chat’s status?

Short answer: yes. The truly annoying aspect of streams, and pub/sub systems in general, is all the race conditions that come up.

So if you (like us) are thinking “why a whole separate Redis store just to track the status of a chat’s stream? Can’t we just look at the Redis stream directly?”, here’s why:

Even if we enforce that any given chat_id has only one active Redis stream at any given time (by having the worker delete the entire stream as soon as the agent is done), we still run into a race condition. The backend (as the subscriber) may not have read the final chunk yet when the stream is deleted, and therefore can’t reliably know that the stream is complete. If we instead move the delete operation to the backend, we tightly couple the subscriber to the stream's lifecycle, losing the independence that's at the core of this flow.

In a more practical sense, though, the separate Redis store gives us an extra degree of flexibility in controlling our flow: we can explicitly track the state of streams from 'pending' to 'ongoing' to 'complete', and any other additional statuses we may want to know along the way.

Let’s say there’s an internet problem and the SSE connection drops. The worker node is entirely unaware that this happens and keeps adding chunks to the stream:

Step 4-3

When useChat remounts, reconnectToStream is triggered and makes a request to our backend.

On the backend, we first look up the chat_id in this key-value store and ask: “is there an active message for this chat?” If there’s nothing, the stream is done and we break early.

Otherwise, we get back an active message_id, which we combine with the known chat_id to pick up the corresponding Redis stream, returning it to the frontend as SSE and resuming the stream!

Step 4-4
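
Putting the lookup and the relay together, the backend's reconnection path looks roughly like this (FastAPI plus redis-py; the endpoint path and key names are illustrative, and the real version also translates chunks into the AI SDK's SSE protocol):

from fastapi import FastAPI, Response
from fastapi.responses import StreamingResponse
import redis

app = FastAPI()
r = redis.Redis()

@app.get("/chats/{chat_id}/stream")
def reconnect_to_stream(chat_id: str):
    # 1. Is there an active message for this chat?
    message_id = r.get(f"active:{chat_id}")
    if message_id is None:
        return Response(status_code=204)  # nothing to resume; the stream is done

    stream_key = f"stream:{chat_id}:{message_id.decode()}"

    def replay():
        last_id = "0"  # replay from the first chunk so no text is lost
        # 2. Relay chunks until the worker clears the active mapping.
        while r.exists(f"active:{chat_id}"):
            entries = r.xread({stream_key: last_id}, block=2000)
            if not entries:
                continue
            for _, messages in entries:
                for entry_id, fields in messages:
                    last_id = entry_id
                    yield f"data: {fields[b'data'].decode()}\n\n"
        # (the real flow also drains any chunks written just before the key is deleted)

    return StreamingResponse(replay(), media_type="text/event-stream")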

Final architecture

In the end we were left with this mostly-reasonable architecture:

Final Architecture

We'll probably replace the useChat hook in the end. Many problems were caused by trying to work around it. But, as this story demonstrates, in LLM streaming you have to take the bad with the good, and useChat continues to solve a lot of problems for us. So it lives another day.

Closing thoughts

We’re sharing this as a demonstration of 'best-practice hacking'. Nothing here is rocket science, but getting to this point took some iteration and willingness to adapt. Each change was thoughtful, and together they added up to a robust system.

While our use case is specific (multi-minute agentic tasks that require real-time transparency), the principles we learned apply broadly. Start simple, iterate when needed and pounce on easy opportunities to refactor when possible.

Of course this wouldn’t have been as easy if the rest of our backend stream processing wasn’t built up to high spec. It’s beyond the scope of this post, but we do some pretty cool stuff in our agent process!

We run into a lot of other interesting technical challenges in our day-to-day at Stardrift which we’d love to talk about. If any of this stuff sounds interesting to you, check out our careers page! And subscribe below to get future blog posts on the rest of our stack!

Thank you Leila Clark, Hansen Qian, Gilberto Mautner, Alec Leng, Betsy Pu and Sheon Han for comments on this piece!
