Stellar Cafe at Warp Speed
We’ve been hard at work improving the conversation experience within AstroBeam AI (our framework for building voice- and LLM-driven games). AstroBeam AI powers the NPC conversations in Stellar Cafe.
A big part of conversation quality is speed of response (the time from your last utterance to the moment you hear the first verbal response). When we launched Stellar Cafe, we were hitting somewhere between 1.6 and 1.8 seconds. That was just enough to feel like a conversation, but we realized that if we could somehow go faster, it would feel even better.
The challenge was significant. We had already squeezed out a lot to get to that speed, making ours the fastest-responding voice NPCs in a video game. The difficulty stems from the huge volume of work that has to happen on every turn: sending audio/data from your device -> performing speech-to-text -> determining end of speech -> running the LLM for reasoning with new state -> running safety models -> running text-to-speech -> finally streaming it back to the player.
We recently had some big breakthroughs optimizing the stack and here are some of the highlights!
Gemma 4
The biggest update was switching from Gemini 2.5 Flash to Gemma 4 31B. Moving to Gemma 4 31B not only gives us better reasoning, it also makes us much faster on TTFT (time to first token) and more consistent in response times. Reworking our prompts for the new model was a significant undertaking, but it was well worth the effort.
We want to give a special thanks to LiveKit for hosting a super fast version of Gemma 4 31B.
Inworld Realtime TTS-2
We also upgraded to Inworld Realtime TTS-2. Not only does this make our time to first audio faster, it also increases the quality and expressiveness of speech. This is a big improvement, and we are just scratching the surface.
Parallel Safety Model
Traditionally, safety models run immediately after the LLM finishes, and nothing is sent to TTS (text-to-speech) until they return. The problem is that this adds the safety model's full run time to the delay before the player hears a response. So instead of running the safety model directly after the LLM, we send the response to the TTS and run our safety model in parallel. Because the safety model is faster than the TTS, it adds no additional latency in this setup. If the safety model comes back with a flag, we simply stop the TTS and re-run our LLM. This also means we can run a more robust safety model, since the parallel setup gives it more time to work with.
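To make the parallel safety setup described above concrete, here is a minimal Python sketch of the control flow using asyncio. It is not AstroBeam AI's actual code: the function names (synthesize_speech, is_flagged, speak_with_parallel_safety), the sleep timings, the keyword-based "safety model", and the retry limit are all hypothetical stand-ins chosen for illustration. The only things taken from the description are the ordering (TTS and the safety check start together) and the reaction to a flag (stop the TTS and re-run the LLM).

```python
import asyncio


async def synthesize_speech(text: str, audio_out: list) -> None:
    """Hypothetical stand-in for streaming TTS: emits one 'chunk' per word."""
    for word in text.split():
        await asyncio.sleep(0.05)  # simulated synthesis time per chunk
        audio_out.append(word)


async def is_flagged(text: str) -> bool:
    """Hypothetical stand-in for a safety model that finishes before the TTS."""
    await asyncio.sleep(0.02)  # assumed to be shorter than the first TTS chunk
    return "forbidden" in text.lower()


async def speak_with_parallel_safety(generate, max_attempts: int = 3) -> list:
    """Start TTS and the safety check together; on a flag, stop TTS and retry.

    `generate` is a hypothetical async callable (attempt number -> LLM text).
    Returns the audio chunks of the first unflagged response, or [] if every
    attempt was flagged.
    """
    for attempt in range(max_attempts):
        text = await generate(attempt)
        audio_out = []
        # Kick off TTS immediately instead of waiting for the safety verdict.
        tts_task = asyncio.create_task(synthesize_speech(text, audio_out))
        flagged = await is_flagged(text)
        if not flagged:
            await tts_task  # safe: let the speech finish
            return audio_out
        # Flagged: stop the TTS, discard its audio, and re-run the LLM.
        tts_task.cancel()
        try:
            await tts_task
        except asyncio.CancelledError:
            pass
    return []


async def demo_llm(attempt: int) -> str:
    """Hypothetical LLM: the first reply trips the filter, the retry is clean."""
    await asyncio.sleep(0.01)
    return "forbidden topic" if attempt == 0 else "Welcome to the cafe"


if __name__ == "__main__":
    print(asyncio.run(speak_with_parallel_safety(demo_llm)))
```

In this sketch the flagged attempt is cancelled before its first audio chunk is produced, which mirrors the point that a safety model faster than the TTS adds no latency on the common, unflagged path. The cost only shows up on a flag, where the LLM has to run again.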
The end result is that we are now hitting response speeds of 1.15-1.4 seconds! That’s a whole 30% faster. Most voice-to-voice AI you experience doesn’t come close to that speed, and this is within a video game, at low cost.
Try it for yourself, now on Meta Quest: metaque.st/4lr5Hnp
We have lots of new announcements coming and can’t wait to share!