TTS MCP server for Home Assistant
Worth the buffer rewrites. The streaming latency target was the whole game.
What I was trying to do
Home Assistant’s built-in TTS pipeline is fine for short utterances but felt sluggish for anything conversational. I wanted Claude to be able to speak through HA — meaning the agent loop should be able to call a TTS tool, get audio streaming back, and have it route to the right Sonos zone with sub-second first-byte latency. The existing options either buffered the whole utterance before playing (too slow) or required cloud TTS (not the point).
How Opus helped
Opus and I spent about two hours designing the wire protocol before I wrote a line of code. The first version had the MCP server return chunks as base64’d opus-encoded frames over the standard MCP transport. Opus pushed back: the MCP transport isn’t optimized for streaming binary, and chunking large strings would balloon the JSON payload. The eventual design uses MCP to negotiate the stream — Claude calls a `speak()` tool which returns a transient HTTP endpoint, and the audio flows over that endpoint directly. The MCP call is the handshake; the audio is out-of-band.
That separation of “control plane” and “data plane” was Opus’s call and it was right. It also let me reuse my existing HA media-player integration without rewriting it.
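To make the handshake concrete, here is a minimal sketch of the control-plane side, assuming the official MCP Python SDK's `FastMCP` helper. Everything beyond the `speak()` tool itself (the registry, the host, the port, the URL shape) is a hypothetical stand-in, not the repo's actual code.

```python
# Sketch of the control plane. FastMCP is from the official MCP Python SDK;
# the pending_streams registry, host, and port are hypothetical placeholders.
import uuid

from mcp.server.fastmcp import FastMCP

mcp = FastMCP("tts")

# stream_id -> text waiting to be synthesized; consumed by the data plane.
pending_streams: dict[str, str] = {}

@mcp.tool()
def speak(text: str, zone: str) -> dict:
    """Negotiate a one-shot audio stream; the audio itself flows out-of-band."""
    stream_id = uuid.uuid4().hex
    pending_streams[stream_id] = text
    # The MCP response is only the handshake: a transient URL that the
    # HTTP data plane (the sidecar further down) will serve exactly once.
    return {"stream_url": f"http://vlinux05:8321/stream/{stream_id}", "zone": zone}

if __name__ == "__main__":
    mcp.run()
```

Notice what never crosses the MCP transport: audio bytes. The tool result is just an address.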
What went sideways
Three things tried to kill the project:
- The first buffer was too clever. I built a ring buffer with mark-and-rewind for handling cancellations. Worked great in tests, deadlocked under contention. Opus walked me through the failure mode by reading the code and asking “what happens if `pop` runs while the writer is between mark and commit?” — answer: hang forever. Replaced it with a simpler bounded queue (first sketch after this list).
- Piper’s chunk boundaries didn’t line up with phoneme boundaries. First playback sounded like a robot hiccuping every 200ms. Spent an evening adding a small (50ms) cross-fade in the player (second sketch below).
- Home Assistant’s `media_player.play_media` expects a URL, not a stream. I tried to fake it with a `data:` URL for the first chunk and got laughed at by my own logs. Ended up running a tiny FastAPI sidecar that exposes ephemeral URLs and proxies the stream from the MCP audio endpoint (third sketch below).
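First sketch: the bounded queue that replaced the ring buffer. `queue.Queue` already owns the locking, and a blocking `put()` gives backpressure for free. The class and names here are illustrative, not lifted from the repo.

```python
# Bounded-queue replacement for the mark-and-rewind ring buffer.
# queue.Queue owns the locking; a blocking put() is backpressure for free.
import queue

_EOS = object()  # sentinel: end of stream (or cancellation)

class ChunkPipe:
    def __init__(self, max_chunks: int = 32) -> None:
        self._q: queue.Queue = queue.Queue(maxsize=max_chunks)

    def push(self, chunk: bytes) -> None:
        self._q.put(chunk)  # blocks when the player falls behind

    def close(self) -> None:
        self._q.put(_EOS)  # a real cancel would also drain queued chunks

    def __iter__(self):
        while True:
            item = self._q.get()
            if item is _EOS:
                return
            yield item
```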
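Second sketch: the 50ms cross-fade is just a linear ramp across the seam. This assumes 16-bit mono PCM at 22.05kHz, which is what Piper voices commonly output; the repo's exact numbers may differ.

```python
# Linear cross-fade across the seam between consecutive Piper chunks.
# Assumes 16-bit mono PCM and that both chunks are longer than the overlap.
import numpy as np

RATE = 22050               # Piper voices commonly output 22.05 kHz
FADE = int(0.050 * RATE)   # 50 ms overlap window

def crossfade(tail: np.ndarray, head: np.ndarray) -> np.ndarray:
    """Blend the last FADE samples of `tail` into the first FADE of `head`."""
    ramp = np.linspace(0.0, 1.0, FADE)
    mixed = tail[-FADE:] * (1.0 - ramp) + head[:FADE] * ramp
    # Costs 50 ms of duration per seam, which is what hides the hiccup.
    return np.concatenate([tail[:-FADE], mixed.astype(tail.dtype), head[FADE:]])
```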
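Third sketch: the FastAPI sidecar. `StreamingResponse` accepts a plain iterator, so it can proxy chunks straight from a pipe like the one above. The route shape and the one-shot pop are my guesses at the design, not the actual code.

```python
# Ephemeral-URL sidecar: hands Home Assistant the plain URL it insists on,
# while the response body streams through from the synthesis pipeline.
from typing import Iterable

from fastapi import FastAPI, HTTPException
from fastapi.responses import StreamingResponse

app = FastAPI()

# stream_id -> live chunk source (e.g. a ChunkPipe from the first sketch)
streams: dict[str, Iterable[bytes]] = {}

@app.get("/stream/{stream_id}")
def stream(stream_id: str) -> StreamingResponse:
    source = streams.pop(stream_id, None)  # one-shot: URL dies after first use
    if source is None:
        raise HTTPException(status_code=404)
    # Chunks are yielded as they are synthesized, so playback starts
    # before the full utterance exists.
    return StreamingResponse(iter(source), media_type="audio/wav")
```

On the HA side it is then an ordinary `media_player.play_media` call with the ephemeral URL as `media_content_id`.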
Result
First-byte latency hovers around 280ms on the local network. Conversational pace feels right. The MCP server runs as a systemd unit on vlinux05 and HA discovers it via a small bridge config. I now have a Claude voice talking through the same speakers as Alexa, which is funny and slightly satisfying.
Code in the repo, including the bench script that proved the latency target.
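The core of the measurement is tiny: time from issuing the GET to the first body byte. A sketch of the idea, assuming the `requests` library and the URL shape from the sketches above; the real `bench/streaming_bench.py` presumably adds warmup and percentiles on top.

```python
# First-byte latency probe: elapsed time from sending the request until the
# first audio byte arrives, using requests' streaming mode.
import time

import requests

def first_byte_ms(url: str) -> float:
    t0 = time.perf_counter()
    with requests.get(url, stream=True, timeout=5) as resp:
        resp.raise_for_status()
        next(resp.iter_content(chunk_size=1))  # blocks until byte one lands
    return (time.perf_counter() - t0) * 1000.0

# Example (the URL would come from a real speak() handshake):
# print(f"{first_byte_ms('http://vlinux05:8321/stream/abc123'):.0f} ms")
```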
Files & links
- Source: `tts-mcp-server` on GitHub (see repo link above)
- HA config example: `examples/ha-config.yaml` in the repo
- Bench: `bench/streaming_bench.py`