Releases

vui — An Open-Source Jarvis

Harry Coultas Blum
May 8, 2026
13 min read

We're releasing vui — a real-time, fully-local voice assistant you can extend to do almost anything.

fluxions-ai/vui

Out of the box it's a streaming voice loop on your own machine. But the interesting part is what's behind it: a tool-calling layer that hands real instructions off to whatever agent stack you wire in. Plug it into Claude and anything Claude can touch — Gmail, calendar, your shell, MCP servers, the open web — vui can drive by voice. Plug it into something that can move a mouse and you've got a voice front-end for controlling your computer. Plug it into your own code and it drives that. The model speaks; you decide what it can do.

This is the bit we're most excited about. We believe it's the first open-source project that makes voice the interface to your tools and your local machine — not a chatbot in a tab, but something you talk to that actually does things, and that you can extend in a weekend. Voice is going to become one of the primary interfaces of human-machine interaction, but it needs to be more accessible for people to experiment with and build on top of. We hope this project achieves that.

We wanted something you could docker compose up on your own machine and actually have a conversation with — fast enough to feel natural, small enough to fit on a single GPU, with hooks that can plug it into the agent stack you already have.

What's in the box

  • vui Nano — 300M parameter Llama-style decoder + RQ-Transformer head over the Qwen3-TTS-12Hz codec
  • Streaming server — WebRTC + WebSocket pipeline (ASR → LLM → TTS) with a browser UI
  • OpenAI-compatible Realtime API at /v1/realtime — drop in for any client written against OpenAI's Realtime spec
  • One-shot /v1/voice-note endpoint — push-to-talk, audio in, rendered WAV out. Drop it behind a WhatsApp / Telegram / iMessage bot and you've got a voice-note assistant in an afternoon (see the sketch after this list)
  • Optional Claude task sidecar for agentic work — Gmail, calendar, web research
  • Standalone TTS demo if you just want to play with the voice on its own
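
To give a feel for the voice-note flow, here's a minimal Python sketch. The endpoint path comes from this release, but the multipart field name, the voice parameter, and the rest of the request shape are assumptions on our part — check the repo's API docs for the real contract.

# Sketch of a one-shot voice-note round trip. Field names and parameters here
# are illustrative assumptions, not the documented request shape.
import requests

with open("question.ogg", "rb") as f:
    resp = requests.post(
        "http://localhost:8080/v1/voice-note",
        files={"audio": f},
        data={"voice": "abraham"},
        timeout=120,
    )
resp.raise_for_status()

# The endpoint returns a rendered WAV reply.
with open("reply.wav", "wb") as out:
    out.write(resp.content)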

What makes vui different

  • Trained on real conversations, not voice-actor reads. Most TTS models narrate. vui was trained on two-way dialogue, so the prosody it generates carries the rhythm of an actual exchange — natural pauses, conversational pacing, the small acoustic cues that stop a reply sounding like an audiobook.
  • Controllable breaths and hesitations. As far as we know, vui is the first open model with first-class inline tags for non-verbal sounds — [breath], [sigh], [hesitate], [tut], [gasp] — that you can place exactly where you want them. The little disfluencies that make speech sound real.
  • Probably the most realistic model in its size class. At 300M parameters, head-to-head with the current crop of small open TTS models, we think it sounds the most human. Try it side-by-side and tell us if you disagree.
  • Runs on consumer hardware. ~5.5–12 GB of VRAM depending on which ASR and LLM you pick (and whether you co-locate them on the same GPU) — single 4090 territory either way. Apple Silicon support is on the way (see below).

Talk to your agent — vui + OpenClaw

Point OpenClaw's Talk mode at vui and you've got a fully local voice front-end for your agent. Your voice, your hardware, no OpenAI key, no audio leaving the box.

vui implements the OpenAI Realtime WebSocket spec at /v1/realtime — same events, same PCM16 @ 24 kHz audio format — so OpenClaw's existing openai realtime provider works as a drop-in client. One config swap and you're talking to your agent in abraham's voice instead of an OpenAI one:

"realtime.providers.openai": {
  "baseUrl": "ws://localhost:8080/v1/realtime",
  "apiKey": "not-needed",
  "voice": "abraham"
}

This is the setup we've been running internally and it's the use case we're most excited about — sub-500ms turn-around, memories, and the full OpenClaw tool surface (Gmail, calendar, web, your shell), all driven by voice. If you want a reference deployment, the included Claude task sidecar plays the same role and shows the wiring end-to-end.

OpenAI Realtime, drop-in

/v1/realtime speaks the full OpenAI Realtime event surface — session.update, input_audio_buffer.append, response.audio.delta, the lot — at PCM16 @ 24 kHz. The OpenClaw integration above is one client; the bundled browser UI at / is another. Anything written against OpenAI's Realtime spec works with a baseUrl swap.
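
To make the wire format concrete, here's a minimal Python client sketch against that endpoint. The event names (session.update, input_audio_buffer.append/commit, response.create, response.audio.delta) and the PCM16 @ 24 kHz framing come from OpenAI's Realtime spec; the session fields shown are a plausible subset rather than an exhaustive list, so treat this as a sketch, not a reference client.

# Minimal sketch of a client against vui's OpenAI-compatible Realtime endpoint.
# Event names follow OpenAI's Realtime spec; session fields are illustrative.
import asyncio, base64, json
import websockets

async def main():
    async with websockets.connect("ws://localhost:8080/v1/realtime") as ws:
        # Pick a session voice (names per the vui release).
        await ws.send(json.dumps({
            "type": "session.update",
            "session": {"voice": "abraham"},
        }))

        # Append PCM16 @ 24 kHz audio as base64, commit the turn, ask for a reply.
        pcm16 = open("utterance.pcm", "rb").read()
        await ws.send(json.dumps({
            "type": "input_audio_buffer.append",
            "audio": base64.b64encode(pcm16).decode(),
        }))
        await ws.send(json.dumps({"type": "input_audio_buffer.commit"}))
        await ws.send(json.dumps({"type": "response.create"}))

        # Collect streamed audio deltas until the response finishes.
        audio = bytearray()
        async for message in ws:
            event = json.loads(message)
            if event["type"] == "response.audio.delta":
                audio.extend(base64.b64decode(event["delta"]))
            elif event["type"] == "response.done":
                break
        open("reply.pcm", "wb").write(bytes(audio))

asyncio.run(main())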

Architecture

mic ──► WebRTC ─► VAD ─► faster-whisper ─► Ollama LLM ─► vui Nano ─► WebRTC ─► speaker
                                              │
                                              └─► thoughts stream (parallel tool router)

Three OS processes connected by torch.multiprocessing.Queue: a main aiohttp server handling WebRTC and conversation state, a TTS worker pinned to the GPU running vui Nano + RQ-Transformer + Qwen codec under CUDA graphs, and an ASR worker running faster-whisper or Moonshine.
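
As a rough illustration of that process layout — and emphatically not the actual vui code — the wiring looks something like this. The queue names and the stub transcribe/synthesize functions are placeholders:

# Toy skeleton of the three-process layout: a main server process exchanging
# work with ASR and TTS workers over torch.multiprocessing queues.
import torch.multiprocessing as mp

def transcribe(pcm: bytes) -> str:       # placeholder for faster-whisper / Moonshine
    return ""

def synthesize(sentence: str) -> bytes:  # placeholder for vui Nano + codec on the GPU
    return b""

def asr_worker(audio_q, text_q):
    while True:
        text_q.put(transcribe(audio_q.get()))

def tts_worker(sentence_q, speech_q):
    while True:
        speech_q.put(synthesize(sentence_q.get()))

if __name__ == "__main__":
    mp.set_start_method("spawn")
    audio_q, text_q = mp.Queue(), mp.Queue()
    sentence_q, speech_q = mp.Queue(maxsize=4), mp.Queue()
    mp.Process(target=asr_worker, args=(audio_q, text_q), daemon=True).start()
    mp.Process(target=tts_worker, args=(sentence_q, speech_q), daemon=True).start()
    # The main aiohttp process feeds audio_q from WebRTC, runs the LLM over
    # text_q output, pushes sentence chunks to sentence_q, and streams speech_q
    # back out over WebRTC.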

Streaming features that make it actually feel like a conversation:

  • Sub-500ms turn latency — from end-of-speech to first audio coming back, on a single 4090
  • VAD-driven endpointing to detect when the user has finished a turn
  • Speculative LLM prefill while the user is still speaking
  • Sentence-level TTS chunking with backpressure (the LLM waits for TTS to drain) — sketched after this list
  • Memories — the assistant can remember facts about you across sessions
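
The chunking-with-backpressure point is easiest to see in code. This is just the shape of the idea — a bounded queue between the LLM producer and the TTS consumer, so the LLM pauses when TTS falls behind — not the vui implementation:

# Backpressure via a bounded asyncio queue: the producer blocks on put() once
# the consumer falls behind. All content here is a stand-in.
import asyncio

async def llm_sentences():
    # Hypothetical stream: yield sentences as the LLM produces them.
    for s in ["Sure, one sec.", "Here's what I found.", "Anything else?"]:
        await asyncio.sleep(0.1)
        yield s

async def producer(q: asyncio.Queue):
    async for sentence in llm_sentences():
        await q.put(sentence)       # blocks when the queue is full — backpressure
    await q.put(None)               # end-of-turn sentinel

async def tts_consumer(q: asyncio.Queue):
    while (sentence := await q.get()) is not None:
        await asyncio.sleep(0.5)    # stand-in for synthesis + playback
        print("spoke:", sentence)

async def main():
    q = asyncio.Queue(maxsize=2)    # small buffer: the LLM stays ~2 sentences ahead
    await asyncio.gather(producer(q), tts_consumer(q))

asyncio.run(main())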

What you said, and how you said it

A conversation isn't just the words. It's the rise and fall of a voice, the speed, the breath before someone says "I dunno…". A model that only sees a transcript is reading subtitles — it loses the half of the signal that tells you whether someone is excited, bored, joking, hesitating, or about to interrupt. We think the bar for "real" conversational AI is being able to hear what you said and how you said it, and to let both shape the reply. It's the principle this product is built around and the direction we're pushing the model in.

A small honest note on the same theme: turn-taking is hard, and we don't have it solved. We tried a fair few of the dedicated turn-taking models out there and none of them really worked well enough to ship — too eager, too laggy, or too willing to plough through mid-sentence pauses. vui falls back to VAD-based endpointing for now, which is reliable but blunt. Better turn-taking is an open problem we care about — if you've worked on it, or want to, we'd love to have you contribute. PRs and issues welcome.

The thoughts stream

Agents are slow. A Gmail search, a calendar lookup, a web fetch — any of those takes seconds, and seconds of dead air kills the illusion of conversation. So vui runs a second LLM call in parallel with the main reply: a small, fast "thoughts" model that decides, on every turn, does this need a tool?

If the answer is yes, two things happen at once:

  1. The main TTS reply says something natural — "yeah, let me check…", "one sec…", "hmm, hang on…" — so the user hears a human-paced response immediately.
  2. The thoughts stream dispatches the actual work to the task sidecar (Claude, OpenClaw, your own agent). The result comes back, gets handed to the main loop, and the assistant speaks the real answer.

It's the same thing we do as humans: when someone asks for something that takes a moment, we acknowledge first, then go and do it.
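
In code, the pattern is roughly this — filler speech and the thoughts call run concurrently, and the real answer is spoken once the sidecar returns. Every function here is a stand-in for illustration, not the vui internals:

# Sketch of the parallel pattern: the filler reply and the tool-routing call
# run at the same time, so the user hears something immediately.
import asyncio

async def speak(text: str):
    print("TTS:", text)                      # stand-in for the vui Nano stream

async def thoughts(transcript: str) -> dict:
    await asyncio.sleep(0.2)                 # stand-in for the small tool-router LLM
    return {"tool": "delegate", "task": "check tomorrow's calendar"}

async def run_sidecar(task: str) -> str:
    await asyncio.sleep(2.0)                 # stand-in for Claude / OpenClaw / your agent
    return "You have two meetings tomorrow."

async def handle_turn(transcript: str):
    filler = asyncio.create_task(speak("yeah, let me check..."))
    decision = await thoughts(transcript)
    if decision["tool"] == "delegate":
        result = await run_sidecar(decision["task"])
        await filler
        await speak(result)

asyncio.run(handle_turn("what's on my calendar tomorrow?"))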

Tool calling

The thoughts stream isn't a free-form reasoner — it's a tool-calling LLM with a fixed schema. We hand the model the conversation so far, the current memories, the running task list, and a closed set of tools, and force it to call exactly one on every turn. The model is whatever you've got in Ollama (we run Qwen3 4B by default); the tool schema is the standard OpenAI / Ollama function-calling format, so any modern small open model that supports tool use slots straight in.
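
For orientation, here's what two of those tool definitions might look like in that function-calling format. The wrapper format is the standard one Ollama and OpenAI-style clients accept; the descriptions and parameter names below are illustrative rather than the schema vui ships:

# Illustrative tool definitions — standard function-calling wrapper, sketched details.
TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "no_action",
            "description": "Nothing to dispatch: casual chat, opinions, follow-ups answerable from context.",
            "parameters": {"type": "object", "properties": {}},
        },
    },
    {
        "type": "function",
        "function": {
            "name": "delegate",
            "description": "Hand a concrete instruction to the task sidecar (email, calendar, web, shell).",
            "parameters": {
                "type": "object",
                "properties": {
                    "task": {"type": "string", "description": "Plain-language instruction for the agent."},
                },
                "required": ["task"],
            },
        },
    },
]
# Passed to any tool-capable model, e.g. ollama.chat(model="qwen3:4b", messages=..., tools=TOOLS).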

The tool surface is deliberately tight — ten functions, grouped by what they do:

  • no_action — the most common call by far. Casual chat, opinions, follow-ups answerable from context, general knowledge the conversation stream can answer.
  • delegate(task) — hand a concrete instruction to the task sidecar (Claude / OpenClaw / your own agent). This is the bridge to your real tools — Gmail, calendar, MCP servers, the shell, the web. The task runs in the background; the main reply stalls with a "one sec…" and resumes when the result lands.
  • add_memory(text, replaces?) / remove_memory(query) / clear_memories — commit, update, or wipe durable facts about the user. Silent — they don't interrupt the conversation.
  • list_tasks / check_task(description) / cancel_task(description) / clear_tasks — let the user steer background work by voice ("what's running?", "is the email thing done yet?", "cancel that").
  • clear_context — wipe the conversation when the user asks to start over.

The system prompt does most of the work. It spells out, with examples, when each tool fires and — crucially — when not to: capability questions ("can you check my email?") are no_action, not delegate; follow-ups about results already spoken aloud are no_action, not a re-delegate; activities ("made pasta") aren't memories, durable facts ("allergic to nuts") are. Getting these boundaries right is what stops the assistant feeling either eager (delegating everything and stalling on every turn) or oblivious (never reaching for a tool when it should). Tuning the prompt is most of the work; we've included an eval harness (eval_thoughts.py) so you can iterate on it against your own conversations.

A couple of latency tricks make this affordable:

  • Speculative prefill. While the user is still speaking, we send the partial ASR transcript through the thoughts model with num_predict: 1 — enough to warm the KV cache so the real call after end-of-turn returns in a fraction of the time (sketched after this list).
  • keep_alive: 30m on the Ollama side keeps the model resident; we never pay cold-start.
  • Disruptive vs. silent tools. Memory ops run silently in the background. delegate and clear_context cancel the in-flight conversation reply first, so the assistant doesn't talk over its own tool result.
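
The prefill trick maps directly onto Ollama's /api/chat endpoint. num_predict and keep_alive are standard Ollama parameters; the rest of this call — model name, transcript handling — is illustrative:

# Warm-up sketch: one generated token is enough to populate the KV cache for
# the partial transcript; keep_alive keeps the model resident between turns.
import requests

def speculative_prefill(partial_transcript: str):
    requests.post("http://localhost:11434/api/chat", json={
        "model": "qwen3:4b",
        "messages": [{"role": "user", "content": partial_transcript}],
        "options": {"num_predict": 1},   # one token: warm the cache, nothing more
        "keep_alive": "30m",             # never pay cold-start on the real call
        "stream": False,
    }, timeout=30)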

If you want to wire your own agent in, you only need to implement the receiving end of delegate — point it at whatever stack you have (LangGraph, CrewAI, raw MCP, a shell script). The thoughts model decides whether to call a tool and what to ask; your sidecar decides how to do it.
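
A sidecar can be as small as one function. The contract is just "take a plain-language task, return something to speak back"; how vui actually hands you that task (queue, HTTP, subprocess) is defined by the repo's sidecar interface, so this handler — and its run: convention — is purely hypothetical:

# Hypothetical receiving end of delegate. The "run:" prefix convention is
# made up for illustration; swap the body for LangGraph, CrewAI, MCP, or your own code.
import subprocess

def handle_delegate(task: str) -> str:
    if task.startswith("run:"):
        out = subprocess.run(task[4:].strip(), shell=True, capture_output=True, text=True)
        return out.stdout.strip() or out.stderr.strip()
    return f"I don't know how to do: {task}"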

The model

vui Nano is a 300M parameter autoregressive LM over the Qwen3-TTS-Tokenizer-12Hz speech codec — 16 codebooks of 2048 entries at 12.5 Hz, decoded back to 24 kHz audio. Llama-style decoder, 768 dim, 22 layers, 8 heads, RQ-Transformer head for the codebook hierarchy. Speaker conditioning uses the ECAPA-TDNN encoder from Qwen3-TTS-12Hz-0.6B-Base. Context window: 6 minutes of audio, prompt inclusive.

Trained on just over a million hours of English conversational speech (114 years back-to-back, or ~170 human waking-years), it picks up multi-speaker dialogue and tries to sing back in tune with your prosody and inflections. Multi-lingual support is coming in a minor release — let us know which languages you'd like to see.

The dataset was labelled by our own listening model, akro-v1 — transcripts, diarization, and non-speech tags in one pass.

Inline tags work the way you'd expect:

So [breath] the thing about this is, it's not what you'd expect, right?

The advanced panel exposes two conditioning vectors learned at training time: SQ (speech quality, six DNSMOS / NISQA channels) and WPS (words per second). Both are situational — useful when a particular prompt isn't behaving — rather than something to set once and forget. WPS is the obvious one: dial it up or down to control speaking rate. SQ is worth a play when output sounds off in a way the prompt alone can't fix.

~8× realtime streaming on a 4090. Footprint of the full streaming stack: ~5.5 GiB VRAM with Ollama running on the host, ~12 GiB if you co-locate the LLM on the same GPU.

Apple Silicon support is on the way. We've been laying the MLX groundwork and want to ship a first-class M-series build soon — ideally with some help from the community. If you've shipped MLX-backed inference before (especially streaming + KV-cached autoregressive setups), we'd love your hands on it. Drop into the Discord or open an issue on the repo.

Voices & voice cloning

We're a British lab, and we've spent a lot of time being quietly let down by how American the open TTS landscape sounds. Most "British" voices in commercial models are either thin parodies or Mid-Atlantic. So we made authentic British accents a first-class citizen of this release — not an afterthought, not a filter, but actual native speakers in the curated set: abraham (well-spoken British), rhian (traditional British), harry (South London), and maeve (Irish).

Uploading an arbitrary voice prompt may work but won't sound as realistic as the base model: the released weights default toward the speakers they were fine-tuned on. This is deliberate. It's a safety choice, and it's also what makes the released model sound as good as it does in conversation.

Why not Pipecat or LiveKit?

Fair question — voice-agent frameworks like Pipecat and LiveKit Agents already exist. Why ship our own server?

The default shape in those frameworks is a cascade: STT → LLM → TTS, three separate services strung together. The text comes out of ASR, the LLM reads the transcript, the TTS speaks the reply without audio context. That works, but it throws away the half of the signal we care most about — how the user said what they said.

vui takes user audio as context. The LLM hears the breath, the rise and fall, the hesitation before "I dunno…" — not a flattened transcript. As far as we can tell, that puts it in a small group of open models that do this at all; almost everything in those frameworks today that listens to audio is a closed API.

It would take a bit of work to fit vui into either framework — but we'd absolutely love to see it there. If you want to help make that happen, open an issue or PR on the vui repo and we'll be right there with you.

Try it

Apache 2.0. The code, the streaming server, the realtime API, all of it. The model weights have their own terms on the HF model card; the Qwen codec and speaker encoder are governed by Alibaba's licenses.

Responsible use

vui generates speech that sounds convincingly human. We explicitly prohibit using it for fraud, misinformation, deepfakes of real people, harassment, or anything illegal in your jurisdiction. The fine-tuning to a curated voice set is part of how we make those misuses harder — but it isn't a substitute for your own judgment.

If you ship something with it, we'd love to see it.

Finally — there's plenty that still needs improving. Please do share critical feedback; I'll do my best to fix things as soon as I can.

Just want an API?

If you'd rather skip the self-hosting and point your services at a managed endpoint, join the waiting list — we're rolling out hosted access in waves.

Join the waitlist

Need more than the open release?

We run larger, more expressive checkpoints in-house and offer on-prem deployments — air-gapped installs, custom voices, dedicated support, SLAs.

Contact us
#tts #voice-assistant #streaming #open-source #vui