docker-qwenspeak: Text-to-Speech Over SSH Because APIs Are Overrated

Every cloud TTS service is the same story. Send text, get audio, pay per character, pray the API doesn’t change. Your audio goes to someone else’s servers. Your costs scale linearly. And the moment you need a custom voice or want to clone someone’s voice from a 3-second clip, you’re either paying enterprise pricing or you’re fucked.

Qwen3-TTS changed this. Alibaba dropped a legitimately good open-source TTS model that does preset voices with emotion control, voice cloning from reference audio, and voice design from plain English descriptions. All of it runs locally. No API keys, no cloud dependency, no per-character billing.

The problem? Setting it up is a nightmare. Python environments, CUDA dependencies, model downloads, torch version conflicts, FlashAttention compilation — the usual ML infrastructure hellscape. And once you get it running, there’s no good way to use it as a service. You’ve got a Python script on one machine and no clean interface for other machines to hit it.

So I built docker-qwenspeak. All of Qwen3-TTS inside a Docker container, accessible over SSH. Built on top of docker-lockbox — same security model, same sandboxed file operations, same SSH key auth. You pipe a YAML config via stdin, get a job UUID back, and download the audio when it’s done. One container, one command, every voice mode Qwen3-TTS supports.

Setup

First, download the models you need. The speech tokenizer is required by every mode; beyond that, grab only the variants for the modes you actually plan to use:

pip install -U "huggingface_hub[cli]"
# Speech tokenizer (used by all models)
huggingface-cli download Qwen/Qwen3-TTS-Tokenizer-12Hz --local-dir ./Qwen3-TTS-Tokenizer-12Hz
# CustomVoice: 9 preset speakers + emotion control
huggingface-cli download Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice --local-dir ./Qwen3-TTS-12Hz-1.7B-CustomVoice
huggingface-cli download Qwen/Qwen3-TTS-12Hz-0.6B-CustomVoice --local-dir ./Qwen3-TTS-12Hz-0.6B-CustomVoice
# VoiceDesign: natural language voice descriptions
huggingface-cli download Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign --local-dir ./Qwen3-TTS-12Hz-1.7B-VoiceDesign
# Base: voice cloning from reference audio
huggingface-cli download Qwen/Qwen3-TTS-12Hz-1.7B-Base --local-dir ./Qwen3-TTS-12Hz-1.7B-Base
huggingface-cli download Qwen/Qwen3-TTS-12Hz-0.6B-Base --local-dir ./Qwen3-TTS-12Hz-0.6B-Base

All model directories need to sit flat in the same parent dir — that’s what you pass with -m:

/your/models/dir/
  Qwen3-TTS-Tokenizer-12Hz/
  Qwen3-TTS-12Hz-1.7B-CustomVoice/
  Qwen3-TTS-12Hz-0.6B-CustomVoice/
  Qwen3-TTS-12Hz-1.7B-VoiceDesign/
  Qwen3-TTS-12Hz-1.7B-Base/
  Qwen3-TTS-12Hz-0.6B-Base/

Then install and run:

# One-liner install
curl -fsSL https://raw.githubusercontent.com/psyb0t/docker-qwenspeak/main/install.sh | sudo bash
# Add your SSH key
cat ~/.ssh/id_rsa.pub >> ~/.qwenspeak/authorized_keys
# Start with your models directory
qwenspeak start -d -m /path/to/your/models
# GPU mode (needs NVIDIA Container Toolkit)
qwenspeak start -d -m /path/to/models --processing-unit cuda --gpus 0
# With resource limits
qwenspeak start -d -m /path/to/models --memory 4g --swap 2g --cpus 4
# Other lifecycle commands
qwenspeak stop
qwenspeak status
qwenspeak logs -f
qwenspeak upgrade
qwenspeak uninstall

The installer creates ~/.qwenspeak/ with docker-compose, a CLI wrapper, and a .env that persists all flags between restarts — set the models path once, it sticks.

Optionally add an SSH client config so you don’t have to type the full host/port every time:

Host tts
    HostName your-server
    Port 2222
    User tts

Then you can skip the -p 2222 and user@host dance and just type ssh tts "tts list-speakers". The rest of this article sticks with the ssh tts@host form for clarity.

How It Works

Everything goes through a single YAML config piped via stdin to the tts command over SSH. The config has global settings and a list of steps. Each step declares a mode, loads the appropriate model, runs all its generations, then unloads. Settings cascade — global, then step, then per-generation — so you set defaults once and override where needed.
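
Here's the cascade in isolation, trimmed to the fields that matter: the first generation inherits everything, the second overrides the voice and sampling for just that one line. The full template further down shows every field.

temperature: 0.9           # global default
steps:
  - mode: custom-voice
    model_size: 1.7b
    speaker: Ryan          # step-level default
    language: English
    generate:
      - text: "All inherited: Ryan at temperature 0.9"
        output: inherit.wav
      - text: "Same step, different voice and sampling"
        speaker: Vivian    # per-generation override wins
        temperature: 0.7
        output: override.wav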

Jobs are async. TTS generation is slow, especially on CPU — you don’t want to sit there with an SSH connection open waiting for a 1.7B model to grind through 20 generations. Submit a config, get a UUID back immediately, close the terminal, come back later and download the results. Jobs execute sequentially — one pipeline at a time, new submissions queue up automatically. Up to 50 jobs in the queue by default.

# Get the template to start from
ssh tts@host "tts print-yaml" > job.yaml
# Edit it, then submit — returns immediately with a job UUID
ssh tts@host "tts" < job.yaml # {"id": "550e8400-...", "status": "queued", "total_steps": 3, "total_generations": 7} # Check progress ssh tts@host "tts get-job 550e8400" # Follow the log in real-time ssh tts@host "tts get-job-log 550e8400 -f" # List all jobs ssh tts@host "tts list-jobs" # Cancel a running or queued job ssh tts@host "tts cancel-job 550e8400" # Download when done ssh tts@host "get hello.wav" > hello.wav

Files go in and out through the built-in lockbox file operations — put to upload, get to download. Everything stays sandboxed in /work. Output WAV files land there, you pull them down when the job’s done.
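
Both directions are plain stdin/stdout redirection over SSH:

# Upload a reference clip into /work, pull a finished WAV back out
ssh tts@host "put ref.wav" < ref.wav
ssh tts@host "get hello.wav" > hello.wav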

Here’s the full YAML showing every option across all three modes:

# Global settings — apply to all steps unless overridden
dtype: float32             # float32, float16, bfloat16 (float16/bfloat16 GPU only)
models_dir: /models        # where your downloaded models live
flash_attn: auto           # auto-detects; set true/false to override
# Generation defaults — override per-step or per-generation
temperature: 0.9
top_k: 50
top_p: 1.0
repetition_penalty: 1.05
max_new_tokens: 2048
streaming: false
no_sample: false           # true = greedy decoding
steps:
  # preset speakers with optional emotion control
  - mode: custom-voice
    model_size: 1.7b       # 1.7b or 0.6b
    speaker: Ryan          # step default speaker
    language: English      # step default language
    generate:
      - text: "Hello world"
        output: hello.wav
      - text: "I cannot believe this!"
        speaker: Vivian    # override step speaker for this one
        instruct: "Speak angrily"   # emotion/style — 1.7B only
        output: angry.wav
  # describe the voice you want in plain English
  - mode: voice-design
    # model_size is always 1.7b for voice-design
    generate:
      - text: "Welcome to our store."
        instruct: "A warm, friendly young female voice with a cheerful tone"
        language: English
        output: welcome.wav
  # clone a voice from reference audio
  - mode: voice-clone
    model_size: 1.7b       # 1.7b or 0.6b
    ref_audio: ref.wav     # step default reference (prompt reuse when shared)
    ref_text: "Transcript of the reference audio"   # required unless x_vector_only
    language: Auto
    generate:
      - text: "First line in cloned voice"
        output: clone1.wav
      - text: "Second line"
        output: clone2.wav
      - text: "Override ref for this generation"
        ref_audio: other.wav    # different reference for this one
        x_vector_only: true     # use speaker embedding only, no transcript needed
        output: clone3.wav

Three Ways To Get a Voice

Three modes, each with different tradeoffs depending on what you need.

Preset Voices With Emotion

Nine built-in speakers across Chinese, English, Japanese, and Korean. The 1.7B model supports emotion and style control via the instruct field — tell Ryan to sound excited, tell Vivian to sound angry, tell Sohee to whisper. The 0.6B model is smaller and faster but drops emotion control.

Check what’s available with ssh tts@host "tts list-speakers" — it prints the full speaker table with gender, language, and description.

Voice Design

Describe the voice you want in plain English via instruct and the model generates it. instruct is required for this mode — it’s the whole point. No reference audio, no preset selection. Only available as 1.7B.

Voice Cloning

Give it a 3-second audio sample and a transcript, and it clones the voice. Upload the reference file with put, point ref_audio at it, generate as many lines as you want. Setting ref_audio and ref_text at the step level shares the voice prompt across all generations in that step — efficient and consistent.

For emotion from cloned voices: upload reference files with different emotional takes and use separate steps. The model picks up tone from the reference audio, so a happy sample gives you a happy cloned voice.

ssh tts@host "create-dir refs"
ssh tts@host "put refs/happy.wav" < me_happy.wav
ssh tts@host "put refs/angry.wav" < me_angry.wav
steps:
  - mode: voice-clone
    ref_audio: refs/happy.wav
    ref_text: "transcript of happy ref"
    generate:
      - text: "Great news everyone!"
        output: happy1.wav
      - text: "I'm so glad to hear that"
        output: happy2.wav
  - mode: voice-clone
    ref_audio: refs/angry.wav
    ref_text: "transcript of angry ref"
    generate:
      - text: "This is unacceptable"
        output: angry1.wav
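
Save that as a YAML file (name it whatever you like), submit it, and pull the results like any other job:

ssh tts@host "tts" < emotions.yaml
ssh tts@host "get happy1.wav" > happy1.wav
ssh tts@host "get angry1.wav" > angry1.wav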

CPU vs GPU

It runs on both. CPU mode uses float32 and works on any machine — just expect the 1.7B model to take a while. GPU mode with CUDA enables float16/bfloat16 and includes FlashAttention-2 for faster inference.

Memory reality check:

  • 0.6B float32 — ~2.4GB, fits in 4GB RAM
  • 1.7B float32 — ~7GB weights, needs 10GB+
  • 1.7B bfloat16 (GPU) — ~3.5GB weights, fits in 6GB VRAM
  • float16 on CPU — don’t. It produces inf/nan garbage

FlashAttention auto-enables on GPU. If your dtype is float32, it auto-switches to bfloat16 because FlashAttention requires half-precision. This is handled transparently — you don’t need to think about it.
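
In practice that's two knobs: the flags you start the container with and the dtype in your job config. Using only options already shown above:

# GPU: start with CUDA, then set dtype: bfloat16 in job.yaml
qwenspeak start -d -m /path/to/models --processing-unit cuda --gpus 0
# CPU: default start, keep dtype: float32 in job.yaml
qwenspeak start -d -m /path/to/models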

Logging

All pipeline output is logged to /var/log/tts/ inside the container. Mount it as a volume if you want to access logs from the host. Two files are maintained: tts.log (current day, truncated at midnight) and YYYY_MM_DD_tts.log daily archives. Old logs are auto-cleaned based on TTS_LOG_RETENTION (default: 7 days).
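
If you do mount it, here's a sketch of what that looks like in the generated compose file (the service name is a guess, check ~/.qwenspeak/ for the real layout):

# ~/.qwenspeak/docker-compose.yml (service name illustrative)
services:
  qwenspeak:
    volumes:
      - ./logs:/var/log/tts   # tts.log and the daily archives land in ./logs on the host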

# View last 20 lines
ssh tts@host "tts log"
# View last 100 lines
ssh tts@host "tts log -n 100"
# Follow (like tail -f)
ssh tts@host "tts log -f"

Built on Lockbox

The entire SSH layer, security model, and file operations come from docker-lockbox. That means:

  • SSH key auth only — no passwords
  • No shell access — every command goes through the Python wrapper
  • No injection — shell metacharacters are meaningless
  • All file operations sandboxed to /work
  • UID/GID matching so files have correct ownership on the host

The only allowed command beyond the built-in file ops is tts. That’s the entire attack surface — one binary that accepts YAML via stdin and writes WAV files to /work.

The Bottom Line

Local TTS that actually sounds good, runs in a container, and is accessible over SSH. No API keys, no cloud costs, no Python environment hell. Download the models, start the container, pipe YAML, get audio.

Nine preset voices with emotion control. Voice cloning from a 3-second sample. Voice design from a text description. Ten languages. CPU or GPU. Async job queue so you don’t sit around waiting.

Go grab it: github.com/psyb0t/docker-qwenspeak

Licensed under WTFPL — because restricting how people generate robot voices is a weird hill to die on.