aigate: Your Own Fucking AI Infrastructure

An RTX 3060 and no cloud budget. AI models behind one endpoint — six cloud providers on free tier, five running locally, the rest on flat-rate subs or pay-per-token as a last resort.
Not in theory. Right now. aigate is a Docker stack behind a single nginx port. Text generation, image generation, speech synthesis, transcription, web search, browser automation, object storage, agentic code execution, your Telegram account as an MCP tool, an async job queue, and a web UI. On hardware that costs less than one month of an OpenAI API bill.
The whole thing is OpenAI-compatible. Point any client at http://localhost:4000 and it works. Existing code, existing SDKs, existing tools — they all just talk to it like it’s OpenAI. It isn’t. It’s yours.
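A minimal sketch of what that looks like in practice, assuming your client is one of the OpenAI SDKs that read OPENAI_BASE_URL and OPENAI_API_KEY from the environment:

# repoint an existing OpenAI client at aigate; nothing else in the script changes
export OPENAI_BASE_URL=http://localhost:4000
export OPENAI_API_KEY=$LITELLM_MASTER_KEY
python your_existing_script.py  # hypothetical script using any OpenAI SDK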

Models — Most of Them Free

Six providers — Groq, Cerebras, OpenRouter, HuggingFace, Mistral, Cohere — are completely free. No credit card, no billing, no gotcha. Two more — claudebox and claudebox-zai — run on flat-rate subscriptions, so the marginal token costs nothing. Anthropic and OpenAI are in the stack, but only as a last resort. Five more providers run locally on your own hardware — no network, no rate limits, no cost at all.
The routing philosophy is simple: never pay for a token you could get free.
LiteLLM handles it. You request groq-llama-3.3-70b. Groq rate-limits you. LiteLLM silently falls back to cerebras-qwen3-235b. Cerebras is down. It falls back to mistral-small. Mistral responds. You get the answer. The client never knew anything happened. Every model has a fallback chain — free cloud first, then flat-rate, then pay-per-token, then local. The chain is rebuilt every time you start the stack, filtered to only the providers you’ve actually enabled.

groq-llama-3.3-70b → 429 rate limited
  ↓ fallback
cerebras-qwen3-235b → 503 unavailable
  ↓ fallback
mistral-small → 200 ✓

You didn’t write retry logic. You didn’t write fallback logic. You sent one request and got one response.

The Architecture

nginx :4000
├─► /claudebox/            → claudebox
├─► /claudebox-zai/        → claudebox-zai
├─► /stealthy-auto-browse/ → HAProxy → [browser ×5]
├─► /storage/              → hybrids3
├─► /q/                    → proxq → LiteLLM (async, returns job ID)
├─► /librechat/            → LibreChat (web UI)
├─► /searxng/              → SearXNG (meta-search)
├─► /telethon/             → telethon-plus (your Telegram account)
└─► /                      → LiteLLM (sync)
    ├─ Groq              (free, GROQ=1)
    ├─ Cerebras          (free, CEREBRAS=1)
    ├─ OpenRouter        (free tier, OPENROUTER=1)
    ├─ HuggingFace       (free, HUGGINGFACE=1)
    ├─ Mistral           (free: 1B tokens/month, MISTRAL=1)
    ├─ Cohere            (free: 1K req/day, COHERE=1)
    ├─ Ollama CPU        (local, OLLAMA=1)
    ├─ Ollama CUDA       (local, NVIDIA, OLLAMA_CUDA=1)
    ├─ Speaches CPU      (local, transcription + TTS, SPEACHES=1)
    ├─ Speaches CUDA     (local, CUDA STT, SPEACHES_CUDA=1)
    ├─ Qwen3-TTS CUDA    (local, CUDA voice-cloning, QWEN_TTS_CUDA=1)
    ├─ sd.cpp CPU        (local, image gen, SDCPP=1)
    ├─ sd.cpp CUDA       (local, image gen, SDCPP_CUDA=1)
    ├─ claudebox         (flat-rate, CLAUDEBOX=1)
    ├─ claudebox-zai     (flat-rate, CLAUDEBOX_ZAI=1)
    ├─ Anthropic         (pay-per-token, ANTHROPIC=1)
    └─ OpenAI            (pay-per-token, OPENAI=1)
MCP servers (all optional):
  ├─ stealthy_auto_browse  — multi-step browser automation (BROWSER=1)
  ├─ hybrids3              — object storage: upload, download, list, presign (HYBRIDS3=1)
  ├─ claudebox             — agentic Claude Code via OAuth or API key (CLAUDEBOX=1)
  ├─ claudebox_zai         — agentic Claude Code via z.ai/GLM (CLAUDEBOX_ZAI=1)
  ├─ telethon              — your Telegram account as a tool (TELETHON=1)
  └─ mcp_tools             — generate_image + generate_tts + search_web (auto-enabled)

Everything is opt-in. Flip flags in .env. Don’t have an Anthropic key? Don’t set it. Only have a CPU? Skip the CUDA flags. The stack adapts to what you’ve got and rebuilds its config accordingly.
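An illustrative .env slice for a CPU-only machine, using the flags from the diagram above (the matching API key variables live in .env.example):

# illustrative only: free cloud plus CPU-local services, nothing else
GROQ=1
CEREBRAS=1
MISTRAL=1
OLLAMA=1      # local CPU text generation
SPEACHES=1    # local CPU transcription + TTS
SDCPP=1       # local CPU image generation
# no GPU and no Anthropic key, so ANTHROPIC and the *_CUDA flags stay unset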

Tools Any Model Can Call

This is what makes aigate more than a proxy with a fallback chain. A whole pile of MCP servers, dozens of tools. Any model with function calling can invoke any of them autonomously. You prompt. The model decides what tools it needs. You get results.
The concrete workflow: you send a prompt to Groq — free, fast. The model decides it needs to research something. It calls search_web — SearXNG queries Google, Bing, DuckDuckGo simultaneously, returns results. The model needs more detail on one result. It calls the browser tool. A Camoufox instance opens real Firefox, moves a real mouse cursor, loads the page, extracts content. The model reads it, decides to save the result. Calls the storage tool. hybrids3 writes the file, returns a public URL. The model decides it needs to generate an image for the report. Calls generate_image. stable-diffusion.cpp renders it locally, uploads to storage, returns a URL. The model hands you a structured answer with links to everything it produced. One API call. Zero tokens paid. The client saw one request and one response.
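On the wire it is still a single ordinary chat completion; a sketch with an illustrative prompt and model:

# one request; the search, browse, storage, and image steps happen server-side as MCP tool calls
curl http://localhost:4000/chat/completions \
  -H "Authorization: Bearer $LITELLM_MASTER_KEY" \
  -d '{"model": "groq-qwen3-32b", "messages": [{"role":"user","content":"research the topic, save your findings to storage, and generate a cover image"}]}'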

stealthy_auto_browse

Five Camoufox replicas behind HAProxy. Camoufox is hardened Firefox — real OS-level mouse and keyboard input via PyAutoGUI, zero CDP exposure, persistent fingerprints per session. Passes Cloudflare. Passes CreepJS. Passes BrowserScan. Passes Pixelscan. Not “mostly passes” — actually passes, because it’s not detectable as automation in the ways those systems check.
One tool: run_script. Multi-step scripts — navigate, click, type, extract, screenshot, scroll, wait for elements, execute JavaScript. The model writes the script, the browser executes it, you get back structured data from the live page.

hybrids3

S3-compatible object storage running locally. The uploads bucket is public-read — files accessible by direct URL without signing. Auto-expiry. Tools for put, get, list, delete, info, presign, list buckets.
This solves a specific problem in agentic workflows. When a model produces something large — a scraped dataset, a generated image, a rendered report — you don’t stuff it into the context window. You put it in storage, get a URL, pass the URL. The next step can fetch it. You can fetch it. Context stays clean.
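In curl form the pattern looks roughly like this, assuming S3-style path addressing through the /storage/ route and a hypothetical object name:

# fetch a stored artifact by its public URL (uploads is public-read, object name is hypothetical)
curl http://localhost:4000/storage/uploads/scraped-dataset.json -o scraped-dataset.json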

claudebox — two instances

Agentic Claude Code in a container. Full shell access. Persistent workspaces. File I/O. Tool use. One instance on your Claude subscription or API key, one running GLM models through z.ai.
You’re using Groq for speed. Groq hits something that needs deep coding work. Groq calls the claudebox tool. Claude Code picks it up, gets a shell, writes code, runs tests, returns structured results. Back in Groq’s context. The orchestration happens inside the model’s function calling loop — you didn’t write any of that logic.

mcp_tools

generate_image, generate_tts, and search_web. Auto-enabled when any image, TTS, or search provider is active. All return structured JSON — generated files are uploaded to hybrids3 automatically with persistent URLs. No base64 blobs in the context window.
Image generation routes through FLUX, DALL-E, or stable-diffusion.cpp depending on what’s enabled. TTS routes through Kokoro (CPU), Qwen3-TTS with voice cloning (CUDA), or OpenAI TTS. Web search queries SearXNG — Google, Bing, DuckDuckGo, Wikipedia in parallel, no API key needed. The tools discover available models dynamically from LiteLLM — they always reflect what’s actually running.

telethon

Your actual Telegram account as a tool. Not the Bot API — full MTProto, same access you have on your phone. Read messages, send messages, list dialogs, manage groups, forward content, edit, delete, mark read, send files. The model decides when. The agent acts as you.

A free-tier Groq model that searches the web, scrapes a page, generates an image, and sends the result to your Saved Messages — zero tokens paid, one conversation, multiple tool calls. Or a cron job that summarizes your work group’s last 24 hours into a digest and DMs it to you every morning. Your account, programmable.

Powered by telethon-plus. Flip TELETHON=1, set TELETHON_API_ID / TELETHON_API_HASH / TELETHON_SESSION in .env. The session string is full account access — treat it like the password it effectively is.
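A minimal .env sketch, with placeholder values:

TELETHON=1
TELETHON_API_ID=123456              # from my.telegram.org (placeholder)
TELETHON_API_HASH=<api-hash>        # placeholder
TELETHON_SESSION=<session-string>   # full account access, keep it secret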

Local Inference — Slower, But Yours

Five local providers. Three of them don’t need a GPU.
This is the “poor people’s computer” part. Your old laptop with 16GB of RAM can run text generation, image generation, transcription, and speech synthesis. No API key. No network. No rate limit. No bill. It’s slower than cloud. Sometimes much slower. But it’s yours, it’s private, and it works offline.
Ollama CPU — llama3.2:3b, qwen3:4b, smollm2:1.7b, qwen2.5-coder:1.5b, qwen2.5-coder:3b, phi4-mini (reasoning, 128K context), gemma4:e2b (vision), gemma3:4b (vision), nuextract-v1.5 (structured text → JSON extraction), dolphin-phi, plus two embedding models for RAG. The smallest needs 1GB of RAM. Models auto-download on first start. Unload after 5 minutes of idle.
Ollama CUDA — all CPU models GPU-accelerated, plus heavier ones: qwen3:8b, gemma4:e4b (vision), deepseek-coder-v2:16b (MoE, 160K context), deepseek-r1:8b (reasoning), qwen3-abliterated:16b (uncensored), gemma4-abliterated:e4b (uncensored vision), qwen2.5-coder:7b, llama3.1:8b. Flash attention, quantized KV cache. Shares model storage with CPU Ollama — no duplicate downloads. Each CUDA service has its own flag: OLLAMA_CUDA=1.
Speaches — audio on CPU and CUDA. Transcription: faster-distil-whisper-large-v3 for multilingual, parakeet-tdt-0.6b-v2 for English-only at ~3400x real-time on CPU. TTS on CPU: Kokoro-82M int8, multiple voices. TTS on CUDA: Qwen3-TTS-0.6B with voice cloning via reference audio. Same OpenAI-compatible API — existing Whisper calls work as-is. Independent flags: SPEACHES=1, SPEACHES_CUDA=1, QWEN_TTS_CUDA=1.
stable-diffusion.cpp — local image generation. CPU runs sd-turbo and sdxl-turbo out of the box. SDCPP_CUDA=1 for hardware acceleration and the full model set: sd-turbo, sdxl-turbo, sdxl-lightning, flux-schnell, juggernaut-xi. Models download on first use and cache locally. The OpenAI-compatible /images/generations endpoint means existing code works as-is.
Local models sit at the end of the fallback chain by default. Cloud fails? Local picks it up. Or you target them directly: "model": "local-ollama-cpu-llama3.2-3b". Zero network. Zero cost. Slower, but it answers.

One GPU, Everything

Here’s the engineering problem: you have one GPU. Ollama wants VRAM for the LLM. sd.cpp wants VRAM for image generation. Speaches wants VRAM for transcription. Qwen3-TTS wants VRAM for speech synthesis. Load them all and you OOM.
The resource manager solves this automatically. A LiteLLM callback enforces mutual exclusion per hardware target — one CUDA job at a time. When an image generation request arrives while an LLM is loaded, the resource manager acquires the semaphore, unloads the LLM, then lets the image generation proceed. When a TTS request comes after that, it unloads the image generator first. The same logic applies on CPU.
Every service has its own unload API. Ollama: keep_alive: 0. sd.cpp: /sdcpp/v1/unload. Speaches: DELETE /api/ps/{model}. Qwen3-TTS: /unload. The resource manager knows all of them and calls the right one.
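Roughly what those calls look like, sketched with hypothetical container hostnames and ports (the real services sit on the internal Docker network, and any HTTP method not named above is an assumption):

# sketches only: hostnames, ports, and model names are illustrative
curl http://ollama-cuda:11434/api/generate -d '{"model": "qwen3:8b", "keep_alive": 0}'
curl http://sdcpp-cuda/sdcpp/v1/unload
curl -X DELETE http://speaches-cuda/api/ps/faster-distil-whisper-large-v3
curl http://qwen-tts-cuda/unload
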
You don’t manage any of this. You send requests. The platform juggles VRAM automatically. The only cost is latency — the first request after a swap includes model load time. After that, it’s fast until the idle timeout (5 minutes by default) unloads the model to free memory for the next thing.
Each CUDA service has its own flag — OLLAMA_CUDA=1, SDCPP_CUDA=1, SPEACHES_CUDA=1, QWEN_TTS_CUDA=1 — so you enable exactly the GPU services your hardware can handle. One GPU. Text, images, audio in, audio out. All of it works. Just not all at the same time.

Web Search

SearXNG at /searxng/. Self-hosted meta-search — queries Google, Bing, DuckDuckGo, and Wikipedia simultaneously. No API key. Runs entirely locally.
The MCP search_web tool means any function-calling model can search the web autonomously. The model decides it needs to look something up, calls the tool, gets results, continues reasoning. You didn’t build a search integration. You flipped SEARXNG=1.
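You can also query it directly; a sketch, assuming SearXNG's standard /search endpoint with JSON output enabled in its settings:

# direct meta-search through the gateway (format=json must be enabled in the SearXNG config)
curl "http://localhost:4000/searxng/search?q=latest+rust+release+notes&format=json" \
  -H "Authorization: Bearer $LITELLM_MASTER_KEY"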

LibreChat — The Daily Driver

Everything above works from the API. But for daily use, there’s LibreChat at /librechat/.
All models in the dropdown. All MCP tools wired in. Pick a model, start talking. The model can still invoke the browser, storage, claudebox, image generation, TTS, and web search autonomously — everything available from the API is available in the UI. Conversation history backed by MongoDB. File uploads. WebSocket streaming.
First registered user becomes admin. Set LIBRECHAT_ALLOW_REGISTRATION=false after that and you’re the only one in.
Enable it: LIBRECHAT=1 in .env.

Async Queue

Long inference requests time out. Hit /q/ instead of / and the request goes into a Redis-backed queue. You get a job ID back instantly. The actual inference runs in the background. Poll for status, fetch the result when it’s done.

# submit — returns 202 immediately
curl http://localhost:4000/q/v1/chat/completions \
  -H "Authorization: Bearer $LITELLM_MASTER_KEY" \
  -d '{"model": "cerebras-qwen3-235b", "messages": [{"role":"user","content":"write a novel"}]}'
# → {"jobId": "550e8400-e29b-41d4-a716-446655440000"}
# check status
curl http://localhost:4000/q/__jobs/550e8400-e29b-41d4-a716-446655440000 \
  -H "Authorization: Bearer $LITELLM_MASTER_KEY"
# get the result
curl http://localhost:4000/q/__jobs/550e8400-e29b-41d4-a716-446655440000/content \
  -H "Authorization: Bearer $LITELLM_MASTER_KEY"

Configurable concurrency, retention, timeouts, retries, and response caching. Only OpenAI API paths get queued — health checks and admin requests pass through directly.

Security

Internal services — PostgreSQL, MongoDB, Redis, the browser cluster, the storage backend — have no host ports. They’re on isolated Docker networks. Nothing reaches them from outside the stack. The only exposed surface is nginx on port 4000, and that requires bearer token auth.
Every container runs with no-new-privileges:true. make run validates that any file paths in .env actually exist before starting — no silently broken volume mounts.
Want it publicly reachable without opening a firewall port? CLOUDFLARED=1 — Cloudflare Tunnel. DDoS protection, TLS termination, no open ports, no IP to scan. Quick tunnel for a random *.trycloudflare.com URL, or a named tunnel for a fixed domain.
Hundreds of tests. Health checks, routing, auth, MCP tool validation, storage CRUD, browser automation, claudebox agentic runs, async job lifecycle, local TTS/STT round-trips, CUDA resource manager verification, local image generation, MCP-to-sdcpp integration, LLM-to-MCP end-to-end tool calling, SearXNG search, Telethon MTProto round-trips against a real account. Plus security: cross-token isolation, session hijack attempts, HTTP request smuggling (CL.TE/TE.CL), h2c smuggling, SSRF via browser and MCP to internal services, prompt injection key extraction, path traversal, S3 presign abuse, stored XSS, model name injection, header injection, Docker socket isolation. This isn’t a hobby project test suite. This is paranoia as a feature.

Setup

git clone https://github.com/psyb0t/aigate && cd aigate
cp .env.example .env  # edit: add keys, flip flags
make run-bg

Every variable is documented in .env.example. Enable what you have, ignore what you don’t.
If resources matter — and on a normal computer they do — make limits reads your available RAM and CPU and writes recommended limits for every service. MAXUSE=80 make limits caps the whole stack at 80% of system resources if you’re sharing the machine with other workloads. CUDA services are resource-manager-aware, so the budget counts the largest one — not all four at full allocation.

# free tier, auto-fallback
curl http://localhost:4000/chat/completions \
  -H "Authorization: Bearer $LITELLM_MASTER_KEY" \
  -d '{"model": "cerebras-qwen3-235b", "messages": [{"role":"user","content":"hello"}]}'
# local, no network, no limits
curl http://localhost:4000/chat/completions \
  -H "Authorization: Bearer $LITELLM_MASTER_KEY" \
  -d '{"model": "local-ollama-cpu-llama3.2-3b", "messages": [{"role":"user","content":"hello"}]}'
# image generation
curl http://localhost:4000/images/generations \
  -H "Authorization: Bearer $LITELLM_MASTER_KEY" \
  -d '{"model": "hf-flux-schnell", "prompt": "a cat riding a skateboard"}'
# local image generation (no network, no cost)
curl http://localhost:4000/images/generations \
  -H "Authorization: Bearer $LITELLM_MASTER_KEY" \
  -d '{"model": "local-sdcpp-cpu-sd-turbo", "prompt": "a red panda in a forest", "size": "512x512"}'
# transcription (~3400x real-time on CPU)
curl http://localhost:4000/audio/transcriptions \
  -H "Authorization: Bearer $LITELLM_MASTER_KEY" \
  -F "model=local-speaches-parakeet-tdt-0.6b" -F "file=@audio.mp3"
# text-to-speech (local, multiple voices)
curl http://localhost:4000/audio/speech \
  -H "Authorization: Bearer $LITELLM_MASTER_KEY" \
  -d '{"model": "local-speaches-kokoro-tts", "input": "Hello world", "voice": "af_heart"}' \
  -o speech.mp3
# web search (no API key, self-hosted)
curl http://localhost:4000/chat/completions \
  -H "Authorization: Bearer $LITELLM_MASTER_KEY" \
  -d '{"model": "groq-qwen3-32b", "messages": [{"role":"user","content":"search the web for latest rust release notes"}]}'
# async — submit and poll
curl http://localhost:4000/q/v1/chat/completions \
  -H "Authorization: Bearer $LITELLM_MASTER_KEY" \
  -d '{"model": "cerebras-qwen3-235b", "messages": [{"role":"user","content":"write a novel"}]}'

Six providers free, five local, the rest as fallback. Slower on a normal computer — but it runs, it’s private, and nobody can rate-limit you out of your own infrastructure.
github.com/psyb0t/aigate