aigate: Your Own Fucking AI Infrastructure

Here’s what actually happens when you have this running.

You send a prompt to Groq — free tier, no billing, fast as hell. The model decides it needs to look something up. It calls the browser tool. A Camoufox instance opens a real Firefox, moves a real mouse cursor, loads the page, extracts the content. The model reads it, decides it needs to save the result. It calls the storage tool. hybrids3 writes the file, returns a public URL. The model hands you back a structured answer with a link to the saved artifact. One API call. Zero tokens paid.

That’s aigate. Not a chat wrapper. Infrastructure.

The routing is built around one principle: you should never pay for tokens when you could get them free. Six cloud providers — Groq, Cerebras, OpenRouter, HuggingFace, Mistral, Cohere — are free tier. Two more (claudebox, claudebox-zai) are flat-rate subscriptions, so the marginal token costs nothing. Anthropic and OpenAI are in the stack but they’re last. By default you’re burning through free quota before you ever touch a pay-per-token provider.

LiteLLM handles the actual routing. Groq rate-limits you mid-session? Falls to Cerebras. Cerebras is down? OpenRouter. The client doesn’t know or care. It’s still hitting http://localhost:4000 with a standard OpenAI API call.
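The fallback behavior reduces to a priority loop. A minimal sketch with toy providers standing in for LiteLLM's actual routing — provider names mirror the chain, everything else is illustrative:

```python
# Toy fallback chain: try providers in priority order, skip any that
# rate-limit. Illustrative only — not LiteLLM's implementation.
class RateLimitError(Exception):
    pass

def route(prompt, providers):
    """Return (provider_name, response) from the first provider that answers."""
    for name, call in providers:
        try:
            return name, call(prompt)
        except RateLimitError:
            continue  # fall through to the next provider in the chain
    raise RuntimeError("all providers exhausted")

# Groq rate-limits mid-session; Cerebras picks it up.
def groq(prompt):
    raise RateLimitError

def cerebras(prompt):
    return f"cerebras says: {prompt}"

chain = [("groq", groq), ("cerebras", cerebras)]
print(route("hello", chain))  # → ('cerebras', 'cerebras says: hello')
```

The client never sees the failover — same endpoint, same request, whichever provider ends up answering.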

nginx :4000
├─► /claudebox/            → claudebox API
├─► /claudebox-zai/        → claudebox-zai API
├─► /stealthy-auto-browse/ → HAProxy → [browser ×5]
├─► /storage/              → hybrids3
└─► /                      → LiteLLM
    ├─ Groq (free)
    ├─ Cerebras (free)
    ├─ OpenRouter (free tier)
    ├─ HuggingFace (free)
    ├─ Mistral (free: 1B tokens/month)
    ├─ Cohere (free: 1K req/day)
    ├─ Ollama (local CPU, no limits)
    ├─ Speaches (local CPU, transcription + TTS)
    ├─ claudebox (flat-rate, Max sub or API key)
    ├─ claudebox-zai (flat-rate, z.ai)
    ├─ Anthropic (pay-per-token, optional)
    └─ OpenAI (pay-per-token, optional)

Everything is opt-in. You flip flags in .env: GROQ=1, CEREBRAS=1, BROWSER=1, HYBRIDS3=1. Don’t have an Anthropic key? Don’t set it. The stack adapts to what you’ve got and rebuilds its config accordingly.
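The adaptive-config idea, sketched as a toy function. GROQ and CEREBRAS match flags shown above; the remaining flag names and the priority list are assumptions for illustration:

```python
# Flag-driven chain: only providers whose .env flag is "1" make it in.
# The priority order here is an assumption, not aigate's actual config.
PRIORITY = ["GROQ", "CEREBRAS", "OPENROUTER", "HUGGINGFACE",
            "MISTRAL", "COHERE", "OLLAMA", "ANTHROPIC", "OPENAI"]

def enabled_chain(env):
    """Build the fallback chain from whatever flags are actually set."""
    return [p for p in PRIORITY if env.get(p) == "1"]

# No Anthropic key? Don't set the flag — it simply drops out.
print(enabled_chain({"GROQ": "1", "CEREBRAS": "1"}))  # → ['GROQ', 'CEREBRAS']
```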

34 Tools Any Model Can Use

This is the part that’s actually interesting. aigate isn’t a proxy with a fancy fallback chain — it’s a gateway with infrastructure baked in as function calls. Four MCP servers, 34 tools, wired directly in. Any model with function calling can invoke any of them autonomously, no custom code required.

stealthy_auto_browse — 17 tools

Five Camoufox replicas behind HAProxy. Camoufox is hardened Firefox — real OS-level mouse and keyboard input via PyAutoGUI, zero CDP exposure, persistent fingerprints per session. Passes Cloudflare. Passes CreepJS. Passes BrowserScan. Not “mostly passes” — actually passes, because it’s not detectable as automation in the ways those systems check.

17 tools covering navigation, clicking, typing, screenshots, content extraction, JavaScript execution, cookie management, file uploads, scrolling, and waiting for elements. The kind of browser automation that doesn’t immediately get 403’d because it doesn’t look like automation.

When a model invokes this, it’s not getting a screenshot back and trying to OCR it. It’s getting structured content, extracted from the live page, that it can actually reason over.
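To make "structured content" concrete, here's a toy round trip. The extraction logic and the result's field names are assumptions, not the stealthy_auto_browse schema:

```python
import re

# Toy "extract content" step: strip markup, collapse whitespace, hand
# the model text it can reason over instead of pixels to OCR.
def extract_content(html):
    return re.sub(r"\s+", " ", re.sub(r"<[^>]+>", " ", html)).strip()

page = "<html><body><h1>Pricing</h1><p>Free tier: 1B tokens/month</p></body></html>"
result = {"url": "https://example.com/pricing", "text": extract_content(page)}
print(result["text"])  # → Pricing Free tier: 1B tokens/month
```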

hybrids3 — 7 tools

S3-compatible object storage running locally. The uploads bucket is public-read — files are accessible by direct URL without signing. Auto-expiry. Seven tools covering put object, get object, list, delete, and presigned URL generation.

This solves a specific problem in agentic workflows. When a model produces something large — a scraped dataset, a generated file, a rendered report — you don’t want to stuff it back into the context window. You put it in storage, get a URL, pass the URL. The next agent step can fetch it. You can fetch it. It just works.
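The pattern, sketched with storage mocked as a dict — put_object/get_object stand in for the hybrids3 tools, and the URL shape is an assumption:

```python
# Pass-by-URL: the context window carries a short URL, never the payload.
BUCKET = {}

def put_object(key, data):
    BUCKET[key] = data
    return f"http://localhost:4000/storage/uploads/{key}"  # public-read, no signing

def get_object(url):
    return BUCKET[url.rsplit("/", 1)[-1]]

# The model produced a large artifact; the next agent step (or you)
# fetches it on demand instead of dragging it through every prompt.
url = put_object("scrape-results.json", '{"rows": 12000}')
assert len(url) < 100
assert get_object(url) == '{"rows": 12000}'
```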

claudebox — 5 tools

Agentic Claude Code via OAuth or API key. Full shell access. Persistent workspaces. Five tools that let any model in the stack hand off a complex task to a Claude Code agent and get results back.

The practical case: you’re using Groq for speed and cost. Groq hits something that actually needs deep coding work or multi-step reasoning. Groq calls the claudebox tool. Claude Code picks it up, gets a shell, does the work, returns structured results. Back in Groq’s context. The orchestration happens inside the model’s function calling loop — you didn’t write any of that logic.
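A stripped-down sketch of that loop, with a fake model and a fake claudebox tool — message shapes loosely follow the OpenAI tool-calling convention, heavily simplified:

```python
# The orchestration lives in the function-calling loop: run whatever
# tool the model asks for, feed the result back, repeat until it answers.
def run_loop(model, tools, messages):
    while True:
        reply = model(messages)
        if "tool_call" not in reply:
            return reply["content"]          # final answer, back to the client
        name, args = reply["tool_call"]
        result = tools[name](args)           # execute the tool
        messages.append({"role": "tool", "name": name, "content": result})

def fake_groq(messages):
    # First turn: delegate to claudebox. Second turn: wrap up the result.
    if not any(m["role"] == "tool" for m in messages):
        return {"tool_call": ("claudebox", "refactor the parser")}
    return {"content": "done: " + messages[-1]["content"]}

tools = {"claudebox": lambda task: f"claudebox finished: {task}"}
answer = run_loop(fake_groq, tools, [{"role": "user", "content": "fix my code"}])
print(answer)  # → done: claudebox finished: refactor the parser
```

From the client's side this whole exchange is still one request and one response.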

claudebox_zai — 5 tools

Same architecture, different model. GLM models via z.ai’s flat-rate subscription. Five tools, same interface. Useful when you want a second agentic Claude Code instance with different pricing or different model characteristics.

The concrete workflow — Groq + browser + storage + claudebox in one call — looks like this from the outside: you send a prompt, you get back a structured answer. What happened in between: the model browsed pages, saved intermediate data to storage, potentially handed work off to a coding agent. All of it autonomous. All of it inside a single API call. The client sees one request and one response.

Local Inference

Two local providers run on CPU, no GPU required.

Ollama runs small models that fit in CPU RAM: llama3.2:3b, qwen3:4b, smollm2:1.7b, qwen2.5-coder:1.5b, qwen2.5-coder:3b, phi3.5, moondream for vision, plus embedding models for RAG. They’re last in the fallback chain — used when cloud providers are unavailable or rate-limited — but you can target them directly if you want inference with zero network dependency. Ollama unloads models after 5 minutes of inactivity so RAM isn’t constantly consumed.

Speaches handles audio. Transcription: faster-distil-whisper-large-v3 for multilingual, parakeet-tdt-0.6b-v2 for English-only at around 3400x real-time on CPU. TTS: Kokoro-82M int8, multiple voices. Same OpenAI-compatible API — /audio/transcriptions, /audio/speech — so existing Whisper API calls work as-is. No API key. No per-minute billing. No sending your audio anywhere.

Security

Internal services — PostgreSQL, Redis, the browser cluster, the storage backend — have no host ports. They’re on isolated Docker networks. Nothing reaches them from outside the stack. The only exposed surface is nginx on port 4000, and that requires bearer token auth.

Every container runs with no-new-privileges:true. If you want the gateway publicly reachable without opening a firewall port, enable CLOUDFLARED=1 and it tunnels through Cloudflare — DDoS protection, TLS termination, no open ports, no IP to scan.
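In compose terms the pattern looks roughly like this — a hedged sketch of the isolation described above, not aigate's actual compose file:

```yaml
services:
  postgres:
    image: postgres:16
    networks: [backend]            # no ports: section — unreachable from the host
    security_opt:
      - no-new-privileges:true
  nginx:
    image: nginx:alpine
    ports: ["4000:4000"]           # the only exposed surface
    networks: [backend, edge]      # bridges internal services to the outside
    security_opt:
      - no-new-privileges:true
networks:
  backend:
    internal: true                 # Docker drops external routing entirely
  edge: {}
```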

make run validates that any file paths set in .env actually exist before starting Docker. If something’s missing, the stack refuses to start with a clear error. Better than silently mounting a broken directory.

Setup

git clone https://github.com/psyb0t/aigate && cd aigate
cp .env.example .env  # edit with your keys + enable flags
make run-bg

Every variable is documented with comments in .env.example. Enable what you have — flip the flags for the providers and services you want. If you’re on a machine where resources matter, make limits reads your available RAM and CPU and writes recommended limits for every service into .env.limits. MAXUSE=80 make limits caps the whole stack at 80% of system resources if you’re sharing the machine with other workloads.

Gateway comes up at http://localhost:4000. Admin UI at /ui/. OpenAI-compatible — point any existing OpenAI client at it.

# free tier, auto-fallback
curl http://localhost:4000/chat/completions \
  -H "Authorization: Bearer $LITELLM_MASTER_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model": "cerebras-qwen3-235b", "messages": [{"role":"user","content":"hello"}]}'
# local, no network, no limits
curl http://localhost:4000/chat/completions \
  -H "Authorization: Bearer $LITELLM_MASTER_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model": "local-ollama-llama3.2-3b", "messages": [{"role":"user","content":"hello"}]}'
# local transcription
curl http://localhost:4000/audio/transcriptions \
  -H "Authorization: Bearer $LITELLM_MASTER_KEY" \
  -F "model=local-speaches-parakeet-tdt-0.6b" -F "file=@audio.mp3"

82 models. 12 providers. Most of it free. All of it yours.

github.com/psyb0t/aigate