Introduction
TurboLLM is a local-LLM platform that lets you run any local LLM engine, auto-tuned to your GPU — with a polished web UI and an OpenAI/Anthropic-compatible API. Bring your own llama.cpp fork. No compiling. No Electron. No Python. Point Claude Code at your own machine in one command — fully offline.
TurboLLM is the performance & bleeding-edge layer for local LLMs — built for people who today hand-compile forks and hunt forums for the right flags.
Local-LLM tools make two choices for you, and both cost you performance: they pick the engine (you can't use community forks), and they don't tell you what speed to expect (no tuning of the dozens of launch flags that make the difference between 20 and 80 tokens/sec). TurboLLM does the opposite.
Installation
TurboLLM requires Node.js 22 or newer. Install Node from nodejs.org if you don't have it.
# Run without installing (recommended for first try)
npx turbollm
# Or install globally
npm install -g turbollm
turbollm
- Node.js 22 or newer — enforced at startup with a clear message.
- Windows, macOS, or Linux.
- A GPU is recommended but not required — a CPU build is provisioned as a fallback.
- On Windows, the first time the auto-downloaded
llama-serverruns, SmartScreen/Defender may prompt (it's an upstream binary). Allow it once.
Quick Start
That one command starts a local daemon, opens a browser UI, and serves your models over an API any tool can talk to.
npx turbollm
The daemon starts on http://127.0.0.1:6996 and opens your browser
automatically. You're dropped on the Chat screen, ready to load a
model.
First Run
On first run the daemon:
- Detects your GPU and downloads a matching
llama-serverbuild (CUDA for NVIDIA, ROCm for AMD, Metal for Apple, SYCL for Intel, Vulkan otherwise — with a CPU fallback). - Starts on
http://127.0.0.1:6996and opens your browser. - Drops you on the Chat screen, ready to load a model.
Features
📦 Models
Bring your own GGUFs or browse & download from Hugging Face. Quant recommendation per GPU, VRAM-fit verdict, real-time measured t/s, delete-from-disk, and pin favourites.
⚡ Auto-tuning
Auto-benchmark on load, recommended sampling from Hugging Face, real measured tokens/sec, full load-parameter UI, fast by default, multi-GPU per model, saved per-model profiles.
💬 Chat & Agentic
Streaming with stop button, thinking control, markdown + syntax-highlighted code, live artifacts, personas, persistent conversations, per-chat system prompt and sampling, image input, PDF/text attachments, agentic tools with a tool-call approval gate.
🤖 Background Agents
Launch an agent and walk away. Live, reconnectable progress. Cancel anytime, review completed runs later.
🔌 APIs & Integrations
OpenAI-compatible, Anthropic-compatible (including tool use and streaming), structured output via GBNF grammar, API-key auth, gateway that loads models on demand.
🪶 Platform
~0.3 MB npm package, offline-first, no account/backend/internet/telemetry, Windows · macOS · Linux, CPU fallback.
📦 Models — bring your own, or browse Hugging Face
- Use the folders you already have. Point TurboLLM at any directory of GGUFs — your existing LM Studio / Ollama / manual downloads — no re-downloading.
- Browse & download from Hugging Face, in-app: a live, sortable list alongside a permanent detail pane — pick a quant, read the rendered model card, and download with resume + SHA-256 verification.
- Import from any URL — not just Hugging Face.
- Quant recommendation per GPU and a VRAM-fit verdict.
- Primary download folder, real-time measured t/s per model, delete-from-disk, and pin your favourites to the top of the list.
⚡ Auto-tuning & performance
- Auto-benchmark on load derives fast defaults for your exact GPU.
- Recommended sampling from Hugging Face — auto-tune checks a repo's structured params /
generation_config.jsonsidecar first. - Real measured tokens/sec in the model list — live while generating, last-session when idle.
- Full load-parameter UI, a superset of what other tools expose: context length, GPU offload (
-ngl), MoE CPU-offload (--n-cpu-moe), parallel slots, KV-cache quant type, CPU threads, flash attention, and speculative decoding. - Fast by default: flash attention on, NextN self-speculative decoding on for models that carry a draft head.
- Multi-GPU, per model — split a model across cards.
- Saved per-model profiles, per engine — tune once per (model, engine) pair.
- VRAM headroom slider (Settings → Engine, 300 MB–2 GB, default 1 GB) — tell auto-tune how much VRAM to keep free for other GPU workloads (ComfyUI, a browser full of tabs, etc.) instead of a fixed margin.
💬 Chat & agentic tools
- Streaming with a stop button, live tokens/sec, prompt-processing % and prefill t/s.
- Thinking control — toggle reasoning off for a direct answer, or leave it on with collapsible thought blocks.
- Markdown + syntax-highlighted code with one-click copy.
- Live artifacts —
html,svg, andmermaidreplies render as sandboxed, offline previews. - Personas — pick a style per conversation.
- Edit, regenerate, delete, copy any message; persistent, searchable conversations organized into drag-resizable, collapsible folders.
- Per-chat system prompt and per-chat sampling overrides.
- Image input for vision models, PDF and code/text attachments.
- Agentic tools — built-in
web_search,fetch_url, and sandboxedrun_code, plus an MCP marketplace. - Tool-call approval gate — every tool call asks for approval by default (Deny, Allow, Allow for this chat, Always Allow), with per-tool defaults configurable in Developer → Tool permissions.
🤖 Background agents
- Launch an agent and walk away. The Agents screen runs tasks in the daemon, separate from the chat tab.
- Live, reconnectable progress. Watch the run stream in real time; navigate away or reload and the view reconnects.
- Cancel anytime, and review completed runs later.
🔌 APIs & integrations
- OpenAI-compatible
/v1/chat/completions,/v1/embeddings, … - Anthropic-compatible
/v1/messages— including tool use and streaming. - Structured output — constrain any response to a GBNF grammar.
- API-key auth you can require when sharing over a LAN.
- The gateway loads models for you. Name any model in your API request and TurboLLM loads it on the fly.
🎨 Share the GPU with ComfyUI
- The instant ComfyUI starts a render, TurboLLM unloads its model and pauses new loads.
- When ComfyUI's queue drains, TurboLLM reloads the exact model it unloaded.
- Push-based, not polling — ComfyUI signals TurboLLM the moment a job starts/ends.
Engines
No other local-LLM app lets you run whatever inference engine you want. TurboLLM treats the engine as a swappable component.
Add a custom engine
(Engines screen → Add engine)
- Compile or download any
llama-server-compatible binary — stock llama.cpp, a community fork, or your own build. - Point TurboLLM at the folder — it scans for the
llama-serverbinary, runs a capability probe, and learns exactly which flags and features that build supports. - Activate it. The load-parameter UI adapts to that engine.
Any llama-server-compatible binary. This includes stock llama.cpp,
TurboQuant fork, and any community fork that implements the llama-server API.
Auto-Tuning
TurboLLM auto-tunes to your hardware on load. It benchmarks your exact GPU, derives fast defaults, and shows a VRAM-fit verdict before you load — no more flag guessing.
What it tunes
- Context length (
-c) - GPU offload (
-ngl) - MoE CPU-offload (
--n-cpu-moe) - Parallel slots
- KV-cache quant type
- CPU threads
- Flash attention
- Speculative decoding (NextN self-speculative for models with a draft head)
Measured speed
Speed in the model list is measured on your machine from actual generation — live while you chat, and remembered per model.
Same GPU (RTX 5070 Ti 16 GB), same model, same 200K context — measured generation speed.
| Qwen3.6-35B-A3B · 200K | TurboLLM | LM Studio | Speed-up |
|---|---|---|---|
| official llama.cpp — q4_0 | 74.7 t/s | 61.0 t/s | 1.2× |
| official llama.cpp — q8_0 | 72.3 t/s | ~66 t/s | 1.1× |
| TurboQuant fork — turbo4 | 24.6 t/s | 11.4 t/s | 2.2× |
Chat
The Chat screen is the primary interface for interacting with your local models.
Features
- Streaming responses with a stop button
- Live tokens/sec, prompt-processing % and prefill t/s
- Thinking control — toggle reasoning off for a direct answer, or leave it on with collapsible thought blocks
- Markdown + syntax-highlighted code with one-click copy
- Live artifacts —
html,svg, andmermaidreplies render as sandboxed, offline previews - Personas — pick a style per conversation
- Edit, regenerate, delete, copy any message
- Persistent, searchable conversations organized into drag-resizable, collapsible folders
- Per-chat system prompt and per-chat sampling overrides
- Image input for vision models, PDF and code/text attachments
- Tool-call approval gate — a bar above the message box asks Deny / Allow / Allow for this chat / Always Allow before any tool runs, with per-tool defaults in Developer → Tool permissions
Agents
The Agents screen lets you launch tasks and walk away. Tasks run in the daemon, separate from the chat tab.
Features
- Live, reconnectable progress — watch the run stream in real time; navigate away or reload and the view reconnects
- Cancel anytime, and review completed runs later
- Built-in tools —
web_search,fetch_url, and sandboxedrun_code - MCP marketplace — extend with community tools
The chat tool-call approval gate is a chat-only safety prompt — there's no one to ask in a background run. A tool an agent is configured to use runs without a prompt.
API
TurboLLM serves OpenAI and Anthropic-compatible APIs, so any tool can talk to your local models.
OpenAI-compatible
GET /v1/models
GET /v1/models/{model}
POST /v1/chat/completions
POST /v1/embeddings
GET /v1/health
Anthropic-compatible
POST /v1/messages # chat completions
POST /v1/messages?stream=true # streaming chat
Includes tool use and streaming.
Structured output
Constrain any response to a GBNF grammar via the response_format
parameter.
API-key auth
You can require an API key when sharing over a LAN. Set the key in your config or via
the TURBOLLM_API_KEY environment variable.
Gateway mode
Name any model in your API request and TurboLLM loads it on the fly. This is useful for tools that expect a specific model name.
Command-line Interface
turbollm # start on :6996, open browser
turbollm --port 9000 # listen on a specific port
turbollm --no-open # start without opening a browser
turbollm --addr 0.0.0.0:6996 # bind all interfaces (LAN sharing)
turbollm --tunnel --no-open # expose on the internet via a cloudflared quick tunnel
turbollm --stop # stop a running daemon (any terminal)
turbollm launch claude # start Claude Code (auto-loads a model if none is running)
turbollm --help, -h # show usage and exit
| Flag | Description |
|---|---|
--port <n> |
Listen on a specific port (default: 6996) |
--addr <host:port> |
Full host:port override, e.g. 0.0.0.0:6996 for LAN sharing |
--no-open |
Start without opening a browser window |
--tunnel |
Expose this daemon on the internet via a cloudflared quick tunnel (Cloud Launch) — prints the public URL plus a required access token. For running TurboLLM on a rented cloud GPU box. |
--config <file> |
Path to a custom config file |
--stop |
Stop a running TurboLLM daemon (reads ~/.turbollm/daemon.pid) |
--help, -h |
Show usage and exit |
Configuration & Data
Everything lives under ~/.turbollm/ on every OS —
config.json, the SQLite chat database, downloaded engines, models
cache, and logs. Back it up or delete it to reset. Use
--config <file> to point at an alternate config.
Environment variables
| Variable | Description |
|---|---|
TURBOLLM_API_KEY |
Require API-key auth on the API endpoints |
TURBOLLM_CONFIG |
Path to an alternate config file |
Models
TurboLLM supports any GGUF model. You can use folders you already have or browse & download from Hugging Face directly in the app.
Model management
- Use existing folders — point TurboLLM at any directory of GGUFs, no re-downloading
- Browse & download from Hugging Face with resume + SHA-256 verification
- Import from any URL — not just Hugging Face
- Quant recommendation per GPU and a VRAM-fit verdict
- Real-time measured t/s per model
- Delete-from-disk and pin your favourites
Sharing the GPU
Use TurboLLM from any device on your network:
turbollm --addr 0.0.0.0:6996 # bind all interfaces, then open http://<your-ip>:6996
The UI runs in the browser, so any phone, tablet, or laptop on your LAN can use the model on your GPU box.
Cloud Launch — expose over the internet
Running TurboLLM on a rented cloud GPU box instead of your own LAN? --tunnel
opens a cloudflared quick tunnel
and prints a public URL plus a required access token, so only requests carrying that
token are served.
turbollm --tunnel --no-open
ComfyUI Integration
Share your GPU with ComfyUI — when ComfyUI renders, TurboLLM pauses.
How it works
- The instant ComfyUI starts a render, TurboLLM unloads its model and pauses new loads
- When ComfyUI's queue drains, TurboLLM reloads the exact model it unloaded
- Push-based, not polling — ComfyUI signals TurboLLM the moment a job starts/ends
Claude Code
TurboLLM's Anthropic-compatible endpoint means Claude Code can run against whatever model you've loaded — no cloud key, fully offline.
turbollm launch claude # auto-loads a model if none is running
turbollm launch claude --model qwen3-8b # load a specific model first
Develop from Source
npm install # daemon deps
cd web && npm install && cd ..
npm run build:web # build the React UI -> src/webdist
npm run start # run the daemon in dev -> :6996
npm run build # production bundle -> dist/cli.js
node dist/cli.js --port 6996
Stack: Node ≥22 · TypeScript · Hono · node:sqlite
· tsup — and a React 19 + Tailwind v4 + shadcn/ui frontend.
Troubleshooting
| Issue | Solution |
|---|---|
TurboLLM requires Node.js 22 or newer |
Upgrade Node: nodejs.org |
| Model won't load / OOM | Pick a smaller quant, lower GPU offload, or close other GPU apps |
| Windows Defender / SmartScreen prompt | That's the upstream llama-server binary on first run; allow it once |
| Port already in use | turbollm --port 9000 |
| Slow generation | Ensure GPU offload is high and flash attention / NextN are on for supported models |
Privacy
TurboLLM is offline-first: core local use needs no account, no backend, and no internet. No analytics or telemetry are collected. Your prompts, chats, files, and keys never leave your machine.
License
Source-available under the Functional Source License 1.1 (Apache-2.0 future grant) —
SPDX FSL-1.1-ALv2. Free for personal use, internal business use,
education, and research; the only restriction is shipping a competing product.