TurboLLM — Documentation

Introduction

TurboLLM is a local-LLM platform that lets you run any local LLM engine, auto-tuned to your GPU — with a polished web UI and an OpenAI/Anthropic-compatible API. Bring your own llama.cpp fork. No compiling. No Electron. No Python. Point Claude Code at your own machine in one command — fully offline.

TurboLLM is the performance & bleeding-edge layer for local LLMs — built for people who today hand-compile forks and hunt forums for the right flags.

Why TurboLLM?

Local-LLM tools make two choices for you, and both cost you performance: they pick the engine (you can't use community forks), and they don't tell you what speed to expect (no tuning of the dozens of launch flags that make the difference between 20 and 80 tokens/sec). TurboLLM does the opposite.

Installation

TurboLLM requires Node.js 22 or newer. Install Node from nodejs.org if you don't have it.

# Run without installing (recommended for first try)
npx turbollm

# Or install globally
npm install -g turbollm
turbollm

Requirements

Node.js 22 or newer — enforced at startup with a clear message.
Windows, macOS, or Linux.
A GPU is recommended but not required — a CPU build is provisioned as a fallback.
On Windows, the first time the auto-downloaded llama-server runs, SmartScreen/Defender may prompt (it's an upstream binary). Allow it once.

Quick Start

That one command starts a local daemon, opens a browser UI, and serves your models over an API any tool can talk to.

npx turbollm

The daemon starts on http://127.0.0.1:6996 and opens your browser automatically. You're dropped on the Chat screen, ready to load a model.

First Run

On first run the daemon:

Detects your GPU and downloads a matching llama-server build (CUDA for NVIDIA, ROCm for AMD, Metal for Apple, SYCL for Intel, Vulkan otherwise — with a CPU fallback).
Starts on http://127.0.0.1:6996 and opens your browser.
Drops you on the Chat screen, ready to load a model.

Features

📦 Models

Bring your own GGUFs or browse & download from Hugging Face. Quant recommendation per GPU, VRAM-fit verdict, real-time measured t/s, delete-from-disk, and pin favourites.

⚡ Auto-tuning

Auto-benchmark on load, recommended sampling from Hugging Face, real measured tokens/sec, full load-parameter UI, fast by default, multi-GPU per model, saved per-model profiles.

💬 Chat & Agentic

Streaming with stop button, thinking control, markdown + syntax-highlighted code, live artifacts, personas, persistent conversations, per-chat system prompt and sampling, image input, PDF/text attachments, agentic tools with a tool-call approval gate.

🤖 Background Agents

Launch an agent and walk away. Live, reconnectable progress. Cancel anytime, review completed runs later.

🔌 APIs & Integrations

OpenAI-compatible, Anthropic-compatible (including tool use and streaming), structured output via GBNF grammar, API-key auth, gateway that loads models on demand.

🪶 Platform

~0.3 MB npm package, offline-first, no account/backend/internet/telemetry, Windows · macOS · Linux, CPU fallback.

📦 Models — bring your own, or browse Hugging Face

Use the folders you already have. Point TurboLLM at any directory of GGUFs — your existing LM Studio / Ollama / manual downloads — no re-downloading.
Browse & download from Hugging Face, in-app: a live, sortable list alongside a permanent detail pane — pick a quant, read the rendered model card, and download with resume + SHA-256 verification.
Import from any URL — not just Hugging Face.
Quant recommendation per GPU and a VRAM-fit verdict.
Primary download folder, real-time measured t/s per model, delete-from-disk, and pin your favourites to the top of the list.

⚡ Auto-tuning & performance

Auto-benchmark on load derives fast defaults for your exact GPU.
Recommended sampling from Hugging Face — auto-tune checks a repo's structured params / generation_config.json sidecar first.
Real measured tokens/sec in the model list — live while generating, last-session when idle.
Full load-parameter UI, a superset of what other tools expose: context length, GPU offload (-ngl), MoE CPU-offload (--n-cpu-moe), parallel slots, KV-cache quant type, CPU threads, flash attention, and speculative decoding.
Fast by default: flash attention on, NextN self-speculative decoding on for models that carry a draft head.
Multi-GPU, per model — split a model across cards.
Saved per-model profiles, per engine — tune once per (model, engine) pair.
VRAM headroom slider (Settings → Engine, 300 MB–2 GB, default 1 GB) — tell auto-tune how much VRAM to keep free for other GPU workloads (ComfyUI, a browser full of tabs, etc.) instead of a fixed margin.

💬 Chat & agentic tools

Streaming with a stop button, live tokens/sec, prompt-processing % and prefill t/s.
Thinking control — toggle reasoning off for a direct answer, or leave it on with collapsible thought blocks.
Markdown + syntax-highlighted code with one-click copy.
Live artifacts — html, svg, and mermaid replies render as sandboxed, offline previews.
Personas — pick a style per conversation.
Edit, regenerate, delete, copy any message; persistent, searchable conversations organized into drag-resizable, collapsible folders.
Per-chat system prompt and per-chat sampling overrides.
Image input for vision models, PDF and code/text attachments.
Agentic tools — built-in web_search, fetch_url, and sandboxed run_code, plus an MCP marketplace.
Tool-call approval gate — every tool call asks for approval by default (Deny, Allow, Allow for this chat, Always Allow), with per-tool defaults configurable in Developer → Tool permissions.

🤖 Background agents

Launch an agent and walk away. The Agents screen runs tasks in the daemon, separate from the chat tab.
Live, reconnectable progress. Watch the run stream in real time; navigate away or reload and the view reconnects.
Cancel anytime, and review completed runs later.

🔌 APIs & integrations

OpenAI-compatible /v1/chat/completions, /v1/embeddings, …
Anthropic-compatible /v1/messages — including tool use and streaming.
Structured output — constrain any response to a GBNF grammar.
API-key auth you can require when sharing over a LAN.
The gateway loads models for you. Name any model in your API request and TurboLLM loads it on the fly.

🎨 Share the GPU with ComfyUI

The instant ComfyUI starts a render, TurboLLM unloads its model and pauses new loads.
When ComfyUI's queue drains, TurboLLM reloads the exact model it unloaded.
Push-based, not polling — ComfyUI signals TurboLLM the moment a job starts/ends.

Engines

No other local-LLM app lets you run whatever inference engine you want. TurboLLM treats the engine as a swappable component.

Add a custom engine

(Engines screen → Add engine)

Compile or download any llama-server-compatible binary — stock llama.cpp, a community fork, or your own build.
Point TurboLLM at the folder — it scans for the llama-server binary, runs a capability probe, and learns exactly which flags and features that build supports.
Activate it. The load-parameter UI adapts to that engine.

Supported engines

Any llama-server-compatible binary. This includes stock llama.cpp, TurboQuant fork, and any community fork that implements the llama-server API.

Auto-Tuning

TurboLLM auto-tunes to your hardware on load. It benchmarks your exact GPU, derives fast defaults, and shows a VRAM-fit verdict before you load — no more flag guessing.

What it tunes

Context length (-c)
GPU offload (-ngl)
MoE CPU-offload (--n-cpu-moe)
Parallel slots
KV-cache quant type
CPU threads
Flash attention
Speculative decoding (NextN self-speculative for models with a draft head)

Measured speed

Speed in the model list is measured on your machine from actual generation — live while you chat, and remembered per model.

Speed comparison: TurboLLM vs LM Studio

Same GPU (RTX 5070 Ti 16 GB), same model, same 200K context — measured generation speed.

Qwen3.6-35B-A3B · 200K	TurboLLM	LM Studio	Speed-up
official llama.cpp — q4_0	74.7 t/s	61.0 t/s	1.2×
official llama.cpp — q8_0	72.3 t/s	~66 t/s	1.1×
TurboQuant fork — turbo4	24.6 t/s	11.4 t/s	2.2×

Chat

The Chat screen is the primary interface for interacting with your local models.

Features

Streaming responses with a stop button
Live tokens/sec, prompt-processing % and prefill t/s
Thinking control — toggle reasoning off for a direct answer, or leave it on with collapsible thought blocks
Markdown + syntax-highlighted code with one-click copy
Live artifacts — html, svg, and mermaid replies render as sandboxed, offline previews
Personas — pick a style per conversation
Edit, regenerate, delete, copy any message
Persistent, searchable conversations organized into drag-resizable, collapsible folders
Per-chat system prompt and per-chat sampling overrides
Image input for vision models, PDF and code/text attachments
Tool-call approval gate — a bar above the message box asks Deny / Allow / Allow for this chat / Always Allow before any tool runs, with per-tool defaults in Developer → Tool permissions

Agents

The Agents screen lets you launch tasks and walk away. Tasks run in the daemon, separate from the chat tab.

Features

Live, reconnectable progress — watch the run stream in real time; navigate away or reload and the view reconnects
Cancel anytime, and review completed runs later
Built-in tools — web_search, fetch_url, and sandboxed run_code
MCP marketplace — extend with community tools

Approval gate doesn't apply here

The chat tool-call approval gate is a chat-only safety prompt — there's no one to ask in a background run. A tool an agent is configured to use runs without a prompt.

API

TurboLLM serves OpenAI and Anthropic-compatible APIs, so any tool can talk to your local models.

OpenAI-compatible

GET  /v1/models
GET  /v1/models/{model}
POST /v1/chat/completions
POST /v1/embeddings
GET  /v1/health

Anthropic-compatible

POST /v1/messages            # chat completions
POST /v1/messages?stream=true  # streaming chat

Includes tool use and streaming.

Structured output

Constrain any response to a GBNF grammar via the response_format parameter.

API-key auth

You can require an API key when sharing over a LAN. Set the key in your config or via the TURBOLLM_API_KEY environment variable.

Gateway mode

Name any model in your API request and TurboLLM loads it on the fly. This is useful for tools that expect a specific model name.

Command-line Interface

turbollm                        # start on :6996, open browser
turbollm --port 9000            # listen on a specific port
turbollm --no-open              # start without opening a browser
turbollm --addr 0.0.0.0:6996    # bind all interfaces (LAN sharing)
turbollm --tunnel --no-open     # expose on the internet via a cloudflared quick tunnel
turbollm --stop                 # stop a running daemon (any terminal)
turbollm launch claude          # start Claude Code (auto-loads a model if none is running)
turbollm --help, -h             # show usage and exit

Flag	Description
`--port <n>`	Listen on a specific port (default: `6996`)
`--addr <host:port>`	Full host:port override, e.g. `0.0.0.0:6996` for LAN sharing
`--no-open`	Start without opening a browser window
`--tunnel`	Expose this daemon on the internet via a cloudflared quick tunnel (Cloud Launch) — prints the public URL plus a required access token. For running TurboLLM on a rented cloud GPU box.
`--config <file>`	Path to a custom config file
`--stop`	Stop a running TurboLLM daemon (reads `~/.turbollm/daemon.pid`)
`--help`, `-h`	Show usage and exit

Configuration & Data

Everything lives under ~/.turbollm/ on every OS — config.json, the SQLite chat database, downloaded engines, models cache, and logs. Back it up or delete it to reset. Use --config <file> to point at an alternate config.

Environment variables

Variable	Description
`TURBOLLM_API_KEY`	Require API-key auth on the API endpoints
`TURBOLLM_CONFIG`	Path to an alternate config file

Models

TurboLLM supports any GGUF model. You can use folders you already have or browse & download from Hugging Face directly in the app.

Model management

Use existing folders — point TurboLLM at any directory of GGUFs, no re-downloading
Browse & download from Hugging Face with resume + SHA-256 verification
Import from any URL — not just Hugging Face
Quant recommendation per GPU and a VRAM-fit verdict
Real-time measured t/s per model
Delete-from-disk and pin your favourites

Sharing the GPU

Use TurboLLM from any device on your network:

turbollm --addr 0.0.0.0:6996    # bind all interfaces, then open http://<your-ip>:6996

The UI runs in the browser, so any phone, tablet, or laptop on your LAN can use the model on your GPU box.

Cloud Launch — expose over the internet

Running TurboLLM on a rented cloud GPU box instead of your own LAN? --tunnel opens a cloudflared quick tunnel and prints a public URL plus a required access token, so only requests carrying that token are served.

turbollm --tunnel --no-open

ComfyUI Integration

Share your GPU with ComfyUI — when ComfyUI renders, TurboLLM pauses.

How it works

The instant ComfyUI starts a render, TurboLLM unloads its model and pauses new loads
When ComfyUI's queue drains, TurboLLM reloads the exact model it unloaded
Push-based, not polling — ComfyUI signals TurboLLM the moment a job starts/ends

Claude Code

TurboLLM's Anthropic-compatible endpoint means Claude Code can run against whatever model you've loaded — no cloud key, fully offline.

turbollm launch claude               # auto-loads a model if none is running
turbollm launch claude --model qwen3-8b   # load a specific model first

Develop from Source

npm install                  # daemon deps
cd web && npm install && cd ..

npm run build:web            # build the React UI -> src/webdist
npm run start                # run the daemon in dev -> :6996

npm run build                # production bundle -> dist/cli.js
node dist/cli.js --port 6996

Stack: Node ≥22 · TypeScript · Hono · node:sqlite · tsup — and a React 19 + Tailwind v4 + shadcn/ui frontend.

Troubleshooting

Common issues

Issue	Solution
`TurboLLM requires Node.js 22 or newer`	Upgrade Node: nodejs.org
Model won't load / OOM	Pick a smaller quant, lower GPU offload, or close other GPU apps
Windows Defender / SmartScreen prompt	That's the upstream `llama-server` binary on first run; allow it once
Port already in use	`turbollm --port 9000`
Slow generation	Ensure GPU offload is high and flash attention / NextN are on for supported models

Privacy

TurboLLM is offline-first: core local use needs no account, no backend, and no internet. No analytics or telemetry are collected. Your prompts, chats, files, and keys never leave your machine.

License

Source-available under the Functional Source License 1.1 (Apache-2.0 future grant) — SPDX FSL-1.1-ALv2. Free for personal use, internal business use, education, and research; the only restriction is shipping a competing product.