MIT licensed · Apple Silicon

A harness for running small LLMs on the Mac you already have.

One harness, five backends — Ollama, LM Studio, MLX, llama.cpp, and OpenRouter. Real filesystem and shell tools, approval gates on every action, and parsing that catches tool calls even when a 7B forgets the format.

~/proj — small-harness
# profile: mac-mini-16gb · backend: ollama · model: qwen2.5-coder:7b
small-harness › the auth tests are failing on main, can you find and fix it?
› tool · shell       cargo test auth                approved
› tool · read_file   src/auth.rs                    approved
› tool · apply_patch src/auth.rs (lines 41-48)      approved
› tool · shell       cargo test auth                approved

Fixed. validate_token() was returning Ok(()) for expired tokens
because the expiry check ran after the signature check short-circuited.
Moved the expiry check first. All 14 auth tests pass.

small-harness › /compare openai/gpt-4o-mini
  qwen2.5-coder:7b   ✓ 4 tools · 11.2s · 52 tok/s
  gpt-4o-mini        ✓ 4 tools ·  5.8s · cloud
  diff: same fix; gpt-4o-mini also added a doc comment.

Small models got good. Most tooling hasn't caught up.

Features
01

Local-first

OpenAI-compatible chat completions against Ollama, LM Studio, MLX, or llama.cpp running on your machine.

02

Cloud A/B in one keystroke

/compare any prompt against an OpenRouter model. See if 7B is enough before paying for 400B.

03

Hardware profiles

mac-mini-16gb and mac-studio-32gb ship with preset model and context-size defaults, so you can start without tuning.

04

Real tools, real approvals

Read, write, edit, grep, glob, list-dir, shell, apply-patch — each with a per-tool approval gate and diff preview.

05

Robust tool-call parsing

An inline JSON detector catches tool calls even when small models forget the prescribed format.
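
To make the idea concrete, here is a minimal sketch of one way such a detector could work. It is not the harness's actual parser: the function names are hypothetical, and it assumes tool calls look like {"name": ..., "arguments": ...}. It scans a reply for balanced brace spans, skips braces inside string literals, and returns the first span that mentions a "name" key.

```rust
/// Returns the first balanced `{...}` span in `text` that mentions a
/// `"name"` key, or `None` if there is no such span.
/// Assumption: tool calls look like {"name": "...", "arguments": {...}}.
fn find_inline_tool_call(text: &str) -> Option<&str> {
    let bytes = text.as_bytes();
    let mut start = 0;
    while let Some(offset) = text[start..].find('{') {
        let open = start + offset;
        if let Some(close) = matching_brace(bytes, open) {
            let candidate = &text[open..=close];
            if candidate.contains("\"name\"") {
                return Some(candidate);
            }
        }
        // Not a tool call: retry from the next opening brace.
        start = open + 1;
    }
    None
}

/// Finds the index of the `}` that closes the `{` at `open`, ignoring
/// braces that appear inside JSON string literals.
fn matching_brace(bytes: &[u8], open: usize) -> Option<usize> {
    let mut depth = 0usize;
    let mut in_string = false;
    let mut escaped = false;
    for (i, &b) in bytes.iter().enumerate().skip(open) {
        if in_string {
            if escaped {
                escaped = false;
            } else if b == b'\\' {
                escaped = true;
            } else if b == b'"' {
                in_string = false;
            }
            continue;
        }
        match b {
            b'"' => in_string = true,
            b'{' => depth += 1,
            b'}' => {
                depth -= 1;
                if depth == 0 {
                    return Some(i);
                }
            }
            _ => {}
        }
    }
    None
}
```

A real implementation would hand the returned span to a JSON parser and validate it against the tool schema; the substring check here only narrows down candidates.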

06

Pre-warmed startup

Populates the prompt-eval cache before your first message so the first reply doesn't feel cold.

07

Efficiency mode

Auto-selects tool schemas to fit the context budget and shows you exactly where the tokens went.
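
As a rough illustration of fitting schemas to a budget, the sketch below keeps tools in priority order while they still fit. The ToolSchema type, the select_tools function, and any token counts you pass in are hypothetical; the harness's real selection logic may differ.

```rust
/// A tool schema with an estimated prompt cost (hypothetical type).
struct ToolSchema {
    name: &'static str,
    tokens: usize,
}

/// Walks `tools` in priority order (slice order) and keeps each one that
/// still fits in `budget`. Returns the kept tool names and the tokens used.
fn select_tools(tools: &[ToolSchema], budget: usize) -> (Vec<&'static str>, usize) {
    let mut used = 0;
    let mut kept = Vec::new();
    for tool in tools {
        if used + tool.tokens <= budget {
            used += tool.tokens;
            kept.push(tool.name);
        }
    }
    (kept, used)
}
```

Returning the token total alongside the selection is what makes it possible to report where the budget went.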

08

Streaming output

Tokens stream as they arrive, with grouped tool-call display so the transcript stays readable.

09

Sessions you can resume

Append-only JSONL logs. List, resume, or export any past conversation from the prompt.
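
For readers unfamiliar with the format, here is a minimal sketch of an append-only JSONL writer using only the standard library. The role/content field names and the function names are assumptions for illustration, not the harness's actual log schema.

```rust
use std::fs::OpenOptions;
use std::io::{self, Write};
use std::path::Path;

/// Escapes a string for use inside a JSON string literal.
fn escape_json(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            '\r' => out.push_str("\\r"),
            '\t' => out.push_str("\\t"),
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out
}

/// Formats one event as a single-line JSON object (hypothetical schema).
fn session_line(role: &str, content: &str) -> String {
    format!(
        "{{\"role\":\"{}\",\"content\":\"{}\"}}",
        escape_json(role),
        escape_json(content)
    )
}

/// Appends one event to the log. Existing lines are never rewritten.
fn append_event(path: &Path, role: &str, content: &str) -> io::Result<()> {
    let mut file = OpenOptions::new().create(true).append(true).open(path)?;
    writeln!(file, "{}", session_line(role, content))
}
```

Because every event is one self-contained line, resuming a session amounts to reading the file top to bottom, and a crash can at worst leave one truncated final line.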

Five backends, same harness.

Backends
Backend     Default URL           Best for
Ollama      localhost:11434/v1    Easiest setup; mature tool-call templates.
LM Studio   localhost:1234/v1     GUI model browser; explicit load and unload.
MLX         localhost:8080/v1     Fastest inference on Apple Silicon.
llama.cpp   localhost:8080/v1     Direct GGUF serving for full control.
OpenRouter  openrouter.ai/api/v1  Cloud A/B comparison and frontier models.
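
Since every backend speaks the OpenAI-compatible API, switching between them mostly means switching the base URL. The sketch below maps each backend to its default URL and derives the chat completions endpoint. The Backend enum and its method names are hypothetical, and the schemes are an assumption (http for local servers, https for OpenRouter) because the defaults above are listed without one.

```rust
/// The five backends listed above (hypothetical type for illustration).
#[derive(Debug, Clone, Copy, PartialEq)]
enum Backend {
    Ollama,
    LmStudio,
    Mlx,
    LlamaCpp,
    OpenRouter,
}

impl Backend {
    /// Default base URL per backend. The schemes are assumed, not documented.
    fn default_base_url(self) -> &'static str {
        match self {
            Backend::Ollama => "http://localhost:11434/v1",
            Backend::LmStudio => "http://localhost:1234/v1",
            Backend::Mlx => "http://localhost:8080/v1",
            Backend::LlamaCpp => "http://localhost:8080/v1",
            Backend::OpenRouter => "https://openrouter.ai/api/v1",
        }
    }

    /// The OpenAI-compatible chat completions endpoint under the base URL.
    fn chat_completions_url(self) -> String {
        format!("{}/chat/completions", self.default_base_url())
    }
}
```

Note that MLX and llama.cpp share port 8080 by default, so running both at once means moving one of them to a different port.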

Install in a minute.

Quick start

Requirements

  1. Rust 1.75 or newer (rustup)
  2. A local backend — Ollama is the gentlest start
  3. An OPENROUTER_API_KEY if you want /compare

# clone, configure, run
$ git clone https://github.com/GetSmallAI/SmallHarness.git
$ cd SmallHarness
$ cp .env.example .env
$ cargo run --release

Slash commands

/backend /profile /model /tools /compare /session /sessions /resume /export /doctor /bench /eval /new /help