Claude Code, for open source.
Any model you can serve, on any provider.
A Claude-Code-style coding agent in your terminal, and Anthropic's claude-agent-sdk surface as a library — both running on Llama, Qwen, DeepSeek, GLM, or anything behind Ollama, vLLM, Groq, or your own GPU box. The migration is one import.
That import is the whole diff. Apart from it, every canonical Claude SDK example runs verbatim: the surface is Anthropic-shaped, and the wire format underneath is OpenAI-compatible or Ollama.
The mantis terminal ships in the same install — a Claude-Code-style coding agent driving the open model you choose.
A terminal to code in, and a library to build with.
The mantis terminal
Point it at any directory. It reads, writes, edits, greps, and runs shell commands — Claude Code's feel, driving your local Ollama, your vLLM box, or a hosted endpoint. The input stays pinned to the bottom; replies render as Markdown; file edits come back as real, line-numbered diffs.
The Python library
The same engine, as an SDK. A tool-calling loop is a few lines away — and the exact same script runs against Together, Fireworks, vLLM, or Groq by changing one string.
    import asyncio

    from mantis_agent import query, ClaudeAgentOptions, tool, AssistantMessage


    @tool
    async def get_weather(city: str) -> str:
        """Get the current weather for a city."""
        return f"{city}: 67°F"


    async def main():
        async for msg in query(
            prompt="What's the weather in SF?",
            options=ClaudeAgentOptions(
                model="qwen2.5:1.5b",  # routes to local Ollama automatically
                tools=[get_weather],
                max_turns=5,
            ),
        ):
            if isinstance(msg, AssistantMessage):
                for block in msg.content:
                    if hasattr(block, "text"):
                        print(block.text)


    asyncio.run(main())

    # same script, three backends: change one line
    options = ClaudeAgentOptions(model="qwen2.5:7b")                       # → local Ollama
    options = ClaudeAgentOptions(model="Qwen/Qwen2.5-72B-Instruct-Turbo")  # → Together
    options = ClaudeAgentOptions(model="llama-3.3-70b-versatile",
                                 backend="https://api.groq.com/openai/v1")  # → Groq

Streaming dispatch, hooks, permissions, MCP, sub-agents, sessions. None of the OSS alternatives ship the whole set.
Route from the model name
qwen3:8b → Ollama. Qwen/… → Together. gpt-4o-mini → OpenAI. The URL is inferred from the model name shape; pass backend= to override.
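To make the routing idea concrete, here is a minimal sketch of inferring a base URL from the shape of a model name. It is not the library's implementation: the function name `infer_backend`, the matching rules, and their order are assumptions chosen to reproduce the three examples above, and the URLs are the providers' usual OpenAI-compatible endpoints.

```python
from typing import Optional


def infer_backend(model: str, backend: Optional[str] = None) -> str:
    """Guess a base URL from the model name (illustrative rules, not mantis internals)."""
    if backend is not None:
        # An explicit backend= always overrides inference.
        return backend
    if "/" in model:
        # org/name shape, e.g. "Qwen/..." -> Together
        return "https://api.together.xyz/v1"
    if ":" in model:
        # name:tag shape, e.g. "qwen3:8b" -> local Ollama (OpenAI-compatible path)
        return "http://localhost:11434/v1"
    if model.startswith("gpt-"):
        # bare OpenAI model names, e.g. "gpt-4o-mini"
        return "https://api.openai.com/v1"
    raise ValueError(f"cannot infer a backend for {model!r}; pass backend=")
```

Checking for `:` before the `gpt-` prefix means a tagged name such as `gpt-oss:120b` is treated as a local Ollama model, while a name with no recognizable shape fails loudly and asks for `backend=`.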
Native, prompted, or grammar-constrained
Native tools[] where supported, prompt-engineered <tool_call> XML where not, grammar-constrained JSON where the server enforces it. Chosen per model, automatically.
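One way to picture the per-model choice is a lookup table with a fallback. The sketch below is an assumption throughout: the `ToolPath` enum, the `CAPABILITIES` entries, and the precedence (native first, grammar when the server can enforce it, prompted otherwise) are hypothetical and are not taken from the library's real capability table.

```python
from enum import Enum


class ToolPath(Enum):
    NATIVE = "native"      # OpenAI-compat tools[]
    PROMPTED = "prompted"  # <tool_call> XML described in the prompt
    GRAMMAR = "grammar"    # JSON constrained by a server-side grammar


# Hypothetical capability entries keyed by model-name prefix.
CAPABILITIES = {
    "qwen2.5": ToolPath.NATIVE,
    "llama3.1": ToolPath.NATIVE,
    "llama2": ToolPath.PROMPTED,
    "mistral": ToolPath.PROMPTED,
}


def pick_tool_path(model: str, server_supports_grammar: bool = False) -> ToolPath:
    """Return the tool-calling path for a model under the assumed precedence."""
    name = model.lower()
    for prefix, path in CAPABILITIES.items():
        if name.startswith(prefix):
            # Upgrade prompted models to grammar-constrained output when possible.
            if path is ToolPath.PROMPTED and server_supports_grammar:
                return ToolPath.GRAMMAR
            return path
    # Unknown models: take the safest path the server allows.
    return ToolPath.GRAMMAR if server_supports_grammar else ToolPath.PROMPTED
```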
Four transports, both directions
MCP servers run in-process via create_sdk_mcp_server, or connect over stdio / sse / http. Elicitation lets servers prompt the user; sampling lets them call back into the model.
Survive restarts, fork, resume
JSONL transcript persistence, fork from any checkpoint, resume from an arbitrary one, auto-compaction at a token threshold.
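As a rough illustration of what JSONL persistence with forking can look like, the sketch below stores one message per line and forks by copying a prefix of the transcript. The helper names (`append_message`, `load_transcript`, `fork`) and the message shape are assumptions for illustration, not the library's session API, and auto-compaction is left out.

```python
import json
from pathlib import Path


def append_message(path: Path, message: dict) -> None:
    """Append one message as a single JSON line, so a crash loses at most one line."""
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(message) + "\n")


def load_transcript(path: Path) -> list:
    """Resume: read every stored message back in order."""
    with path.open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def fork(src: Path, dst: Path, checkpoint: int) -> None:
    """Start a new transcript from the first `checkpoint` messages of `src`."""
    with dst.open("w", encoding="utf-8") as f:
        for message in load_transcript(src)[:checkpoint]:
            f.write(json.dumps(message) + "\n")
```

Because the fork is a separate file, the original session keeps growing independently of any branch taken from it.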
Compose agents as tools
Plugin(tools=, system_prompt_addition=, hooks=) merges at session start. Rewrite tool args before dispatch with PermissionResultAllow(updated_input=…).
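The argument-rewriting idea can be sketched with a permission callback that returns an allow result carrying modified input. To keep the snippet runnable on its own, `PermissionResultAllow` and `PermissionResultDeny` are defined here as local stand-ins; the callback signature, the `run_shell` tool name, and the `timeout` argument are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PermissionResultAllow:
    """Local stand-in for the class named above."""
    updated_input: Optional[dict] = None


@dataclass
class PermissionResultDeny:
    """Hypothetical counterpart used to refuse a call."""
    message: str = ""


def can_use_tool(tool_name: str, tool_input: dict):
    """Decide whether a tool call may run, optionally rewriting its arguments."""
    if tool_name == "run_shell":  # hypothetical tool name
        command = tool_input.get("command", "")
        if "rm -rf" in command:
            return PermissionResultDeny(message="destructive command blocked")
        # Allow, but force a timeout onto every shell call before dispatch.
        return PermissionResultAllow(updated_input={**tool_input, "timeout": 30})
    # Everything else runs with its original arguments.
    return PermissionResultAllow()
```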
A ceiling on every run
Per-model pricing table, max_usd and max_turns ceilings, BudgetExceededError, total_cost_usd on every ResultMessage.
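A cost ceiling of this kind boils down to accumulating per-call cost and raising once a limit is crossed. The sketch below assumes a hypothetical `Budget` helper and a made-up price for `example-model`; `BudgetExceededError` is a local stand-in, and none of this is the library's actual accounting code.

```python
class BudgetExceededError(RuntimeError):
    """Local stand-in for the error named above."""


# Hypothetical prices in USD per million tokens: (input, output).
PRICES = {"example-model": (0.20, 0.60)}


class Budget:
    def __init__(self, max_usd: float, max_turns: int):
        self.max_usd = max_usd
        self.max_turns = max_turns
        self.total_cost_usd = 0.0
        self.turns = 0

    def charge(self, model: str, input_tokens: int, output_tokens: int) -> None:
        """Add one LLM call to the running totals and enforce both ceilings."""
        price_in, price_out = PRICES[model]
        self.total_cost_usd += (input_tokens * price_in + output_tokens * price_out) / 1_000_000
        self.turns += 1
        if self.total_cost_usd > self.max_usd:
            raise BudgetExceededError(f"spent ${self.total_cost_usd:.4f} of ${self.max_usd:.4f}")
        if self.turns > self.max_turns:
            raise BudgetExceededError(f"used {self.turns} of {self.max_turns} turns")
```

Checking after each call means a run can overshoot by at most one call's cost before it stops.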
Every model gets tool use — through whichever path it can actually take.
A capability table (30+ models) picks the path per model, automatically. You write one @tool; the library figures out how the model in front of it can call it.
OpenAI-compat tools[]. The fast path for anything that speaks function-calling — Qwen 2.5+, Llama 3.1+, gpt-oss.
Prompt-engineered <tool_call> XML, parsed back out. Brings tool use to Llama 2, Mistral 7B, and older Qwens that never learned the schema.
Grammar-constrained JSON when the server can enforce it. The model physically cannot emit an invalid call.
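For the prompted path, "parsed back out" could look roughly like the following. The exact wire format is an assumption: this sketch expects a JSON object with `name` and `arguments` keys inside each `<tool_call>` tag, and `parse_tool_calls` is a hypothetical helper rather than a function from the library.

```python
import json
import re

# Non-greedy match so several calls in one reply are captured separately.
TOOL_CALL_RE = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)


def parse_tool_calls(text: str) -> list:
    """Extract well-formed tool calls from model output, skipping malformed ones."""
    calls = []
    for raw in TOOL_CALL_RE.findall(text):
        try:
            payload = json.loads(raw)
        except json.JSONDecodeError:
            continue  # the model emitted something that is not valid JSON
        if isinstance(payload, dict) and "name" in payload:
            calls.append({
                "name": payload["name"],
                "arguments": payload.get("arguments", {}),
            })
    return calls
```

Malformed calls are dropped silently here; a real loop would more likely feed the parse error back to the model so it can retry.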
Pick the highest-ranked model that fits your hardware.
| Model | Runs | model= | Notable |
|---|---|---|---|
| Kimi K2.6 | cloud | moonshotai/Kimi-K2.6-Instruct | #1 open-weights GPQA |
| Qwen3 235B-A22B | cloud · 64 GB+ | Qwen/Qwen3-235B-A22B-Instruct-Turbo | Apache 2.0, broad leader |
| GLM-5 | cloud | zai-org/GLM-5 | Best open Arena Elo |
| MiniMax M2.5 | cloud | minimaxai/MiniMax-M2.5 | 80.2% SWE-bench |
| DeepSeek-V3.2 | cloud · 80 GB+ | deepseek-ai/DeepSeek-V3.2 | Top general-purpose OSS |
| gpt-oss-120b | cloud · 80 GB | gpt-oss:120b | OpenAI open, ~o4-mini class |
| Qwen2.5-Coder 7B | 8 GB local | qwen2.5-coder:7b | Strongest small coder |
| qwen2.5:1.5b | 4 GB local | qwen2.5:1.5b | CPU default, tool-capable |
Full ranked catalog — 20 hosted + 10 CPU-friendly tiers — in Models & backends.
A full span tree of every run — tokens and cost on the root.
agent.run → agent.turn → llm.call + tool.call, with per-model usage on the root span. Swap InMemoryTracer for OTelTracer to ship the same spans to your pipeline. Tool spans record input keys, never values — the safe choice is the only choice.
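The span shape described above can be pictured with a tiny tree structure. This is a sketch of the idea only: the `Span` class, `tool_span`, and `render` are hypothetical and do not mirror the library's tracer classes. The point it shows is that a tool span stores the names of the arguments and drops their values.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Span:
    name: str
    attributes: dict = field(default_factory=dict)
    children: List["Span"] = field(default_factory=list)

    def child(self, name: str, **attributes) -> "Span":
        """Create a nested span and attach it to this one."""
        span = Span(name, attributes)
        self.children.append(span)
        return span


def tool_span(parent: Span, tool_name: str, tool_input: dict) -> Span:
    # Record which arguments were passed, never what they contained.
    return parent.child("tool.call", tool=tool_name, input_keys=sorted(tool_input))


def render(span: Span, depth: int = 0) -> list:
    """Flatten the tree into indented lines, one per span."""
    lines = ["  " * depth + span.name]
    for c in span.children:
        lines.extend(render(c, depth + 1))
    return lines
```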
    from mantis_agent import Agent, InMemoryTracer

    tracer = InMemoryTracer()
    agent = Agent(model="qwen2.5:7b", tools=[...], tracer=tracer)
    await agent.run(...)
    tracer.summary()  # turns / tokens / cost_usd on the root span
    tracer.write_jsonl("t.jsonl")

    # ship the same spans to Datadog / Honeycomb / Tempo, zero extra code
    from mantis_agent import OTelTracer

    agent = Agent(model="qwen2.5:7b", tracer=OTelTracer(service_name="my-agent"))

On a fresh machine, no GPU. Works on the first try.
    pip install mantis-agent-sdk
    mantis-agent setup-local   # pulls a CPU-friendly model, smoke-tests
    python my_agent.py         # two tools, a 5-turn task, first try

Change one word, model=, and the same script runs against Together, Fireworks, vLLM, llama.cpp, or Groq.