claude code · claude agent sdk · for open models

Claude Code, for open source.
Any model you can serve, on any provider.

A Claude-Code-style coding agent in your terminal, and Anthropic's claude-agent-sdk surface as a library — both running on Llama, Qwen, DeepSeek, GLM, or anything behind Ollama, vLLM, Groq, or your own GPU box. The migration is one import.

agent.py

-from claude_agent_sdk import query, ClaudeAgentOptions, tool

+from

↑ That's the whole diff. Every canonical Claude SDK example runs verbatim: the surface is Anthropic-shaped, and the wire format underneath is OpenAI-compat or Ollama.

Quickstart → · Read the docs

The mantis terminal ships in the same install — a Claude-Code-style coding agent driving the open model you choose.

831 tests · Python 3.11–3.13 · Apache-2.0 · v1.21.0

One kwarg between

Ollama · vLLM · llama.cpp · TGI · Together · Fireworks · Groq · OpenRouter · Cerebras · OpenAI · Gemini · Modal

two ways in · one pip install

A terminal to code in, and a library to build with.

The mantis terminal

Point it at any directory. It reads, writes, edits, greps, and runs shell commands — Claude Code's feel, driving your local Ollama, your vLLM box, or a hosted endpoint. The input stays pinned to the bottom; replies render as Markdown; file edits come back as real, line-numbered diffs.

mantis — ~/code/todo-api

            ▄▀▄▀
           ▄█▀
        ▄██▀▀█▀
    ▄█ ▄███▀▀
 ▄▄██▀▀██▀▀▀▀▀
 ▀▀ █  █▀ ▀▄
 ▄▄▀  ▄▀   ▀▄

Mantis Code v1.21.0

qwen2.5-coder:7b · Ollama (local)

~/code/todo-api

›build me a fastapi todo app

⚒ Edit app/main.py +12 -0

1+ from fastapi import FastAPI

2+ app = FastAPI()

3+ todos: list[str] = []

…

● Done — run it with uvicorn app.main:app --reload.

✻ Undulating…(3s · esc to interrupt)

The Python library

The same engine, as an SDK. A tool-calling loop is a few lines away — and the exact same script runs against Together, Fireworks, vLLM, or Groq by changing one string.

quickstart.py

import asyncio
from mantis_agent import query, ClaudeAgentOptions, tool, AssistantMessage

@tool
async def get_weather(city: str) -> str:
    """Get the current weather for a city."""
    return f"{city}: 67°F"

async def main():
    async for msg in query(
        prompt="What's the weather in SF?",
        options=ClaudeAgentOptions(
            model="qwen2.5:1.5b",   # routes to local Ollama automatically
            tools=[get_weather],
            max_turns=5,
        ),
    ):
        if isinstance(msg, AssistantMessage):
            for block in msg.content:
                if hasattr(block, "text"):
                    print(block.text)

asyncio.run(main())

# same script, three backends — change one line
options = ClaudeAgentOptions(model="qwen2.5:7b")                       # → local Ollama
options = ClaudeAgentOptions(model="Qwen/Qwen2.5-72B-Instruct-Turbo")  # → Together
options = ClaudeAgentOptions(model="llama-3.3-70b-versatile",
                            backend="https://api.groq.com/openai/v1")  # → Groq

the whole surface

Streaming dispatch, hooks, permissions, MCP, sub-agents, sessions. None of the OSS alternatives ship the whole set.

one api, many backends

Route from the model name

qwen3:8b → Ollama. Qwen/… → Together. gpt-4o-mini → OpenAI. The URL is inferred from the model name shape; pass backend= to override.

real tool use

Native, prompted, or grammar-constrained

Native tools[] where supported, prompt-engineered <tool_call> XML where not, grammar-constrained JSON where the server enforces it. Chosen per model, automatically.

full mcp

Four transports, both directions

In-process via create_sdk_mcp_server, plus stdio / sse / http. Elicitation lets servers prompt the user; sampling lets them call back into the model.

sessions

Survive restarts, fork, resume

JSONL transcript persistence, fork from any checkpoint, resume from an arbitrary one, auto-compaction at a token threshold.

sub-agents + plugins

Compose agents as tools

Plugin(tools=, system_prompt_addition=, hooks=) merges at session start. Rewrite tool args before dispatch with PermissionResultAllow(updated_input=…).

budget

A ceiling on every run

Per-model pricing table, max_usd and max_turns ceilings, BudgetExceededError, total_cost_usd on every ResultMessage.
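
As a rough illustration of how such ceilings can be enforced, the sketch below tracks cost per turn against a pricing table. It is not the library's implementation: `BudgetTracker` is a hypothetical class, `BudgetExceededError` here is a local stand-in for the library's exception, the prices are placeholders, and raising on the turn ceiling is an assumption.

```python
from dataclasses import dataclass

# Placeholder prices in USD per million tokens; not real provider pricing.
PRICING = {"example-model": {"input": 0.50, "output": 1.50}}


class BudgetExceededError(RuntimeError):
    """Local stand-in for the library's exception of the same name."""


@dataclass
class BudgetTracker:
    max_usd: float
    max_turns: int
    total_cost_usd: float = 0.0
    turns: int = 0

    def record_turn(self, model: str, input_tokens: int, output_tokens: int) -> None:
        """Add one turn's cost, then raise if either ceiling is crossed."""
        price = PRICING[model]
        self.total_cost_usd += (
            input_tokens * price["input"] + output_tokens * price["output"]
        ) / 1_000_000
        self.turns += 1
        if self.total_cost_usd > self.max_usd:
            raise BudgetExceededError(
                f"spent ${self.total_cost_usd:.4f}, ceiling is ${self.max_usd}"
            )
        if self.turns > self.max_turns:  # assumed behavior for the turn ceiling
            raise BudgetExceededError(
                f"used {self.turns} turns, ceiling is {self.max_turns}"
            )
```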

universal tool use

Every model gets tool use — through whichever path it can actually take.

A capability table (30+ models) picks the path per model, automatically. You write one @tool; the library figures out how the model in front of it can call it.

A · Native

OpenAI-compat tools[]. The fast path for anything that speaks function-calling — Qwen 2.5+, Llama 3.1+, gpt-oss.

B · Prompted

Prompt-engineered <tool_call> XML, parsed back out. Brings tool use to Llama 2, Mistral 7B, and older Qwens that never learned the schema.

C · Constrained

Grammar-constrained JSON when the server can enforce it. The model physically cannot emit an invalid call.
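
The prompted path depends on recovering a structured call from free text. Below is a minimal sketch of that parsing step, assuming the model wraps a JSON object with `name` and `arguments` keys inside `<tool_call>` tags. That payload shape is a common convention, not something this page specifies, and `parse_tool_calls` is a hypothetical helper.

```python
import json
import re

TOOL_CALL_RE = re.compile(r"<tool_call>\s*(.*?)\s*</tool_call>", re.DOTALL)


def parse_tool_calls(text: str) -> list[dict]:
    """Extract well-formed tool calls from model output, skipping malformed ones."""
    calls = []
    for payload in TOOL_CALL_RE.findall(text):
        try:
            call = json.loads(payload)
        except json.JSONDecodeError:
            continue  # malformed JSON: skip it rather than crash the agent loop
        if isinstance(call, dict) and isinstance(call.get("name"), str):
            calls.append({"name": call["name"], "arguments": call.get("arguments", {})})
    return calls
```

Skipping malformed payloads is one design choice among several; a real loop might instead feed the parse error back to the model and ask it to retry.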

ranked · picked by where they run

Pick the highest-ranked model that fits your hardware.

Model	Runs	model=	Notable
Kimi K2.6	cloud	moonshotai/Kimi-K2.6-Instruct	#1 open-weights GPQA
Qwen3 235B-A22B	cloud · 64 GB+	Qwen/Qwen3-235B-A22B-Instruct-Turbo	Apache 2.0, broad leader
GLM-5	cloud	zai-org/GLM-5	Best open Arena Elo
MiniMax M2.5	cloud	minimaxai/MiniMax-M2.5	80.2% SWE-bench
DeepSeek-V3.2	cloud · 80 GB+	deepseek-ai/DeepSeek-V3.2	Top general-purpose OSS
gpt-oss-120b	cloud · 80 GB	gpt-oss:120b	OpenAI open, ~o4-mini class
Qwen2.5-Coder 7B	8 GB local	qwen2.5-coder:7b	Strongest small coder
qwen2.5:1.5b	4 GB local	qwen2.5:1.5b	CPU default, tool-capable

Full ranked catalog — 20 hosted + 10 CPU-friendly tiers — in Models & backends.

observability, shipped

A full span tree of every run — tokens and cost on the root.

agent.run → agent.turn → llm.call + tool.call, with per-model usage on the root span. Swap InMemoryTracer for OTelTracer to ship the same spans to your pipeline. Tool spans record input keys, never values — the safe choice is the only choice.

tracing.py

import asyncio
from mantis_agent import Agent, InMemoryTracer

async def main():
    tracer = InMemoryTracer()
    agent  = Agent(model="qwen2.5:7b", tools=[...], tracer=tracer)
    await agent.run(...)        # await needs an async context in a script

    tracer.summary()            # turns / tokens / cost_usd on the root span
    tracer.write_jsonl("t.jsonl")

asyncio.run(main())

# ship the same spans to Datadog / Honeycomb / Tempo — zero extra code
from mantis_agent import OTelTracer
agent = Agent(model="qwen2.5:7b", tracer=OTelTracer(service_name="my-agent"))
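
For a feel of what the span tree holds, here is a small self-contained sketch of the agent.run → agent.turn → llm.call + tool.call shape, with tool inputs recorded by key only. The `Span` class and `tool_span` helper are hypothetical, and the token counts are illustrative; the library's tracer types are not reproduced here.

```python
from dataclasses import dataclass, field


@dataclass
class Span:
    name: str
    attrs: dict = field(default_factory=dict)
    children: list["Span"] = field(default_factory=list)

    def child(self, name: str, **attrs) -> "Span":
        """Create a span nested under this one."""
        span = Span(name, attrs)
        self.children.append(span)
        return span


def tool_span(parent: Span, tool_name: str, tool_input: dict) -> Span:
    """Record which arguments were passed, but never their values."""
    return parent.child("tool.call", tool=tool_name, input_keys=sorted(tool_input))


root = Span("agent.run")
turn = root.child("agent.turn")
turn.child("llm.call", input_tokens=120, output_tokens=18)  # illustrative numbers
tool_span(turn, "get_weather", {"city": "SF"})

# Roll token usage from every llm.call up to the root span.
root.attrs["total_tokens"] = sum(
    c.attrs.get("input_tokens", 0) + c.attrs.get("output_tokens", 0)
    for t in root.children
    for c in t.children
)
```

Storing only `input_keys` means a trace can show that `city` was passed without ever containing the value itself.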

does it actually work?

On a fresh machine, no GPU. Works on the first try.

pip install mantis-agent-sdk
mantis-agent setup-local     # pulls a CPU-friendly model, smoke-tests
python my_agent.py           # two tools, a 5-turn task — first try

Change one word — model= — and the same script runs against Together, Fireworks, vLLM, llama.cpp, or Groq.

Get started → · Star on GitHub
