Anthropic API (Claude)

LLM API · by Anthropic · official site

What it actually does

Anthropic’s API provides programmatic access to the Claude family of large language models. As of mid-2026 the available tiers are Opus 4, Sonnet 4.5, and Haiku (with occasional minor version bumps). All models accept text inputs and emit text; Opus 4 and Sonnet 4.5 also accept images and PDFs as inputs. The API supports streaming, tool use (function calling), system prompts, and a retrieval‑augmented generation helper (document grounding via uploaded files). Opus 4 additionally exposes an “extended thinking” mode that allows the model to generate internal reasoning tokens before producing its final answer, effectively enabling chain‑of‑thought without visibility into that thinking. Context windows are 200K tokens for all tiers, though only Opus 4 reliably uses the full context for deep multi‑hop reasoning.

Who it’s for

Founders shipping production products that require a language model to behave predictably, follow instructions precisely, and not produce harmful or off‑brand output. Opus 4 targets applications where reasoning depth is paramount: legal document analysis, research summarization, complex code generation, financial modeling. Sonnet 4.5 is the default for most production workloads: customer‑facing chatbots, content generation, data extraction pipelines. Haiku is for cost‑sensitive, high‑volume tasks: classification, sentiment analysis, simple Q&A, real‑time streaming responses where latency matters more than nuance.

What works

Instruction adherence and formatting control. Claude is broadly the most consistent model at following explicit formatting instructions (JSON, XML, markdown) without additional parsing work. This reduces the fragility of production systems.
Tool use (function calling). The JSON‑based tool schema is clean and the model calls tools with high precision. It also interprets tool results (including errors) more gracefully than many alternatives.
Extended thinking (Opus 4 only). When the complexity budget is high, enabling extended thinking yields noticeable gains in multistep reasoning tasks (e.g., proving mathematical theorems, evaluating logic trees). It also lets you expose the model’s internal reasoning time as a controllable parameter.
Safety guardrails. Constitutional AI tuning produces a model that rarely produces toxic or unhinged output, even under adversarial prompting. For regulated industries (healthcare, finance, legal) this reduces manual review overhead.
Latency for Haiku and Sonnet. Haiku returns under 500ms for moderate token counts; Sonnet remains fast enough for conversational use. Both support streaming with negligible overhead.

What breaks

Extended thinking is unreliable for shallow prompts. Enabling it on simple tasks often adds cost and latency without benefit, and sometimes degrades output quality because the model wastes reasoning tokens on irrelevant paths.
Rate limits can bite without enterprise negotiation. Out‑of‑box limits are generous for prototyping but insufficient for production traffic above roughly 500 RPM (varies by tier). Reserved throughput requires an annual commitment and a sales conversation.
Image and PDF understanding is behind GPT‑4o and Gemini 2.5 Pro. Opus 4 handles clean text in images well but struggles with low‑quality scans, handwritten text, and complex charts. For document extraction, you still need a dedicated OCR layer.
Refusal rate on benign inputs. Claude sometimes refuses to answer harmless prompts that could be interpreted as “sensitive,” particularly when the prompt includes words like “violence” even in a historical context. Tuning system prompt or using the safety parameters only partially mitigates this.
Context window execution degrades after 120K tokens. While advertised as 200K tokens, random‑access information retrieval (e.g., “find the line that contains X”) becomes noticeably inaccurate past 100–120K tokens. Sonnet degrades faster than Opus 4. For long‑document workloads, you should implement chunking and retrieval rather than relying on the full window.

Pricing reality (what you actually pay at usage X)

Anthropic’s published per‑million‑token rates change frequently. As of mid‑2026 the approximate values are:

Model	Input ($/M tokens)	Output ($/M tokens)
Opus 4	$20	$100
Sonnet 4.5	$3.50	$15
Haiku	$0.15	$0.75

Extended thinking counts the thinking tokens as output tokens at the output rate, so a task that uses 2K thinking tokens plus 1K final output tokens is billed for 3K output tokens. Example monthly cost for a typical chatbot: 10M input tokens, 2M output tokens.

Haiku: 10M * $0.15/1M = $1.50 input + 2M * $0.75/1M = $1.50 output → $3/month.
Sonnet 4.5: 10M * $3.50/1M = $35 + 2M * $15/1M = $30 → $65/month.
Opus 4: 10M * $20/1M = $200 + 2M * $100/1M = $200 → $400/month.

Prompt caching reduces input cost by roughly 50% if you cache the system prompt and conversation prefix. Reserved throughput plans offer a volume discount (typically 15–30%) but require a minimum monthly spend. All numbers above are subject to change – check the latest pricing page before committing.

The honest comparison (vs 2–3 alternatives)

OpenAI (GPT‑4.1, GPT‑4o mini). OpenAI offers better multimodal performance (OCR, image understanding, audio input), a broader ecosystem (built‑in voice, DALL·E, code interpreter), and native fine‑tuning for custom models. GPT‑4.1 is slightly less reliable than Claude at strict instruction following, and safety moderation is handled externally (via the moderation API), which adds latency. For pure reasoning depth, extended thinking sometimes puts Opus 4 ahead of GPT‑4.1, but OpenAI’s real‑time voice and vision integration matter more for consumer‑facing products. Google Gemini 2.5 Pro / Flash. Gemini 2.5 Pro matches or beats Sonnet 4.5 on general benchmarks and has a massive 1M token context window that actually works (with linear retrieval). Gemini Flash is cheaper than Haiku ($0.05/$0.20 per million) and faster. Google’s API integrates natively with Vertex AI for enterprise deployments (IAM, private endpoints). The trade‑offs: Gemini’s tool use is less robust than Claude’s, and the safety tuning is less “constitutional” — it can produce unexpected content. For high‑volume, low‑cost workloads, Gemini Flash is currently the better choice. Meta Llama 4 (META API). Llama 4 is open‑weight and available via various providers (Together, Fireworks). It is cheaper (often < $1/M tokens) and you can self‑host. However, instruction‑following and safety are notably worse out of the box, requiring significant fine‑tuning or constrained decoding for production use. Unless you have strong appetite for in‑house model customization, Claude’s API is easier to deploy reliably.

When to use it

Use Claude API when instruction reliability, safety, and controlled output format matter more than multimodal breadth or raw cost per token.

Last verified: 2026-06-08 by kernel.