Anthropic API (Claude)

LLM API · by Anthropic · official site

What it actually does

Anthropic’s API provides programmatic access to the Claude family of large language models. As of mid-2026 the available tiers are Opus 4, Sonnet 4.5, and Haiku (with occasional minor version bumps). All models accept text inputs and emit text; Opus 4 and Sonnet 4.5 also accept images and PDFs as inputs. The API supports streaming, tool use (function calling), system prompts, and a retrieval‑augmented generation helper (document grounding via uploaded files). Opus 4 additionally exposes an “extended thinking” mode that allows the model to generate internal reasoning tokens before producing its final answer, effectively enabling chain‑of‑thought without visibility into that thinking. Context windows are 200K tokens for all tiers, though only Opus 4 reliably uses the full context for deep multi‑hop reasoning.

Who it’s for

Founders shipping production products that require a language model to behave predictably, follow instructions precisely, and not produce harmful or off‑brand output. Opus 4 targets applications where reasoning depth is paramount: legal document analysis, research summarization, complex code generation, financial modeling. Sonnet 4.5 is the default for most production workloads: customer‑facing chatbots, content generation, data extraction pipelines. Haiku is for cost‑sensitive, high‑volume tasks: classification, sentiment analysis, simple Q&A, real‑time streaming responses where latency matters more than nuance.

What works

What breaks

Pricing reality (what you actually pay at usage X)

Anthropic’s published per‑million‑token rates change frequently. As of mid‑2026 the approximate values are:

ModelInput ($/M tokens)Output ($/M tokens)
Opus 4$20$100
Sonnet 4.5$3.50$15
Haiku$0.15$0.75
Extended thinking counts the thinking tokens as output tokens at the output rate, so a task that uses 2K thinking tokens plus 1K final output tokens is billed for 3K output tokens. Example monthly cost for a typical chatbot: 10M input tokens, 2M output tokens. Prompt caching reduces input cost by roughly 50% if you cache the system prompt and conversation prefix. Reserved throughput plans offer a volume discount (typically 15–30%) but require a minimum monthly spend. All numbers above are subject to change – check the latest pricing page before committing.

The honest comparison (vs 2–3 alternatives)

OpenAI (GPT‑4.1, GPT‑4o mini). OpenAI offers better multimodal performance (OCR, image understanding, audio input), a broader ecosystem (built‑in voice, DALL·E, code interpreter), and native fine‑tuning for custom models. GPT‑4.1 is slightly less reliable than Claude at strict instruction following, and safety moderation is handled externally (via the moderation API), which adds latency. For pure reasoning depth, extended thinking sometimes puts Opus 4 ahead of GPT‑4.1, but OpenAI’s real‑time voice and vision integration matter more for consumer‑facing products. Google Gemini 2.5 Pro / Flash. Gemini 2.5 Pro matches or beats Sonnet 4.5 on general benchmarks and has a massive 1M token context window that actually works (with linear retrieval). Gemini Flash is cheaper than Haiku ($0.05/$0.20 per million) and faster. Google’s API integrates natively with Vertex AI for enterprise deployments (IAM, private endpoints). The trade‑offs: Gemini’s tool use is less robust than Claude’s, and the safety tuning is less “constitutional” — it can produce unexpected content. For high‑volume, low‑cost workloads, Gemini Flash is currently the better choice. Meta Llama 4 (META API). Llama 4 is open‑weight and available via various providers (Together, Fireworks). It is cheaper (often < $1/M tokens) and you can self‑host. However, instruction‑following and safety are notably worse out of the box, requiring significant fine‑tuning or constrained decoding for production use. Unless you have strong appetite for in‑house model customization, Claude’s API is easier to deploy reliably.

When to use it

Use Claude API when instruction reliability, safety, and controlled output format matter more than multimodal breadth or raw cost per token.

Last verified: 2026-06-08 by kernel.