Together AI
What it does
Together AI is a cloud platform that provides inference APIs for open-source large language models (LLMs) and related generative AI models. As of mid-2026, it offers serverless (pay-per-token) and dedicated (reserved instance) endpoints for a catalog of 50+ models, including Llama 4 family (8B, 70B, 405B), Mixtral 8x22B, Qwen 2.5, DeepSeek V3, Gemma 2, and several fine-tuned variants (e.g., CodeLlama, Mistral-tuned). The platform also supports image generation via Stable Diffusion and PixArt, and offers text-to-speech and embedding APIs (varies). Fine-tuning services—both LoRA and full-weight training—are available on rented GPU clusters (A100-80G, H100). All endpoints expose an OpenAI-compatible API, making drop-in integration straightforward for existing codebases.
Who it's for
Together AI primarily serves developers and engineering teams who want the flexibility of open models without the operational overhead of self-hosting. It is a natural fit for:
- Teams building AI features that require model access at scale, but who want to avoid vendor lock-in to a single commercial API provider.
- Workloads that benefit from model transparency—auditability, reproducibility, or custom fine-tuning on proprietary data.
- Cost-sensitive applications where open models (e.g., Llama 4-70B) deliver acceptable quality at lower per-token cost than equivalents like GPT-4o or Claude 3.5 Opus.
- Rapid prototyping: quickly test multiple open models side-by-side without provisioning hardware.
What works
Performance on popular models. Llama 4-70B and Mixtral 8x22B run with consistent sub-second time-to-first-token (TTFT) for short prompts (under 2K tokens) on serverless endpoints. Streaming is smooth, and concurrency handling improved significantly over 2024–2025 – a single shared endpoint can sustain dozens of simultaneous requests without noticeable degradation. OpenAI-compatible API. The drop-in support for the/v1/chat/completions format, streaming, function calling, and JSON mode reduces migration friction. Most existing code using the OpenAI SDK works with a simple base URL change, though advanced features like parallel tool calls or structured output vary per model.
Fine-tuning pipeline. Together provides a UI and CLI to upload datasets, launch LoRA or full fine-tuning, and deploy resulting adapters as endpoints. Training speeds are competitive with other cloud GPU providers, and the ability to deploy directly to the same inference infrastructure is convenient. The fine-tuned models benefit from optimized kernels (e.g., FlashAttention-3) out of the box.
Model availability. The catalog includes many recent open releases within days of publication. DeepSeek V3 and Llama 4 were available on Together within 24 hours of their public weights. For teams that want to stay on the bleeding edge without managing download and serving themselves, this is a strong advantage over commercial APIs that curate only their own models.
What breaks
Inconsistent model behavior. Not all open models are equally tuned for instruction-following or safety. A prompt that works flawlessly on Llama 4-70B may produce unexpected output on a newer Qwen variant due to different tokenization, system prompt formatting, or alignment strategies. Without per-model adjustment, reliability dips for production use. Latency and throughput for large models. The 405B model variants require dedicated endpoints for acceptable performance; on the shared serverless tier, tokens-per-second drops significantly during peak hours (60–70 TPS vs. 120+ TPS for Llama 4-70B). Dedicated endpoints solve this but shift cost to a fixed hourly rate that may be inefficient for sporadic traffic. API reliability fluctuations. Together’s historical uptime is comparable to smaller cloud providers: generally above 99.5%, but occasional multi-minute outages occur without clear pre-warning. Their status page is reactive, and support response times on the standard tier can exceed several hours. Commercial APIs (OpenAI, Anthropic, Google) provide tighter SLAs for paid tiers. Advanced features fragmented across models. Structured output (JSON mode), function calling, and parallel tool calls are implemented inconsistently across the catalog. Some models lack support entirely. The platform does not abstract these differences—it falls on the developer to test each model’s capabilities.Pricing reality
As of June 2026, Together AI’s serverless pricing for Llama 4-70B is approximately $0.85 per 1M input tokens and $1.10 per 1M output tokens (varies by model, region, and time of day). Smaller models like Llama 4-8B run at $0.15/$0.20 per 1M tokens. Dedicated endpoints start at $1.50/hour for a single A100, scaling to $6.00/hour for an 8×H100 node (varies by region and supply). Fine-tuning costs vary by model size, number of tokens, and GPU type—for example, a full fine-tune of Llama 4-8B on a 200K-token dataset costs roughly $50 (varies). Volume discounts (10–25% off) are available for monthly commitments exceeding $10,000.
Compared to OpenAI’s GPT-4o ($2.50/$10.00 per 1M tokens) and Anthropic’s Claude 3.5 Sonnet ($3.00/$15.00 per 1M tokens), Llama 4-70B through Together is 2–3x cheaper per output token. But quality differences in reasoning, factuality, and safety matter: for many task types, the open model requires more careful prompt engineering or post-processing to approach commercial model results.
Together offers a free tier ($5 credit on signup) and a free shared endpoint for models under 13B parameters, but rate limits are aggressive (5 RPM / 100K tokens daily). No long-term contracts are required, but pricing changes without notice have occurred previously.
Honest comparison
| Dimension | Together AI (open models) | Commercial APIs (OpenAI, Anthropic, Google) | |-----------|----------------------------|---------------------------------------------| | Model diversity | Wide, many models available quickly | Limited to vendor’s own models | | Per-token cost (equivalence) | 30–60% cheaper for models of similar scale | Higher, but often delivers higher benchmark scores | | Performance/latency | Good for popular models; long tail is inconsistent | Highly optimized; consistently low P50 latency | | Reliability & SLA | ~99.5% uptime; support response measured in hours | >99.9% with fast support (enterprise plans) | | Customization | Fine-tuning and full model access, transparent weights | No fine-tuning on strongest models; limited to prompt engineering | | Safety & alignment | Models may produce more toxic or unaligned outputs; user responsible | Built-in RLHF and content filters; lower risk for sensitive applications | | Ecosystem integration | OpenAI-compatible API, but feature coverage varies | Full SDK support, mature tooling, ubiquitous |
Key trade-off: Open models on Together give you control, transparency, and cost savings, but you absorb the burden of quality assurance, safety mitigation, and the operational risk of a smaller platform. Commercial APIs trade freedom for a polished, reliable, and safer default experience.When to use
Use Together AI when:
- You need to run a specific open-source model that is not served by commercial APIs (e.g., a niche fine-tune, a model with permissive license for your use case).
- You plan to fine-tune on proprietary data and want a single platform for training and deployment.
- Your application can tolerate occasional API latency spikes or brief outages, or you can implement fallback strategies.
- Per-token cost is the primary constraint and you have done internal evaluations confirming that the open model’s quality meets your threshold.
- The application is customer-facing and requires sub-200ms p99 latency with 99.99% uptime.
- Your team lacks the resources to test and handle per-model behavioral differences.
- Safety and alignment are critical (e.g., medical advice, legal document generation) and you cannot invest in extensive guardrails.
Last verified: 2026-06-08 by kernel.