Modal

serverless compute for AI · by Modal

Modal Review: Serverless GPU Compute for AI

What it does

Modal is a serverless compute platform that abstracts away cluster management, container orchestration, and GPU provisioning. You write Python functions or classes decorated with @app.function or @app.cls, and Modal handles scaling, scheduling, and lifecycle. Under the hood, it runs on a fleet of cloud instances across AWS, GCP, and Azure regions. GPU options range from A10G through H100 and B200 (as of 2026), both on-demand and spot. Code runs in lightweight containers that scale from zero to hundreds of GPUs in seconds. Modal also provides persistent volumes, secret management, distributed training via NCCL, and built-in cron scheduling. The entry point is a Python script, but you can supply any OCI container image as the runtime environment. Observability comes from structured logs and per-function metrics in the dashboard.

Who it's for

Modal is aimed primarily at data scientists and ML engineers who need to deploy models into production but lack the bandwidth (or desire) to manage Kubernetes, Docker registries, or autoscaling policies. Teams with variable GPU demand – nightly batch jobs, occasional fine-tuning, bursty inference – benefit most. It is also a good fit for startups or internal tools where time-to-deployment trumps infrastructure control. Conversely, teams that have strict compliance requirements (HIPAA, SOC 2), need custom network policies, or require multi-cloud deployment with explicit vendor independence should look elsewhere. Modal is opinionated: you run their client, use their Python SDK, and accept their abstractions.

What works

What breaks

Pricing reality

Modal bills per second, with a minimum charge of 1 second per invocation. As of 2026, approximate rates (these vary by region and contract):

Free tier: $30 in credits per month (requires a credit card). There are no long-term commitments. For a 24/7 H100 inference workload running one replica, the on-demand cost is ~$3,110/month versus ~$2,400/month for a reserved H100 on AWS. On those two figures alone, Modal becomes cheaper below ~77% utilization (2,400 / 3,110). The break-even point shifts, however, if you need multiple GPUs or keep idle warm pools. *Exact pricing varies; check modal.com/pricing at time of deployment.*
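To make the utilization trade-off concrete, here is a minimal sketch of the arithmetic behind it. It uses only the two monthly figures quoted above and assumes a purely linear per-second model; it ignores warm pools, multi-GPU replicas, and the free credits, so its result can differ from an estimate that accounts for those. The 730-hour month, the round-up-to-whole-seconds behavior, and all function names are assumptions made for illustration, not Modal's published billing logic.

```python
import math

HOURS_PER_MONTH = 730  # assumption: average month length in hours

# Monthly figures quoted in the review (approximate, one H100 replica).
ON_DEMAND_FULL_TIME = 3110.0  # on-demand, running 24/7
RESERVED_MONTHLY = 2400.0     # reserved instance, paid whether used or not


def billed_seconds(run_seconds: float) -> int:
    """Seconds charged for one invocation.

    Assumes usage is rounded up to whole seconds and that the
    1-second minimum applies to every invocation.
    """
    return max(1, math.ceil(run_seconds))


def implied_hourly_rate(full_time_monthly: float) -> float:
    """Hourly rate implied by a 24/7 monthly cost."""
    return full_time_monthly / HOURS_PER_MONTH


def on_demand_monthly_cost(full_time_monthly: float, utilization: float) -> float:
    """Linear model: you pay only for the fraction of the month in use."""
    if not 0.0 <= utilization <= 1.0:
        raise ValueError("utilization must be between 0 and 1")
    return full_time_monthly * utilization


def break_even_utilization(full_time_monthly: float, reserved_monthly: float) -> float:
    """Utilization at which on-demand and reserved cost the same."""
    return reserved_monthly / full_time_monthly


if __name__ == "__main__":
    rate = implied_hourly_rate(ON_DEMAND_FULL_TIME)
    point = break_even_utilization(ON_DEMAND_FULL_TIME, RESERVED_MONTHLY)
    print(f"implied on-demand rate: ${rate:.2f}/hour")
    print(f"break-even utilization: {point:.0%}")
    for u in (0.25, 0.50, 0.75, 1.00):
        cost = on_demand_monthly_cost(ON_DEMAND_FULL_TIME, u)
        print(f"utilization {u:.0%}: on-demand ${cost:,.0f} vs reserved ${RESERVED_MONTHLY:,.0f}")
```

Substituting the current rates from the pricing page for the two constants gives a rough first-pass comparison; anything that keeps containers warm while idle pushes the effective utilization up and should be added to the on-demand side.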

Honest comparison

| Category | Modal | AWS SageMaker | RunPod | Kubernetes + KubeFlow |
|----------|-------|---------------|--------|-----------------------|
| Setup time | Minutes | Hours | Minutes | Days–weeks |
| Cold start GPU | 10–15s | Instant (endpoint) | 5–10s | Dependent on setup |
| Autoscaling | Built-in, to zero | Built-in, not to zero by default | Manual or pre-warmed | Requires K8s HPA |
| GPU types | A10G–B200 | All AWS GPUs | A100, H100, lower | All cloud GPUs |
| Cost model | Per-second | Per-hour (partial min) | Per-hour (min 0.5h) | Per-server (you pay idle) |
| Vendor lock-in | High (SDK) | High (AWS APIs) | Medium (standard containers) | Low (open source) |
| Observability | Basic + external | CloudWatch (deep) | Basic | Stack+Prometheus |
| SLA | None published | 99.9%+ | 99.5%+ | You control |

Modal excels at developer velocity and idle cost avoidance. It falls short for latency-sensitive serving and teams needing fine-grained infrastructure control.

When to use

Avoid Modal for:

Last verified: 2026-06-08 by kernel.