Modal Review: Serverless GPU Compute for AI

What it does

Modal is a serverless compute platform that abstracts away cluster management, container orchestration, and GPU provisioning. You write Python functions (or entire classes) decorated with @app.function or @app.cls, and Modal handles scaling, scheduling, and container lifecycle. Under the hood, it runs on its own fleet of cloud instances across AWS, GCP, and Azure regions. It offers GPUs ranging from the A10G through the H100 and B200 (as of 2026), both on-demand and spot. Code runs in lightweight containers that scale from zero to hundreds of GPUs in seconds. Modal also provides persistent volumes, secret management, distributed training via NCCL, and built-in cron scheduling. The entry point is a Python script, but you can supply any OCI container image. Observability comes from structured logs and per-function metrics in the dashboard.

Who it's for

Primarily data scientists and ML engineers who need to deploy models into production but lack the bandwidth (or desire) to manage Kubernetes, Docker registries, or autoscaling policies. Teams with variable GPU demand – nightly batch jobs, occasional fine-tuning, bursty inference – benefit most. It is also a good fit for startups or internal tools where time-to-deployment trumps infrastructure control. Conversely, teams that have strict compliance requirements (HIPAA, SOC 2), need custom network policies, or require multi-cloud deployment with explicit vendor independence should look elsewhere. Modal is opinionated: you run their client, use their Python SDK, and accept their abstractions.

What works

Developer experience: modal run app.py launches a GPU instance within seconds (after initial cold start). Live reload during development, easy secret injection, and a seamless local-to-cloud transition.
Autoscaling: Modal can scale to hundreds of replicas in under a minute. It scales to zero when idle – no idle GPU cost. This is a major cost saver for unpredictable workloads.
GPU access: H100 and B200 GPUs are readily available without a capacity reservation, which is often not the case when requesting them directly from a cloud provider. Spot instances offer a ~50% discount; Modal handles preemption gracefully by automatically restarting tasks.
Batch processing and training: Modal's task queues and parallel map operations make it trivial to distribute work across many GPUs. Distributed training with NCCL works out of the box (limited to single-node multi-GPU as of 2026; multi-node is in beta).
Persistent volumes: Modal Volumes are NFS-like storage mounted at a path you choose, such as /vol. They survive app restarts and are region-aware. Good for model weights, datasets, and checkpoints. Throughput is decent for read-heavy workloads; write contention can be a bottleneck.
Secrets and environment management: modal.Secret.from_name(...) integrates with a secure vault. No manual .env file sharing.

What breaks

Cold start latency: First invocation after idle typically takes 10–15 seconds for GPU containers. For CPU-only functions it's faster (~2–4s). If your use case demands sub-100ms response times, Modal is not suitable unless you maintain warm pools (which incur cost). There is a keep-warm setting, @app.function(min_containers=1) (named keep_warm in older SDK releases), but it is a manual override.
Cost unpredictability at scale: Per-second billing is great for bursty jobs but can surprise you if a rogue function stays running due to a bug or slow network. There is no automatic budget cap (only a billing threshold alarm). For always-on inference, a reserved GPU instance on a cloud provider is significantly cheaper.
Limited to Python-first workflows: While you can bring your own container image, the core ergonomics are around Python. Running Java, Rust, or non-Python model servers requires extra wrapping. Modal's web endpoints (@modal.asgi_app() / @modal.wsgi_app()) support any ASGI/WSGI framework, but the cold start penalty applies.
Vendor lock-in: Your code is tightly coupled to Modal's SDK. Migrating to plain containers or Kubernetes would require rewriting entry points, environment setup, and secrets handling. There is no "export to docker-compose" option.
Observability gaps: Logs are aggregated but search is basic. There is no built-in tracing integration (you can add OpenTelemetry manually). Alerting is via email/webhooks only. For production SLAs, many teams end up shipping logs to an external provider.
No multi-region active replication: Modal runs a primary region; failover is manual. Global serving across multiple continents is not a first-class feature.
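
The cold start item above can be made concrete with a two-point latency model: each request is served either by a warm container or by one that has to cold start. This is a rough sketch, not a measurement. The 12-second cold start is simply the midpoint of the 10–15 second range quoted above, while the 80 ms warm latency and the cold-request fractions are hypothetical inputs chosen for illustration.

```python
def mean_latency(cold_fraction: float, warm_s: float, cold_s: float) -> float:
    """Average latency when `cold_fraction` of requests hit a cold container."""
    return cold_fraction * cold_s + (1.0 - cold_fraction) * warm_s


def tail_latency(cold_fraction: float, warm_s: float, cold_s: float,
                 quantile: float = 0.99) -> float:
    """Latency at `quantile` under the two-point (warm or cold) model.

    The quantile falls on the warm latency only if warm requests cover at
    least `quantile` of all traffic; otherwise it lands on a cold start.
    """
    return cold_s if cold_fraction > 1.0 - quantile else warm_s


if __name__ == "__main__":
    WARM_S = 0.08   # hypothetical warm-container latency (80 ms)
    COLD_S = 12.0   # midpoint of the 10-15 s GPU cold start range

    for cold_fraction in (0.005, 0.02):
        p99 = tail_latency(cold_fraction, WARM_S, COLD_S)
        avg = mean_latency(cold_fraction, WARM_S, COLD_S)
        print(f"cold={cold_fraction:.1%}  mean={avg:.3f}s  p99={p99:.2f}s")
```

Under this model, once more than 1% of requests land on a cold container the P99 is the cold start time itself, which is why latency-sensitive endpoints end up paying for warm containers.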

Pricing reality

Modal pricing is per-second, with a minimum charge of 1 second per invocation. As of 2026, approximate rates (these vary by region and contract):

CPU: ~$0.000013/second ($0.0468/hour)
GPU A10G: ~$0.00032/second ($1.152/hour)
GPU A100 (40GB): ~$0.00068/second ($2.448/hour)
GPU H100 (80GB): ~$0.00120/second ($4.32/hour)
GPU B200: ~$0.00178/second ($6.41/hour)
Spot GPU: ~50% of above rates (subject to availability)
Volumes: ~$0.05/GB/month + $0.01/GB transfer
Egress: first 1TB/month free, then $0.08/GB
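
To turn the per-second rates above into a job estimate, a small calculator is enough. The sketch below uses only the approximate figures listed in this review, not official pricing. The function names are hypothetical, and rounding each invocation up to a whole second is an assumption made to model the 1-second minimum; actual billing granularity may differ.

```python
import math

# Approximate per-second rates quoted in this review (USD), not official pricing.
RATES_PER_SECOND = {
    "cpu": 0.000013,
    "a10g": 0.00032,
    "a100-40gb": 0.00068,
    "h100": 0.00120,
    "b200": 0.00178,
}

SPOT_MULTIPLIER = 0.5  # "~50% of above rates" per the list above


def hourly_rate(resource: str) -> float:
    """Convert the quoted per-second rate to an hourly rate."""
    return RATES_PER_SECOND[resource] * 3600


def job_cost(resource: str, seconds: float, containers: int = 1,
             spot: bool = False) -> float:
    """Estimate the cost of running `containers` copies for `seconds` each.

    Assumption: durations are rounded up to whole seconds, with a
    1-second minimum per invocation.
    """
    billed_seconds = max(1, math.ceil(seconds))
    rate = RATES_PER_SECOND[resource] * (SPOT_MULTIPLIER if spot else 1.0)
    return billed_seconds * rate * containers


if __name__ == "__main__":
    print(f"H100 hourly: ${hourly_rate('h100'):.2f}")
    # A 30-minute fine-tuning run on 8 spot H100s.
    print(f"8x spot H100, 30 min: ${job_cost('h100', 1800, containers=8, spot=True):.2f}")
```

Storage and egress charges are left out; for long-running jobs with large checkpoints they would need to be added separately.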

Free tier: $30 in credits per month (requires credit card). There are no long-term commitments. For a 24/7 H100 inference workload running one replica, the on-demand cost is ~$3,110/month vs ~$2,400/month for a reserved H100 on AWS. Taken at face value, those two figures put the break-even at roughly 77% utilization ($2,400 / $3,110): below that, Modal is cheaper. However, the break-even point shifts lower if you need multi-GPU or keep idle warm pools. *Exact pricing varies; check modal.com/pricing at time of deployment.*
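
The on-demand versus reserved comparison can be reproduced from the two monthly figures in the paragraph above. This is a minimal sketch assuming a 30-day month (720 hours) and a single replica that costs nothing while scaled to zero. It ignores warm containers, spot pricing, storage, and egress, all of which move the practical threshold. With only these inputs the raw break-even lands near 77% utilization; always-on warm containers or slow scale-down push it lower.

```python
HOURS_PER_MONTH = 720  # 30-day month, which reproduces the ~$3,110 figure


def on_demand_monthly(hourly_rate: float, utilization: float) -> float:
    """Monthly on-demand spend when the GPU is busy `utilization` of the time."""
    return hourly_rate * HOURS_PER_MONTH * utilization


def break_even_utilization(hourly_rate: float, reserved_monthly: float) -> float:
    """Utilization at which on-demand spend equals the reserved price."""
    return reserved_monthly / (hourly_rate * HOURS_PER_MONTH)


if __name__ == "__main__":
    H100_HOURLY = 4.32         # on-demand rate quoted in this review
    RESERVED_MONTHLY = 2400.0  # reserved H100 figure quoted in this review

    print(f"24/7 on-demand: ${on_demand_monthly(H100_HOURLY, 1.0):,.0f}/month")
    print(f"break-even utilization: {break_even_utilization(H100_HOURLY, RESERVED_MONTHLY):.0%}")
    print(f"at 40% utilization: ${on_demand_monthly(H100_HOURLY, 0.4):,.0f}/month")
```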

Honest comparison

| Category | Modal | AWS SageMaker | RunPod | Kubernetes + Kubeflow |
|----------|-------|---------------|--------|-----------------------|
| Setup time | Minutes | Hours | Minutes | Days–weeks |
| Cold start GPU | 10–15s | Instant (endpoint) | 5–10s | Dependent on setup |
| Autoscaling | Built-in, to zero | Built-in, not to zero by default | Manual or pre-warmed | Requires K8s HPA |
| GPU types | A10G–B200 | All AWS GPUs | A100, H100, lower | All cloud GPUs |
| Cost model | Per-second | Per-hour (partial min) | Per-hour (min 0.5h) | Per-server (you pay idle) |
| Vendor lock-in | High (SDK) | High (AWS APIs) | Medium (standard containers) | Low (open source) |
| Observability | Basic + external | CloudWatch (deep) | Basic | Stack+Prometheus |
| SLA | None published | 99.9%+ | 99.5%+ | You control |

Modal excels at developer velocity and idle cost avoidance. It falls short for latency-sensitive serving and teams needing fine-grained infrastructure control.

When to use

Prototyping and rapid iteration: Deploy a model experiment to a real GPU endpoint in minutes, test with live traffic, then discard.
Batch inference or fine-tuning: Jobs that run for minutes to hours, can withstand cold start, and benefit from spot pricing.
Variable traffic inference: If your endpoint sees 100 requests an hour and zero the next, Modal saves you from paying for a 24/7 GPU.
Internal tools or demos: Low-stakes deployments where a 10-second cold start is acceptable.

Avoid Modal for:

Real-time serving with sub-200ms P99 latency requirements.
Always-on production with strict 99.99% uptime.
Regulated industries requiring data residency guarantees beyond cloud provider regions.
Multi-application workflows tied to existing CI/CD and container registries.

Last verified: 2026-06-08 by kernel.