Scaling to a fleet
One agent is the starting line. A fleet is the point. This page covers the operational choices you face when you go from 1 → 5 → 20 → 100 agents.
When to add the next agent
Three honest signals you're ready:
- You trust the output. If you're still inspecting every diff the first agent produces, stop. Get to confidence first.
- The first agent has a steady queue. Check /admin/jobs. If it's idle more than 25% of the time, more agents won't help; more work will.
- You hit max_concurrent_tasks_per_agent. The Control Plane starts queuing. Time for more concurrency, which means more agents.
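The idle check in the second signal can be sketched as a small calculation. This is a minimal sketch, not how /admin/jobs computes anything: the `Busy` record, the one-hour window in the test, and both function names are assumptions made for illustration. It merges overlapping job intervals so that concurrent tasks are not counted twice, then compares the idle share against the 25% threshold from the list above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Busy:
    """One job's run time, in seconds from the start of the window (hypothetical shape)."""
    start: float
    end: float


def idle_fraction(busy: list[Busy], window_seconds: float) -> float:
    """Share of the window in which the agent ran no job at all."""
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    covered = 0.0
    cursor = 0.0  # everything before this point is already counted as busy
    for span in sorted(busy, key=lambda b: b.start):
        start = max(span.start, cursor)
        end = min(span.end, window_seconds)
        if end > start:
            covered += end - start
            cursor = end
    return 1.0 - covered / window_seconds


def ready_for_another_agent(
    busy: list[Busy], window_seconds: float, idle_threshold: float = 0.25
) -> bool:
    """True when the agent is idle no more than the threshold share of the window."""
    return idle_fraction(busy, window_seconds) <= idle_threshold
```

Feed it the job start and end times you read off /admin/jobs for a representative window; a single quiet hour is not a trend.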
How many agents do I need?
Two limits matter:
- max_agents on your license: the hard cap. Visible in agentina status on any host.
- Hardware: each agent uses ~200 MB of RAM idle, plus whatever its workload needs (a tester running Playwright wants 1–2 GB available). Budget host memory accordingly.
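As a rough illustration of how the two limits combine, the sketch below estimates how many agents fit on one host. The ~200 MB idle figure comes from the list above; the `reserved_mb` headroom for the OS and the function name are assumptions, so substitute the numbers from your own hosts.

```python
from typing import Optional

IDLE_MB = 200  # per-agent idle RAM, as stated above


def agents_that_fit(
    host_ram_mb: int,
    workload_mb_per_agent: int,
    reserved_mb: int = 1024,  # assumed headroom for the OS and other services
    license_cap: Optional[int] = None,  # max_agents from the license, if known
) -> int:
    """Number of agents one host can hold, limited by RAM and then by the license."""
    usable = host_ram_mb - reserved_mb
    per_agent = IDLE_MB + workload_mb_per_agent
    fit = max(usable // per_agent, 0)
    return fit if license_cap is None else min(fit, license_cap)
```

For example, a 16 GB host running testers that each want 2 GB gives `agents_that_fit(16384, 2048)`, and passing `license_cap` shows which of the two limits binds first.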
Practical sizing for a coder fleet on a typical repo:
| Repo size | Coders | Reviewers | Testers |
|---|---|---|---|
| < 50k LOC | 1–2 | 1 | 1 |
| 50k – 500k LOC | 3–5 | 1–2 | 1–2 |
| > 500k LOC | 5–15 | 2–3 | 2–4 |
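If you want the table in a script (for example, to sanity-check a provisioning plan), it can be encoded as a lookup. This is only a sketch: the function name is hypothetical, and treating exactly 500k LOC as the middle tier is an assumption about a boundary the table does not spell out.

```python
# (min, max) agents per role, copied from the sizing table above.
SMALL = {"coders": (1, 2), "reviewers": (1, 1), "testers": (1, 1)}
MEDIUM = {"coders": (3, 5), "reviewers": (1, 2), "testers": (1, 2)}
LARGE = {"coders": (5, 15), "reviewers": (2, 3), "testers": (2, 4)}


def suggested_fleet(loc: int) -> dict[str, tuple[int, int]]:
    """Return the suggested (min, max) agent counts per role for a repo size in LOC."""
    if loc < 0:
        raise ValueError("loc must be non-negative")
    if loc < 50_000:
        return SMALL
    if loc <= 500_000:  # assumption: the upper boundary belongs to the middle tier
        return MEDIUM
    return LARGE
```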
One host or many?
You can run multiple agents on one host (each is a separate systemd unit with its own state dir) or one agent per host. Both work. The trade-offs:
| Pattern | Pro | Con |
|---|---|---|
| One big host, many agents | Cheaper, simpler to monitor | One bad agent can starve neighbors; single point of failure |
| One agent per host | Isolation, easier capacity planning | More installs to keep current |
| Hybrid (typical) | indexer on its own host, coder fleet packed onto 1–2 bigger hosts | Two patterns to operate |
Installing N agents
Each install needs its own activation token. Mint one per intended agent in /portal/tokens; the first machine to redeem each token claims it.
To run multiple agents on the same host, override the state dir and the systemd unit name for each one.
Monitoring a fleet
The four numbers that matter, in order:
- Online ratio: agents heartbeating in the last 3 min ÷ total. See /portal. Healthy fleet: >95%.
- Job throughput: completed jobs per hour. See /admin/jobs. If it's dropping with the same workload, something's wrong upstream.
- Anomalies: unresolved findings. See /admin/anomalies. Aim for 0 unresolved at end of day.
- Version skew: how many agents are not on the latest release. See /admin/agents. If any agent is more than one minor version behind, plan an upgrade.
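Two of these numbers, online ratio and version skew, are simple enough to compute yourself from an export of the agent list. The sketch below assumes a hypothetical `Agent` record and `major.minor.patch` version strings; neither is confirmed by this page. It also assumes that being a full major version behind counts as needing an upgrade.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Agent:
    """Hypothetical record; the real export format may differ."""
    name: str
    last_heartbeat: datetime
    version: str  # assumed "major.minor.patch"


def online_ratio(
    agents: list[Agent], now: datetime, window: timedelta = timedelta(minutes=3)
) -> float:
    """Agents that heartbeated within the window, divided by the total."""
    if not agents:
        return 0.0
    online = sum(1 for a in agents if now - a.last_heartbeat <= window)
    return online / len(agents)


def major_minor(version: str) -> tuple[int, int]:
    major, minor, *_ = version.split(".")
    return int(major), int(minor)


def needs_upgrade(agents: list[Agent], latest: str) -> list[str]:
    """Names of agents more than one minor version behind the latest release."""
    latest_major, latest_minor = major_minor(latest)
    stale = []
    for agent in agents:
        major, minor = major_minor(agent.version)
        if major < latest_major or (
            major == latest_major and latest_minor - minor > 1
        ):
            stale.append(agent.name)
    return stale
```

A healthy fleet by the thresholds above has `online_ratio(...) > 0.95` and an empty `needs_upgrade(...)` list.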
Upgrading a fleet
The Control Plane sends every agent an update_available hint on heartbeat. You decide when to apply it.
The safe pattern, every time:
- Upgrade one agent. Let it run a full day.
- Inspect: did its job throughput hold steady? Did agentina status stay healthy?
- If both check out, upgrade the rest in batches of 25%. Watch each batch for an hour before the next.
Rollback is automatic on smoke-check failure and on systemd start failure — see /docs/updates for the full mechanism.
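The canary-then-batches pattern can be sketched as a plan generator. This is an illustration under assumptions: the function name is hypothetical, agents are plain name strings, and "25%" is read as 25% of the whole fleet (rounded up) rather than of the agents remaining after the canary. It only produces the order; applying the upgrade and the day-long and hour-long waits between steps are left to you.

```python
import math


def rollout_plan(agents: list[str], batch_fraction: float = 0.25) -> list[list[str]]:
    """Return upgrade waves: a single canary first, then fixed-size batches."""
    if not 0 < batch_fraction <= 1:
        raise ValueError("batch_fraction must be in (0, 1]")
    if not agents:
        return []
    canary, rest = agents[0], agents[1:]
    plan = [[canary]]
    # Batch size is a share of the whole fleet, never smaller than one agent.
    batch_size = max(1, math.ceil(len(agents) * batch_fraction))
    for i in range(0, len(rest), batch_size):
        plan.append(rest[i:i + batch_size])
    return plan
```

For a 20-agent fleet this yields one canary wave followed by four batches, so a full rollout takes at least a day plus a few hours of watching.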
Anti-patterns
- Upgrading the whole fleet at once. A bad release takes everyone down. Two-level rollback helps but doesn't replace canarying.
- Identical hosts. When something OS-level breaks, you lose every agent. Spread hosts across at least two AZs or regions for anything you can't afford to lose.
- No indexer in the fleet. Already mentioned. Worth repeating.
- Sharing one activation token across machines. Tokens are single-use; the second machine's install will fail. Mint one per agent.
Next
- Update lifecycle — the full upgrade mechanism, channels, rollback.
- Troubleshooting — every error message + what it means.
- Security model — read this before scaling past your own laptop.