Picking the right model per agent: a practical guide — GroundPound.ai

Token costs still matter. The spread between the cheapest and most expensive models we route to is wide — from GPT-4o mini around $0.18/MTok input to Claude Opus 4.5 at $6/MTok input ($30/MTok output), with Haiku 4.5 ($1.20/MTok) and Sonnet 4.5 ($3.60/MTok) in between. Run a 6-agent team on the biggest model for everything and you'll pay meaningfully more than the recommended default mix at SMB volumes — often 2× or more — and perform no better on most lanes, because some of those agents benefit from speed more than they benefit from raw model capability.

That's why "which model should each agent use?" is the most common question new operators ask after they set up their first team — and why the unhelpful answer ("use the biggest one") is also the most expensive. The actually-helpful answer is below.

The mental model

GroundPound supports nine LLM providers (Anthropic Claude, OpenAI, Google Gemini, xAI Grok, Mistral, Together AI, plus Vertex AI Gemini, Azure OpenAI, and OpenRouter) and 33 model catalog entries across them. The right model per agent is not a brand preference; it's a function of three things:

The agent's job: Routing? Drafting? Reasoning? Refusing?
The agent's tool surface: How many tools, how complex, how many tool calls per turn?
The cost of being slightly wrong: Does a small error get noticed and corrected, or does it cause a downstream cascade?

Let's walk each.

Coordinators need precision, not size

A coordinator's job is classification. It reads an inbound message and decides into which lane it belongs — for a property team, MAINTENANCE / RENT / LEASE / LEGAL / TENANT / unknown; for a therapist's office, INTAKE / INSURANCE / SCHEDULE / RECALL / BILLING / unknown; for a publishing house, RIGHTS / ROYALTIES / PRODUCTION / AUTHOR / unknown. Then it routes.

This is a small task. A well-prompted Haiku-class model does it correctly 96-98% of the time on real prod data we've measured. Going to a Sonnet-class model improves accuracy by maybe 1 percentage point. Going to Opus or a frontier OpenAI model doesn't move the needle at all on this specific task.

So the right coordinator model is the smallest one that hits your accuracy target. Ours default to Claude Sonnet 4.5 because it gives us 99%+ on classification and the cost difference vs Haiku is negligible at coordinator-volume (a coordinator triages maybe 50-200 items a day, even at scale).

Rule:Coordinators don't benefit from frontier models. Use a small fast one. Save the budget for the specialists who actually do the work.

Bookkeepers need exactness

Roles that produce numbers — a rent-collection bookkeeper for a property manager, a CFO closing books for eight clients in parallel, a hardware-design firm rolling up a BOM, a tax practice carrying forward last year's balances — read ledgers or specs, reconcile sources, and produce numbers that get put in front of owners or auditors. A small math mistake is not a small mistake. It's a credibility-burning mistake.

For these agents, we default to the model with the strongest tool-use reliability AND the lowest hallucination rate on numerical / structured outputs. As of mid-2026, that's Claude Sonnet 4.5 or GPT-4o, depending on the org's BYOK setup. Both reliably round-trip numbers from a tool call to a structured output without inventing digits. Haiku-class models are noticeably worse at this — they sometimes "correct" a number that looked wrong to them, which is exactly the wrong instinct for an accounting (or BOM-costing, or tax-prep) role.

Rule:Roles where the output is a number need models with strong tool-use + low hallucination. Don't economize here.

Compliance agents need to refuse

A compliance / legal / regulated-domain agent's most important capability is saying "I'm not confident — escalate to operator." Models vary wildly here. Some happily generate plausible-sounding legal advice. Some default to "consult an attorney" too aggressively (which is also bad — it makes the agent useless). The right model strikes a balance: it gives correct guidance on clear-cut cases (FCRA notice timing for property managers, fair-housing protected classes, standard lease-clause meaning; HIPAA-adjacent patient-communication rules for a therapist's office; IRS notice-response deadlines for a tax practice) and refuses on uncertain ones with a useful escalation note.

We default these agents to Claude Sonnet 4.5 because it has the best calibration on our internal eval set. Opus 4.8 is slightly better but the cost-per-refusal isn't worth it — most of these turns ARE refusals.

Rule: Roles that need to refuse well need models with good calibration, not just good knowledge.

Drafting agents reward bigger models

A drafting agent — tenant relations for a property manager, author outreach for a publisher, recall-call scripts for a vet clinic, donor stewardship for a non-profit, an investor update for an early-stage founder — reads context and produces prose for human review. The output quality scales with model size in a way that classification, math, and compliance don't.

For these roles, the question becomes: do you spend the budget on a frontier model and approve fewer drafts because they're closer to ready, or run a cheaper model and budget for more iteration?

Our default: Sonnet for high-volume drafting (e.g., tenant relations or customer support handling 100+ messages a day), Opus for low-volume high-stakes (e.g., a quarterly investor update or a board memo). Per-operator policy via auto-tune.

Rule: Drafting is the one place size meaningfully helps. Match model size to the volume × stakes of the draft.

What "auto-tune" actually does

In production, GroundPound runs a multi-armed bandit per agent. Each agent has a set of eligible models (derived from your BYOK setup), and the platform routes requests across them with epsilon-exploration. The bandit watches three signals per (agent × model) pair: cost-per-success, latency-per-success, and a quality proxy derived from operator approvals.

The bandit then biases routing toward the model with the best expected value. If you're an operator who provided BYOK for 4 providers, the platform will keep all 4 in play and find the right per-agent winner empirically rather than asking you to decide a priori.

The mechanism is conservative: it requires at least 30 observations per arm before promoting, and operator-edits (you manually picked a model) are sticky for 7 days. So if you want to override the bandit, you can — and we won't fight you.

The frontier-model trap

A common mistake we see new operators make: pick Opus 4.8 for everything because "it's the best."

This is wrong for three reasons.

Cost. Already covered above: an all-Opus bill runs meaningfully higher than the recommended mix at SMB volumes — often 2× or more — with no quality loss on most lanes.

Latency.Opus is 2-3× slower than Sonnet on the same prompt. Coordinator latency matters — a coordinator that takes 8 seconds to classify an inbound message will visibly lag the operator's expectations. A coordinator that takes 2 seconds feels real-time.

The model isn't usually the bottleneck.Most of the time when an agent produces a bad output, the bug is in the prompt, the tool list, or the system policy — not the model. Upgrading to a bigger model can mask the underlying issue without fixing it. We've seen teams jump from Haiku to Opus and see the same failure mode persist, just more confidently expressed.

Practical defaults for a new team

If you're setting up your first GroundPound team and don't want to think about it:

Coordinator:Claude Sonnet 4.5 (or GPT-4o mini if you're OpenAI-aligned)
Specialists doing drafts: Claude Sonnet 4.5
Specialists doing math: Claude Sonnet 4.5 or GPT-4o
Compliance / legal / refusal-heavy roles: Claude Sonnet 4.5
High-volume routine roles (lots of triage, little drafting): Claude Haiku 4.5
Low-volume high-stakes drafting (investor updates, board memos): Claude Opus 4.8 with auto-tune off

Turn auto-tune on after you've been running for a week. Let the bandit refine from there. You can always switch back to manual.

What we believe, restated

The best model for an agent is the smallest one that does the job well. "Best" isn't a model name — it's a property of the (agent role × cost × latency × calibration) tuple. Auto-tune finds it empirically. Operators who override the auto-tune for a specific reason (regulatory, customer perception, preferred-vendor) should override; the platform respects the override. Operators who want to think about it once and never again should let auto-tune run and watch the cost-per-agent line tick down over the first 30 days.

If you want to see this in your own team, the per-agent model panel is in the Atlas → Models page after sign-in. Auto-tune logs every routing decision and shows you why it picked what it picked.