Small Models Are Catching Up. Your AI Strategy Should Notice.

Do not buy model hype. Benchmark the work your company actually does, then route each job to the cheapest model that reliably clears the bar.

May 2026 · 12 min
cost per task, model routing, open-weight AI
[Image: Model routing board comparing frontier, small proprietary, and open-weight models by task quality and cost]

Smartest is the wrong benchmark

For the last few years the default enterprise pattern has been simple: send the work to the most capable proprietary model you can get, and worry about the economics later. That made sense when smaller models were mostly good for routing, extraction, classification, or demos. It makes a lot less sense now.

The bottom end of the model spectrum has moved. Smaller proprietary models, hosted open-weight models, and models that run on consumer or prosumer hardware are now good enough for real, scoped business work. Not everything, not frontier-level reasoning, not high-risk synthesis. But enough of the daily work that the economics are worth a hard look.

Forget whether a model can beat the best model in the world. The question that pays the bills is whether it can clear the quality bar for a specific job at a much lower cost per accepted task.

Start with the work

Most AI conversations still over-index on model prestige. A new model drops, a leaderboard reshuffles, a few demos make the rounds, and teams start asking whether they should move everything onto the latest release. That is an expensive way to run AI.

A better way is to start with the work people actually hand to models every day: drafting internal emails, summarizing meetings, classifying customer requests, pulling fields out of documents, writing first-pass reports, checking spreadsheet logic, running narrow financial analysis, building agent subroutines, transforming data, searching internal documents, and generating recommendations.

Those are different jobs, with different risk profiles, error tolerances, latency requirements, and cost ceilings. Running them all through the same model is not sophistication. It is the absence of architecture.

A cost-per-task benchmark changes the conversation

I have been running a benchmark across 100 real-world financial-analysis interactions. These are not abstract leaderboard puzzles. They are the kind of work people actually ask models to do: read intent, run the calculations, reason through a financial question, and come back with something consistent enough to use.

For this kind of task, I treat roughly 94 percent accuracy as the point where a model becomes even remotely viable without drowning you in review overhead. That number is not universal. A marketing rewrite can live with a lower bar; a regulated financial recommendation needs a much higher one plus human review. But for this benchmark, 94 percent is about where I start paying serious attention.
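As a minimal sketch of that viability check, assuming each benchmark interaction has already been graded as pass or fail: the function name `clears_bar` and the sample outcomes are hypothetical, and the 94 percent default is only the bar used for this financial-analysis workflow, not a general rule.

```python
def clears_bar(outcomes: list[bool], min_accuracy_pct: int = 94) -> bool:
    """Return True when the share of passed interactions meets the bar.

    `outcomes` holds one graded result per benchmark interaction.
    The comparison uses integers to avoid floating-point edge cases
    right at the threshold.
    """
    if not outcomes:
        raise ValueError("need at least one graded interaction")
    passed = sum(outcomes)  # True counts as 1, False as 0
    return passed * 100 >= min_accuracy_pct * len(outcomes)


# Illustrative data only: 100 interactions, 94 of them graded as correct.
example_run = [True] * 94 + [False] * 6
print(clears_bar(example_run))                       # meets the 94 percent bar
print(clears_bar(example_run, min_accuracy_pct=98))  # fails a stricter bar
```

The threshold is a parameter because the bar depends on the job: a lower value for low-risk rewriting, a higher one (plus human review) for regulated work.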

The proprietary leaders did well, as they should. The surprise was how close the smaller and open-weight models came. A locally run Gemma 4 26B A4B setup crossed the same practical viability threshold I use for this financial-analysis workflow. It did not beat the leader, but it may not need to.

Cost per token is the wrong economic unit

A provider rate card tells you what a million input or output tokens cost. It does not tell you what a finished business task costs, and that gap matters, because models do not burn tokens the same way. One solves a task in a short answer; another rambles. One needs three retries; another nails the schema on the first try.

Cost per accepted task is the metric that actually helps: model cost plus tool cost plus infrastructure cost plus retry cost plus review cost plus failure cost, divided by the outputs you actually accept.

A cheap model that needs constant review is not actually cheap. An expensive model that nails a high-risk task on the first pass can be worth every penny. A local model with zero marginal token cost is not free once you count hardware, power, monitoring, and engineering. And a frontier model rewriting routine internal emails all day is not a smart use of anything.
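The arithmetic above can be sketched in a few lines. This is an illustration under stated assumptions, not a finished costing tool: the `TaskCosts` fields mirror the cost components named in the text, and every dollar figure below is made up to show the shape of the calculation.

```python
from dataclasses import dataclass


@dataclass
class TaskCosts:
    """Total spend for one batch of tasks, in one currency."""
    model: float           # token or API charges
    tools: float           # search, code execution, other tool calls
    infrastructure: float  # hosting, hardware, power, monitoring
    retries: float         # spend on attempts that had to be rerun
    review: float          # human time spent checking outputs
    failures: float        # cost of errors that slipped through

    def total(self) -> float:
        return (self.model + self.tools + self.infrastructure
                + self.retries + self.review + self.failures)


def cost_per_accepted_task(costs: TaskCosts, accepted_outputs: int) -> float:
    """Divide all-in cost by the number of outputs actually accepted."""
    if accepted_outputs <= 0:
        raise ValueError("accepted_outputs must be positive")
    return costs.total() / accepted_outputs


# Hypothetical batch of 100 tasks on a cheap model with heavy review.
cheap = TaskCosts(model=2.0, tools=1.0, infrastructure=0.5,
                  retries=1.5, review=40.0, failures=5.0)
# Hypothetical batch of 100 tasks on a pricier model with light review.
premium = TaskCosts(model=20.0, tools=1.0, infrastructure=0.5,
                    retries=0.5, review=8.0, failures=1.0)

print(cost_per_accepted_task(cheap, accepted_outputs=80))    # 0.625
print(cost_per_accepted_task(premium, accepted_outputs=97))
```

With these invented numbers the model that looks ten times cheaper on the rate card ends up costing more per accepted output, because review dominates the total.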

Open-weight models are no longer only classification engines

For years I wanted small models to be good for more than classification, and most of the time they were not. They could route, label, summarize at a shallow level, and pull out fields. The moment a task needed multi-step reasoning, calculation discipline, or real domain context, the gap showed up fast.

That is changing. Gemma, Qwen, Llama, and the other open-weight families are turning into credible building blocks for scoped business workflows. They are not reliable across the board, and they still need evaluation. But they now form a middle layer that did not exist before, sitting between frontier APIs and bad local inference.

That middle layer matters because the enterprise model portfolio no longer has to be one premium model for everything. You can route email drafting, classification, extraction, report drafting, calculation checks, and agent subroutines to different tiers, based on how each one benchmarks and how risky it is.
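One way to express that routing rule in code is shown below: pick the cheapest tier whose benchmarked accuracy clears the bar for the job. It is a sketch under assumptions. The tier names, accuracies, costs, and quality bars are all hypothetical placeholders, and a real router would also weigh latency and risk.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ModelResult:
    """Benchmark outcome for one model tier on one task type."""
    model: str
    accuracy: float                # share of benchmark tasks passed, 0..1
    cost_per_accepted_task: float  # all-in cost, same currency for all tiers


def route(results: list[ModelResult], quality_bar: float) -> Optional[ModelResult]:
    """Return the cheapest tier that clears the bar, or None if none does."""
    eligible = [r for r in results if r.accuracy >= quality_bar]
    if not eligible:
        return None  # nothing qualifies: escalate to a human or a stronger tier
    return min(eligible, key=lambda r: r.cost_per_accepted_task)


# Hypothetical benchmark results for a single task type.
benchmark = [
    ModelResult("frontier", accuracy=0.98, cost_per_accepted_task=0.40),
    ModelResult("small-proprietary", accuracy=0.95, cost_per_accepted_task=0.08),
    ModelResult("local-open-weight", accuracy=0.94, cost_per_accepted_task=0.05),
]

print(route(benchmark, quality_bar=0.94).model)  # local-open-weight
print(route(benchmark, quality_bar=0.97).model)  # frontier
print(route(benchmark, quality_bar=0.99))        # None
```

Running the same function once per task type, each with its own bar, yields a routing table rather than a single default model.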

Benchmark the boring work

The biggest AI savings are not hiding in the debate over which model is most impressive. They are in the boring work your company does every day: emails, summaries, extractions, spreadsheet explanations, internal analyses, report drafts, customer-request classification, financial calculations, and agent subroutines. Benchmark that.

Once you know the jobs, you can measure the models against them, route each job to the right tier, and watch the economics shift. The sequence is dull, and it works.

Small models are not catching up because they suddenly got better than frontier models. They are catching up because most tasks never needed a frontier model to begin with.