What You'll Pay For AI Agents Will Be Wildly Variable And Unpredictable



ZDNET's key takeaways

  • AI's costs in terms of tokens soar when using agents.
  • Agents are inconsistent and can't predict their total token usage.
  • Users must demand price transparency and performance guarantees.

Among all the challenges of implementing agentic artificial intelligence, the least-understood issue is cost. The providers of AI, such as OpenAI, Google, and Anthropic, have price lists, but none of those listed prices tell users what the final bill will be to actually solve a problem.

The result, according to a recent study of costs from the University of Michigan and collaborating institutions, could be sticker shock: soaring and unpredictable agent costs.

The study, by lead author Longju Bai of Michigan and collaborators at Stanford University, All Hands AI, Google's DeepMind unit, Microsoft, and MIT, titled "How Do AI Agents Spend Your Money? Analyzing and Predicting Token Consumption in Agentic Coding Tasks," is, according to the authors, "the first systematic study on AI Agent token consumption."

The study was posted on the arXiv pre-print server.

It is noteworthy for having among its authors a prominent Stanford economist who has commented extensively on AI's impact on productivity, Erik Brynjolfsson.

The top-level finding is that agents consume orders of magnitude more tokens than turn-by-turn, simple, prompt-based chats -- think 3,500 times the number of tokens for an agent as for a series of prompts with ChatGPT.

Also: AI agents are fast, loose, and out of control, MIT study finds

A token is the basic unit of information processed by an AI model. It could be a part of a word, a whole word, or just a punctuation mark, depending on how a model chops data into pieces.
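To make that concrete, here is a toy splitter that chops text into word and punctuation pieces. It is only an illustration of the idea; real model tokenizers use learned sub-word schemes such as BPE, which can split a single rare word into several tokens.

```python
import re

def naive_tokenize(text: str) -> list[str]:
    # Toy tokenizer: whole words and individual punctuation marks.
    # Real tokenizers (e.g., BPE) also split rare words into fragments.
    return re.findall(r"\w+|[^\w\s]", text)

tokens = naive_tokenize("Agents consume tokens, lots of them!")
print(tokens)
# ['Agents', 'consume', 'tokens', ',', 'lots', 'of', 'them', '!']
```

Note that the comma and exclamation mark each count as their own token, which is why even short prompts can cost more tokens than a word count suggests.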

You might expect agents to cost more in tokens, but the study reveals more alarming facts. Two different models can have wildly different token costs for the same task. And the same model can have different costs each time it works on the same problem, using as many as twice the number of tokens on one occasion compared to another.

The worst part is that none of this can be predicted. Agents, Bai and team found, cannot reliably estimate how many tokens they will ultimately consume for a given task.

"Agentic tasks are uniquely expensive," they wrote, while much tokens don't needfully amended results. "Simply scaling token usage whitethorn not lead to higher execution performance," they wrote, and, "[AI] models systematically underestimate nan tokens they need. 

The rising costs and the uncertainty of success are in no way accounted for in today's price lists from OpenAI and others. The work suggests there is no easy fix to the matter. The best users can do is to set hard limits on agentic compute use, possibly causing agents to halt before completing tasks.
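A hard limit of that kind can be sketched as a wrapper around the agent loop that tallies tokens per step and aborts when a budget is exhausted. The step interface here is hypothetical -- real agent frameworks report usage in their own ways -- but the cutoff logic is the same.

```python
class BudgetExceeded(Exception):
    """Raised when an agent run exceeds its token budget."""

def run_with_budget(steps, max_tokens: int):
    """steps yields (result, tokens_used); result is None until the task is done.

    Stops the run, possibly before completion, once max_tokens is spent.
    """
    spent = 0
    for result, used in steps:
        spent += used
        if spent > max_tokens:
            raise BudgetExceeded(f"spent {spent} of {max_tokens} tokens")
        if result is not None:
            return result, spent
    return None, spent

# Simulated run: two exploratory steps, then a final answer.
fake_steps = [(None, 1_000), (None, 1_500), ("patch applied", 500)]
print(run_with_budget(iter(fake_steps), max_tokens=10_000))
# ('patch applied', 3000)
```

The trade-off the article describes is visible in the exception path: a tight budget protects the bill but can kill a run that was one step from finishing.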

(Disclosure: Ziff Davis, ZDNET's parent company, filed an April 2025 lawsuit against OpenAI, alleging it infringed Ziff Davis copyrights in training and operating its AI systems.)

The big picture is that users collectively will have to push back on OpenAI and the other vendors and demand some form of reliable cost estimation and guarantees of task performance.

We reached out to OpenAI, Google, and Anthropic for comment.

Counting token costs 

To study costs, Bai and team used the open-source agentic AI framework OpenHands, developed by scholars at the University of Illinois Urbana-Champaign and collaborating institutions. They used OpenHands to build agents, which they then tested on the open-source coding benchmark SWE-Bench. The SWE-Bench tasks are taken from real GitHub issues.

Also: AI agents of chaos? New research shows how bots talking to bots can go sideways fast

They first found the relative strengths of models. OpenAI's ChatGPT 5 and 5.2 "achieve strong accuracy at low cost," though they are not the most accurate. Anthropic's Claude Sonnet-4.5 achieved the highest accuracy but at higher token costs. Google's Gemini-3-Pro was somewhere in the middle. And the Kimi-K2 model from Chinese AI lab Moonshot may have the worst relative mix: the most tokens to achieve the lowest accuracy.

[Figure: token efficiency and accuracy by model. Image: University of Michigan]

The authors suggested the difference in tokens is based on unique properties of how models are architected: "The gap is not driven by task difficulty or by some models attempting harder problems. Instead, the same task is simply more costly for some models than others, reflecting a behavioral tendency of the model rather than a property of the problem."

But the issue is not one of better or worse models, because even the same model can take twice as many tokens to solve the same problem from one "run" of the task to the next.

"The astir costly runs double nan token and monetary costs of nan slightest costly runs," they observed, "suggesting that nan agent's token depletion has ample variances moreover erstwhile moving connected precisely nan aforesaid problem."

[Figure: maximum and minimum token use by various models. Image: University of Michigan]

The lesson is that more tokens don't necessarily get you better results. "Simply scaling token usage may not lead to higher execution performance," they wrote.

In fact, the authors found that generally work can get worse the longer an agent spends on a task. "Accuracy often peaks at intermediate costs and saturates at higher costs," they observed. "Agent behavior becomes increasingly unstable on more complex tasks."

Many models seem to search and search to solve a problem even when it's fruitless. "Models lack a reliable mechanism to recognize when a task is unsolvable and stop early," wrote Bai and team. "Instead, they continue exploring, retrying, and re-reading context, accumulating costs without progress."

Unable to predict costs

Those factors make "token usage prediction and agent pricing a fundamentally challenging task," wrote Bai and team. And, in fact, the bot itself cannot predict when asked to "introspect," they found.

Bai and team asked each AI agent to predict its tokens using the prompt: "I've uploaded a python code repository in the directory example repo. You are a TOKEN ESTIMATION agent. Estimate the token cost to fix the following issue description," and then the problem description, such as fixing a bug for a comparison function in code that fails.

What they found is that agents can approximate to a small degree how many tokens will be used, but their predictions tend to be too low.

"Models consistently underestimate nan tokens they need," wrote Bai and team. "The bias is particularly pronounced for input tokens, whose predictions enactment compressed moreover arsenic existent values turn into nan millions."

Watch those inputs

That last point, about input tokens, has a special prominence in the report. Bai and team found that input tokens, such as what's typed by the human user and what is retrieved via tools such as database searches, dominate the costs in tokens. The other two types of tokens, the output, which is generated, and the cached tokens held in memory from prior stages, are far less demanding.

"Strikingly, input tokens, not output tokens, predominate nan wide costs successful agentic coding."

The reason is that "agentic workflows accumulate the information from different sources and the same context gets fed into the models repeatedly." As a result, there is a "dramatically higher input/output ratio" for agentic AI than for single-prompt or multi-prompt AI sessions with a bot.
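A back-of-the-envelope calculation shows why that ratio matters for the bill. The per-million-token prices below are placeholders, not any vendor's actual rates; the point is only that when the same context is re-fed every step, input volume dwarfs output volume.

```python
# Placeholder prices in dollars per million tokens (NOT real vendor rates).
PRICE_PER_M = {"input": 3.00, "cached_input": 0.30, "output": 15.00}

def run_cost(tokens: dict[str, int]) -> float:
    """Dollar cost of a run given token counts by category."""
    return sum(tokens[k] / 1_000_000 * PRICE_PER_M[k] for k in tokens)

# Hypothetical agentic run: context is re-read at every step, so input
# (fresh plus cache-read) vastly outweighs the generated output.
agent_run = {"input": 4_000_000, "cached_input": 9_000_000, "output": 120_000}
chat_turn = {"input": 2_000, "cached_input": 0, "output": 800}

print(f"agent run: ${run_cost(agent_run):.2f}")   # $16.50
print(f"chat turn: ${run_cost(chat_turn):.4f}")   # $0.0180
```

Even with cache reads discounted to a tenth of the fresh-input price, the accumulated input tokens still account for most of the hypothetical agent run's cost.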

And, drilling down even further, the most costly input-token factor is when the agent retrieves prior information from memory. "We find that cache reads dominate both raw token volume and dollar cost," Bai and team wrote. "In every phase, cache-read input tokens are the largest category by a wide margin (Figure 8a), reflecting the cumulative reuse of prior context."

There will be a reckoning

Overall, the study results confirm my anecdotal experience with coding agents such as Replit and Lovable, where the meter was constantly running to use the underlying AI models, and I had no sense of what the total cost would be.

What can be done? The authors don't have many suggestions. One suggestion is that even if agents can't predict the number of tokens, they can make some guesses at a high level, a "coarse-grained" estimate for token cost. "This suggests that agent-driven estimation can potentially support early budget alerts before launching costly runs, improving cost transparency without overpromising precise token-level accuracy," they wrote.
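A coarse-grained estimate is useful precisely because it only has to land in the right bucket, not on the right number. A minimal sketch of such a pre-launch alert, assuming the agent supplies an order-of-magnitude token estimate, might look like this:

```python
# Pre-launch budget alert from a coarse token estimate. The bucket
# thresholds are illustrative, not taken from the study.
BUCKETS = [(100_000, "small"), (1_000_000, "medium"), (10_000_000, "large")]

def classify(estimated_tokens: int) -> str:
    """Map a rough token estimate to a coarse cost bucket."""
    for ceiling, label in BUCKETS:
        if estimated_tokens <= ceiling:
            return label
    return "very large: confirm before launching"

print(classify(40_000))      # small
print(classify(3_500_000))   # large
```

Because the paper found estimates skew low, a cautious deployment might also bump every estimate up one bucket before alerting.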

I can think of a few other sensible guidelines.

Since input tokens are the biggest cost element, one should think carefully about what can be controlled at input. The size of prompts is one factor that drives input tokens higher. The context window used with an agent, wider or narrower, affects token count at input. And the number of tools called by the agent, such as databases, will bring lots more input tokens into play.
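One simple way to control input is to trim the context an agent re-sends each step. The sketch below keeps only the most recent messages under a token cap, approximating token counts by word count for brevity; a real deployment would count with the model's own tokenizer.

```python
# Trim agent context to cap input tokens per step. Word count stands in
# for a real tokenizer here, which undercounts actual tokens.

def trim_context(messages: list[str], max_tokens: int) -> list[str]:
    """Keep the most recent messages whose combined size fits the cap."""
    kept, total = [], 0
    for msg in reversed(messages):        # newest first
        cost = len(msg.split())
        if total + cost > max_tokens:
            break
        kept.append(msg)
        total += cost
    return list(reversed(kept))           # restore chronological order

history = ["old log dump " * 50, "tool output", "user: fix the bug"]
print(trim_context(history, 10))
# ['tool output', 'user: fix the bug']
```

The cost of this approach is the one the article warns about with hard limits generally: an over-aggressive trim can drop context the agent needs, forcing it to re-retrieve the same information and spend the tokens anyway.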

Also: Can a newbie really vibe code an app? I tried Cursor and Replit to find out

There's only so much you can do as a user, however. Something more will have to be done on an industry-wide basis. The problems outlined are clearly those of a young industry, and one where vendors will have to be pushed by users to change practices.

The lack of transparency as to what an agent might cost to do a task is way too vague for enterprises that need to be able to plan investments in software. The burden is pushed onto the user to run agentic tasks in an experimental capacity over and over in order to get something like an average cost to use as an estimate for planning purposes.

And the lack of guarantees of success -- even after the agent burns through tokens -- is the most glaring problem. That means enterprises could waste huge amounts of money just running tokens.

Users collectively are going to have to push back on vendors such as OpenAI, Google, and Anthropic and demand price transparency and some form of guarantee that a task will be completed, or else the whole exercise of agentic AI may be dominated by cost overruns and failed implementations.

Such deep problems are probably already being encountered by early adopters. They may be content to pay such a high cost to be among the first to get an agentic edge. It's not a situation, however, that can lead to stable, dependable use of agentic AI.
