Microsoft built a fake marketplace to test AI agents — they failed in surprising ways


9:00 AM PST · November 5, 2025

On Wednesday, researchers at Microsoft released a new simulation environment designed to test AI agents, along with new research showing that current agentic models may be susceptible to manipulation. Conducted in collaboration with Arizona State University, the research raises new questions about how well AI agents will perform when operating unsupervised, and how quickly AI companies can make good on promises of an agentic future.

The simulation environment, dubbed the “Magentic Marketplace” by Microsoft, is built as a synthetic platform for experimenting on AI agent behavior. A typical experiment might involve a customer agent trying to order dinner according to a user’s instructions, while agents representing various restaurants compete to win the order.

The team’s initial experiments included 100 separate customer-side agents interacting with 300 business-side agents. Because the source code for the marketplace is open source, it should be straightforward for other groups to adopt the code to run new experiments or reproduce findings.
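To make that setup concrete, here is a minimal toy sketch of such a two-sided experiment in Python. It is a hypothetical illustration, not the actual Magentic Marketplace API: the names (`Offer`, `business_agent`, `customer_agent`) are stand-ins, and the customer agent follows a hard-coded rule rather than calling an LLM.

```python
# Toy sketch of a two-sided agent marketplace (hypothetical, not Microsoft's API):
# business-side agents post offers, and a customer-side agent picks one
# according to a user's instructions.
import random
from dataclasses import dataclass

@dataclass
class Offer:
    restaurant: str
    dish: str
    price: float

def business_agent(name: str) -> Offer:
    """Each business-side agent competes for the order by posting an offer."""
    dish = random.choice(["pad thai", "pizza", "ramen"])
    return Offer(restaurant=name, dish=dish, price=round(random.uniform(8, 25), 2))

def customer_agent(instructions: str, offers: list[Offer]) -> Offer:
    """Stand-in for the LLM-backed customer agent: here, a fixed rule
    (cheapest offer whose dish appears in the user's instructions)."""
    matching = [o for o in offers if o.dish in instructions] or offers
    return min(matching, key=lambda o: o.price)

if __name__ == "__main__":
    offers = [business_agent(f"restaurant_{i}") for i in range(300)]
    winner = customer_agent("order ramen for dinner", offers)
    print(f"Order placed with {winner.restaurant}: {winner.dish} at ${winner.price}")
```

In the real environment, the customer-side decision would be made by a model such as GPT-4o or Gemini-2.5-Flash rather than a fixed rule, which is where the weaknesses described below show up.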

Ece Kamar, managing director of Microsoft Research’s AI Frontiers Lab, says this kind of research will be critical to understanding the capabilities of AI agents. “There is really a question about how the world is going to change by having these agents collaborating and talking to each other and negotiating,” said Kamar. “We want to understand these things deeply.”

The initial research looked at a mix of leading models, including GPT-4o, GPT-5 and Gemini-2.5-Flash, and found some surprising weaknesses. In particular, the researchers found several techniques businesses could use to manipulate customer agents into buying their products. The researchers noticed a peculiar falloff in efficiency as a customer agent was given more options to choose from, overwhelming the agent’s attention space.
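A hedged sketch of how one might measure that falloff: sweep the number of competing offers and record how often the agent picks the objectively best one. The `call_agent` function below is a placeholder stub, not Microsoft’s evaluation code; in a real run it would prompt one of the models above and parse its selection.

```python
# Sketch of a choice-overload measurement (hypothetical harness):
# vary the number of offers and check how often the agent picks the best one.
import random

def call_agent(offers: list[dict]) -> int:
    """Placeholder: a real experiment would prompt GPT-4o, GPT-5,
    Gemini-2.5-Flash, etc. with the offers and parse its pick."""
    return random.randrange(len(offers))  # stub: random choice

def accuracy_at(n_options: int, trials: int = 200) -> float:
    hits = 0
    for _ in range(trials):
        offers = [{"id": i, "price": random.uniform(8, 25)} for i in range(n_options)]
        best = min(range(n_options), key=lambda i: offers[i]["price"])
        if call_agent(offers) == best:
            hits += 1
    return hits / trials

for n in (3, 10, 30, 100):
    print(f"{n:>3} options -> picked best offer {accuracy_at(n):.0%} of the time")
```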

“We want these agents to help us with processing a lot of options,” Kamar says. “And we are seeing that the current models are getting really overwhelmed by having too many options.”

The agents also ran into trouble when they were asked to collaborate toward a common goal, apparently unsure of which agent should play which role in the collaboration. Performance improved when the models were given more explicit instructions on how to collaborate, but the researchers still saw the models’ inherent capabilities as in need of improvement.


“We can instruct the models, like we can tell them, step by step,” Kamar said. “But if we are inherently testing their collaboration capabilities, I would expect these models to have these capabilities by default.”

Russell Brandom has been covering the tech industry since 2012, with a focus on platform policy and emerging technologies. He previously worked at The Verge and Rest of World, and has written for Wired, The Awl and MIT’s Technology Review. He can be reached at russell.brandom@techcrunch.com or on Signal at 412-401-5489.
