In Brief
Posted:
12:26 PM PST · February 6, 2026
Image Credits: TechCrunch / Getty Images
Last month, I wrote about Mercor’s new benchmark measuring AI agents’ capabilities on professional tasks like law and corporate analysis. At the time, the scores were pretty dismal, with every major lab scoring under 25%, so we concluded lawyers were safe from AI displacement, at least for now.
But AI capabilities can change a lot in a couple of weeks.
This week’s release of Opus 4.6 shook up the leaderboards, with Anthropic’s new model scoring just shy of 30% in one-shot trials, and an average of 45% when given a few more cracks at the problem. Notably, the release included a bunch of new agentic features, including “agent swarms,” which may have helped with this kind of multi-step problem-solving.
Regardless, the score is a huge jump from the previous state-of-the-art, and a sign that progress on foundation models isn’t slowing down. Mercor CEO Brendan Foody, who was particularly impressed, said, “jumping from 18.4% to 29.8% in a few months is insane.”
The APEX-Agents Leaderboard
Thirty percent is still a long way from 100%, so it’s not like lawyers need to be worried about getting replaced by machines next week. But they should be a lot less confident than they were last month!