OpenAI tested GPT-5, Claude, and Gemini on real-world tasks - the results were surprising

2 months ago

OpenAI's new data aims to measure what AI is really doing for the economy

ZDNET's key takeaways

AI's efficacy at work is still proving lukewarm at best.
OpenAI's new data measures its GDP impact in certain tasks.
Companies are under pressure to justify their tools' existence.

Despite so many AI tools flooding the market, promising increased productivity and even fully automated work, their impact so far has been inconsistent at best. As a recent MIT report noted, 95% of enterprise AI projects have failed; elsewhere, bosses are getting unsatisfactory AI-generated "workslop" from their direct reports, adding hours of additional labor.

OpenAI's new evaluation, GDPval, intends to change that by "measuring how AI performs on real-world, economically valuable tasks," the company said in an announcement Thursday. Companies and third-party testers already use industry benchmarks and other evaluations to determine how capable models are at tasks like coding and math. However, these can lean more academic than would be realistic once models are deployed; GDPval intends to narrow that gap between theory and practice.

What GDPval measures

GDPval measures how models tackle 1,320 tasks associated with 44 occupations -- mostly knowledge work jobs -- across the top 9 industries that contribute more than 5% to US gross domestic product (GDP).

Using data from the May 2024 US Bureau of Labor Statistics (BLS) and the Department of Labor's O*NET database, OpenAI included some expected professions, like software engineers, lawyers, and video editors, as well as some less commonly touched by AI as of now, including detectives, pharmacists, and social workers.

According to OpenAI, the tasks were created by professionals with an average of 14 years of experience in relevant fields to reflect "real work products, such as a legal brief, an engineering blueprint, a customer support conversation, or a nursing care plan."

"Unlike other evaluations tied to economic value which focus on specific domains (e.g., SWE-Lancer), GDPval covers many tasks and occupations," OpenAI said. Rather than using text prompts, GDPval gives models files to reference and specifies multimodal deliverables like slides and documents to simulate what users would expect of them in a work environment.

"This realism makes GDPval a more realistic test of how models might support professionals," OpenAI added.

How models are performing

OpenAI had experienced professionals blindly grade outputs from OpenAI's GPT-4o, o4-mini, o3, and GPT-5 models, as well as Anthropic's Claude Opus 4.1, Google's Gemini 2.5 Pro, and xAI's Grok 4. Graders unknowingly compared them with human-generated outputs.

OpenAI supplemented this with an "autograder" AI system that predicts how humans will evaluate deliverables. The company said it will release the autograder as an experimental research tool for those who want to try it, though OpenAI cautions it's not as reliable as human graders and won't be replacing them anytime soon.

"We found that today's best frontier models are already approaching the quality of work produced by industry experts," OpenAI wrote. "Claude Opus 4.1 was the best performing model in the set, excelling in particular on aesthetics (e.g., document formatting, slide layout), and GPT-5 excelled in particular on accuracy (e.g., finding domain-specific knowledge)."

The research also showed performance "more than doubled from GPT-4o (released spring 2024) to GPT-5 (released summer 2025)," OpenAI added, indicating that model abilities are improving rapidly.

The kicker, of course, is cost.

"We found that frontier models can complete GDPval tasks roughly 100x faster and 100x cheaper than industry experts," OpenAI wrote. "However, these figures reflect pure model inference time and API billing rates, and therefore do not capture the human oversight, iteration, and integration steps required in real workplace settings to use our models."

Caveats

In the blog, OpenAI noted that GDPval is "an early step that doesn't reflect the full nuance of many economic tasks." It only conducts one-off evaluations, meaning it can't measure whether a model could complete multiple drafts of a task or successfully absorb context for an ongoing task. For example, GDPval currently can't evaluate whether a model could successfully edit a brief based on client feedback or redo data analysis around an anomaly.

OpenAI added the important note that work in the real world isn't always cut and dried -- not every task comes with an organized set of files or a clear directive. The human -- and deeply contextual -- work of exploring a problem through conversation and dealing with ambiguity or shifting circumstances can't be captured by something like GDPval at this stage.

"Most jobs are more than just a collection of tasks that can be written down," OpenAI said.

The company added that future iterations will try, though, by spanning more industries and harder-to-automate tasks, like those that involve interactive workflows or lots of prior context (something AI agents, for example, currently struggle with). OpenAI said it will release a subset of GDPval tasks for researchers to use in their own work and expand the project.

What comes next

OpenAI's conclusion from these results is more of what we've become used to hearing. AI will inevitably continue to disrupt the job market, as it already has, and can theoretically take on busywork to free up workers' time for more complex tasks.

"Especially on the subset of tasks where models are particularly strong, we expect that giving a task to a model before trying it with a human would save time and money," OpenAI said, perhaps unsurprisingly.

Despite noting how competitive models have become with human experts, OpenAI reiterated its familiar line: that it plans to democratize access to AI tools in order to keep "supporting workers through change, and building systems that reward broad contribution."

"Our goal is to keep everyone on the 'up elevator' of AI," the company wrote -- which, contradicting recent surveys, assumes everyone is having that experience to begin with.