
5:00 PM PDT · July 23, 2025
A new AI coding challenge has revealed its first winner, and set a new bar for AI-powered software engineers.
On Wednesday at 5 p.m. PST, the nonprofit Laude Institute announced the first winner of the K Prize, a multi-round AI coding challenge launched by Databricks and Perplexity co-founder Andy Konwinski. The winner was a Brazilian prompt engineer named Eduardo Rocha de Andrade, who will receive $50,000 for the prize. But more surprising than the win itself was his final score: he won with correct answers to just 7.5% of the questions on the test.
“We’re glad we built a benchmark that is really hard,” said Konwinski. “Benchmarks should be difficult if they’re going to matter.” Konwinski has pledged $1 million to the first open-source model that can score higher than 90% on the test.
Similar to the well-known SWE-Bench system, the K Prize tests models against flagged issues from GitHub as a measure of how well models can deal with real-world programming problems. But while SWE-Bench is based on a fixed set of problems that models can train against, the K Prize is designed as a “contamination-free version of SWE-Bench,” using a timed entry system to guard against any benchmark-specific training. For round one, models were due by March 12th. The K Prize organizers then built the test using only GitHub issues flagged after that date, as in the sketch below.
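A minimal sketch of that idea, collecting only GitHub issues opened after a submission cutoff so no entrant could have trained on them. This is an illustration of the general approach, not the organizers' actual pipeline; the repository, label filter, and query shown here are hypothetical.

```python
# Illustrative only: gather GitHub issues created after a cutoff date,
# in the spirit of the K Prize's contamination-free benchmark design.
# The repo, label, and cutoff below are assumptions for demonstration.
import requests

CUTOFF = "2025-03-12"  # round-one model submission deadline
QUERY = f"is:issue label:bug created:>{CUTOFF} repo:psf/requests"

resp = requests.get(
    "https://api.github.com/search/issues",
    params={"q": QUERY, "per_page": 50},
    headers={"Accept": "application/vnd.github+json"},
    timeout=30,
)
resp.raise_for_status()

# Keep only post-cutoff issues, so models submitted before the deadline
# could not have seen them during training.
candidates = [
    {"repo": "psf/requests", "number": item["number"], "title": item["title"]}
    for item in resp.json()["items"]
]
print(f"{len(candidates)} post-cutoff issues collected")
```

The timed-entry design matters because it removes the usual ambiguity: any issue in the test set provably did not exist when the competing models were frozen.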
The 7.5% top score stands in marked contrast to SWE-Bench itself, which currently shows a 75% top score on its easier ‘Verified’ test and 34% on its harder ‘Full’ test. Konwinski still isn’t sure whether the disparity is due to contamination on SWE-Bench or just the challenge of collecting new issues from GitHub, but he expects the K Prize project to answer the question soon.
“As we get more runs of the thing, we’ll have a better sense,” he told TechCrunch, “because we expect people to adapt to the dynamics of competing on this every few months.”
It might seem like an odd place to fall short, given the wide range of AI coding tools already publicly available. But with benchmarks becoming too easy, many critics see projects like the K Prize as a necessary step toward solving AI’s growing evaluation problem.
“I’m quite bullish about building new tests for existing benchmarks,” says Princeton researcher Sayash Kapoor, who put forward a similar idea in a recent paper. “Without such experiments, we can’t really tell if the issue is contamination, or even just targeting the SWE-Bench leaderboard with a human in the loop.”
For Konwinski, it’s not just a better benchmark, but an open challenge to the rest of the industry. “If you listen to the hype, it’s like we should be seeing AI doctors and AI lawyers and AI software engineers, and that’s just not true,” he says. “If we can’t even get more than 10% on a contamination-free SWE-Bench, that’s the reality check for me.”
Russell is a freelance writer based in New York.