Is Opus 4.5 Really 'the Best Model In The World For Coding'? It Just Failed Half My Tests

Screenshot by David Gewirtz/ZDNET



ZDNET's key takeaways

  • Opus 4.5 failed half my coding tests, despite bold claims
  • File handling glitches made basic plugin testing nearly impossible
  • Two tests passed, but reliability issues still dominate the story

I've got to tell you: I've had reasonably okay coding results with Claude's lower-end Sonnet AI model. But for some reason, its high-end Opus model has never done well on my tests.

Usually, you expect the super-duper coding model to code better than the cheap seats, but with Opus, not so much.

Also: Google's Antigravity puts coding productivity before AI hype - and the result is astonishing

Now, we're back with Opus 4.5. Anthropic, the company behind Claude, claims, and I quote, "Our newest model, Claude Opus 4.5, is available today. It's intelligent, efficient, and the best model in the world for coding, agents, and computer use."

The best model in the world for coding? No, it's not. At least not yet.

Those of you who've been following along know that I have a standard set of four fairly low-end coding tests I put the AI models through on a regular basis. They test a bunch of very simple skills and framework knowledge, but they can sometimes trip up the AIs.

Also: How I test an AI chatbot's coding ability - and you can, too

I'll give you the TL;DR right now. Opus 4.5 crashed and burned on one test, turned in a mediocre and not-quite-good-enough answer on the second, and passed the remaining two. With a 50% score, we're definitely not looking at "the best model in the world for coding."

Let's dig in, and then I'll wrap up with some thoughts.

Test 1: Writing a WordPress plugin

Test 1 asks the AI to build a simple WordPress plugin that presents an interface in the admin dashboard and then randomizes names. The only hard part is that if there is more than one matching name, they are separated, but all the names still show in the list.
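To give you a sense of the tricky part, keeping duplicate names apart while still showing all of them, here's a minimal JavaScript sketch of that requirement. This is my own illustration of the test's goal, not the code Opus produced, and the function name `randomizeLines` is a placeholder of my choosing:

```javascript
// Hypothetical sketch of the test's core requirement: shuffle a list of
// names so that duplicate names end up separated whenever the counts
// allow it. Illustrative only; not the plugin code Opus generated.
function randomizeLines(names) {
  // Fisher-Yates shuffle, so ties between equal counts break randomly.
  const shuffled = [...names];
  for (let i = shuffled.length - 1; i > 0; i--) {
    const j = Math.floor(Math.random() * (i + 1));
    [shuffled[i], shuffled[j]] = [shuffled[j], shuffled[i]];
  }

  // Count the remaining occurrences of each name.
  const counts = new Map();
  for (const n of shuffled) counts.set(n, (counts.get(n) || 0) + 1);

  // Greedily emit the most frequent remaining name that differs from
  // the previous entry, which keeps duplicates apart when possible.
  const result = [];
  while (result.length < names.length) {
    const candidates = [...counts.entries()]
      .filter(([, c]) => c > 0)
      .sort((a, b) => b[1] - a[1]);
    const pick =
      candidates.find(([n]) => n !== result[result.length - 1]) ??
      candidates[0];
    result.push(pick[0]);
    counts.set(pick[0], pick[1] - 1);
  }
  return result;
}
```

Every name stays in the output; only the ordering changes, and no two identical names sit next to each other unless one name makes up more than half the list.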

Also: The best free AI for coding in 2025 - only 3 make the cut now

Opus 4.5 went to town writing this plugin. I've seen builds that were done in a single, simple PHP file and worked just fine. But it is possible to use a combination of PHP for the back end, JavaScript for the interactive bits, and CSS for styling. That's what Opus did.

Opus wrote a 312-line PHP file, a 178-line JavaScript file, and a 133-line CSS file. Or, at least, it did the second time around.

For its first trick, Opus 4.5 combined all three files into one that it said I could download and simply install. Except I couldn't download the file. I tried a few times, and Opus 4.5 kept responding with "Failed to download files."

[Screenshot: the "Failed to download files" error. Screenshot by David Gewirtz/ZDNET]

Then I tried getting at the files using the Files Workspace. I clicked on "View the Line Randomizer plugin folder" in the Opus 4.5 response window, only to get a large, empty screen with the phrase "No file content available."

[Screenshot: the "No file content available" screen. Screenshot by David Gewirtz/ZDNET]

Okay, fine. After I pasted in my original test prompt, I watched Opus 4.5 display the code as it was being generated. Once it finished, the code was hidden. Presumably, Opus 4.5 just expected the download to work.

To get at the actual code, I had to ask Opus 4.5:

Give me each of the three files separately, so I can cut and paste them from here.

It did. The PHP code was in its own little window area, where I could cut it out and paste it into my text editor. So was the CSS code. But the JavaScript code included some documentation (not commented out) about the recommended file structure.

[Screenshot: the JavaScript file with uncommented documentation text. Screenshot by David Gewirtz/ZDNET]

Had I not quickly taken a look at the full file's code to see what it was doing, I might have just tried running that. Without a doubt, that would have resulted in a fail.

Also: OpenAI's Codex Max solves one of my biggest AI coding annoyances - and it's a lot faster

There was, however, some good news. After all that fussing and removing the spurious documentation lines that would have killed it, I did manage to get the WordPress plugin to load and present a user interface.

[Screenshot: the plugin's user interface. Screenshot by David Gewirtz/ZDNET]

Given that it was being styled by 133 lines of CSS, you would think it might look a little better, but hey, at least something worked. Well, not really.

Once I pasted in my test names, I clicked on Randomize Lines. Nothing happened. Clear All didn't work either.

Also: How to vibe code your first iPhone app with AI - no experience necessary

Let's recap just how many ways this failed. It wouldn't download when it told me it was giving me a download link. Then I asked for the code separately to cut and paste. It mixed the chatbot response into the code. Then, when I pulled that out and ran the test, the actual run didn't work. It presented a UI, but wouldn't actually run the code.

As the Mythbusters used to say, "Failure is always an option."

Test 2: Rewriting a string function

Test 2 asks the AI to fix a simple bit of JavaScript that incorrectly validates the entry of dollars and cents currency. What I feed the AI is code that won't allow for any cents. It's supposed to give back working code.

The idea of this function is that it checks user input. It was originally in a donation plugin, so its job was to make sure the donor was actually typing in an amount that could qualify as a donation amount, and wouldn't break on someone entering letters or numbers incorrectly.

Also: How to use ChatGPT to write code - and my top trick for debugging what it generates

The code Opus 4.5 gave back rejected too many edge case examples. It didn't allow "12." (two digits followed by a decimal point), though that would clearly work as $12. It didn't allow for ".5," though that would clearly work for 50 cents. It didn't like "000.5", though it did accept "0.5". And if someone typed "12.345", it didn't chop off the last half a cent (or round it up). It just rejected the entry.

Oh, and if there was no value passed to it, or the string value it was asked to test was actually null (an empty value), the code would crash. Not just return an error, but crash.
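For context, a validator that handles all of those edge cases doesn't take much code. The sketch below is my own illustration, not the test's actual function; the name `normalizeAmount` and the return convention (a normalized dollars-and-cents string, or null for bad input) are assumptions I'm making for the example:

```javascript
// Hypothetical sketch of a currency validator that survives the edge
// cases above. Not the test's real code; names and behavior assumed.
function normalizeAmount(input) {
  // Guard against a missing or null value instead of crashing.
  if (input === null || input === undefined) return null;

  const trimmed = String(input).trim();
  // Accept "12", "12.", ".5", and "000.5": digits with an optional
  // decimal point and optional fractional digits, or a bare ".5" form.
  if (!/^(\d+\.?\d*|\.\d+)$/.test(trimmed)) return null;

  // Round to the nearest cent rather than rejecting extra precision,
  // so "12.345" becomes a valid amount instead of an error.
  return Number(trimmed).toFixed(2);
}
```

With something like this, "12." comes back as "12.00", ".5" and "000.5" both come back as "0.50", "12.345" is rounded rather than rejected, and null or empty input returns null instead of throwing.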

That gives "the best model in the world for coding" its second failure.

Tests 3 and 4

Test 3 asks the AI to identify what's causing a bug in code, but it requires fairly good framework knowledge of how PHP and WordPress work. It's a multi-step analysis, where what seems obvious isn't the problem. The bug is baked deeper into how the framework works.

Opus 4.5 passed this test just fine.

Also: Why AI coding tools like Cursor and Replit are doomed - and what comes next

Test 4 asks the AI to work with three programs: AppleScript, Chrome, and a utility called Keyboard Maestro. Basically, it's asking Keyboard Maestro to interact with AppleScript to find and activate a specific tab in Chrome.

Surprisingly, because this test often trips up the AIs, Opus 4.5 aced this question. It understood Keyboard Maestro, and it didn't make the usual case sensitivity errors other AIs have made in the past.

Bottom line for Opus 4.5

Opus 4.5 is supposed to be Anthropic's grand work. In the agentic environment with Claude Code, and supervised by a professional programmer willing to ask Claude to rewrite its response until the code works, it might be pretty good.

I've been using Claude Code and Sonnet 4.5 in the agentic terminal interface with pretty awesome results. But the results are not always correct. I have to send Claude back to work three, four, five, six, even 10 times sometimes to get it to give me a workable answer.

Here, for this article, I just tested Opus 4.5 in the chatbot. I did send it back once to give me code I could actually access. But overall, it failed 50% of the time. Plus, in my first test, it demonstrated how it just wasn't ready for a simple chatbot interface.

Also: GitHub's new Agent HQ gives devs a command center for all their AI tools - why this is a huge deal

I'm sure Anthropic will improve this over time, but as of today, I certainly can't report that Opus 4.5 is ready for prime time. I shot a note out to Anthropic asking for comment. If the company gets back to me, I'll update this article with its response.

Stay tuned.

Have you tried Opus 4.5 or any of Anthropic's other models for hands-on coding work? How do your results compare with what I found here? Have you run into similar issues with file handling or code reliability, or has your experience been smoother? And where do you think these "best model in the world for coding" claims land based on your own testing? Share your thoughts in the comments below.


You can follow my day-to-day project updates on social media. Be sure to subscribe to my weekly update newsletter, and follow me on Twitter/X at @DavidGewirtz, on Facebook at Facebook.com/DavidGewirtz, on Instagram at Instagram.com/DavidGewirtz, on Bluesky at @DavidGewirtz.com, and on YouTube at YouTube.com/DavidGewirtzTV.
