My 8 ChatGPT Agent Tests Produced Only 1 Near-perfect Result - And A Lot Of Alternative Facts

1 month ago

Last week, OpenAI unveiled Agent, its new tool that combines the capabilities of Deep Research and Operator. Operator was OpenAI's first attempt at a computer-using model, a model that can actually open windows and click on user interface elements. ChatGPT Agent can do that and more.

Right now, ChatGPT Agent is only available to $200/mo Pro tier subscribers and provides for 400 agent interactions per month. When the $20/mo Plus tier gains access to Agent, which should be today, those users will get 40 interactions per month.

(Disclosure: Ziff Davis, ZDNET's parent company, filed an April 2025 lawsuit against OpenAI, alleging it infringed Ziff Davis copyrights in training and operating its AI systems.)

I upgraded my plan from Plus to Pro just so I could test out the new Agent mode and report back to you. In this article, I'll show you detailed results from 8 comprehensive tests.

TL;DR test results

Before we go into the detailed tests, I'll start with some overall TL;DR observations.

Test count: In the last 2 days, I used 25 of the available 400 queries, for a total of almost 12 hours of hyper-uber-supercomputer use. No wonder this thing costs $200/month.

Nearly every query required a follow-on, so when it comes time for Plus users, don't assume you can give Agent 40 projects. More likely, you'll be giving it 20-25, and using the rest of your queries to convince the Agent to follow directions.

Result quality: In all my tests, Agent appeared to understand the problem. But it failed to produce useful results for most of the tests. That said, the final test produced results that can only be characterized as amazingly useful.

Project scale: Agent can't handle big projects, the kind of data analysis projects you really want an AI to be able to handle. It has trouble scrolling through web pages. It can't visit sites that have AI or robots.txt restrictions in place. And long processing exceeds session time allocations, even with the super top-of-the-line gold-pressed latinum Pro edition.

Presentation quality: One of the major selling points for Agent is its ability to create spreadsheets and presentations. It did okay with spreadsheets, but the graphic quality of the presentations was pretty rough. I expect this to change over time, but don't expect Agent to make presentations you can use without considerable cleanup.

Accuracy: AIs hallucinate. The OpenAI team cautioned about using Agent because of the new risks involved. While I did get back some results that were accurate, Agent also came back with unforced errors, results it could have easily tested and deemed inaccurate. But no such verification or validation occurred. That said, the final test was accurate and shows what this tech can do when it works.

Connectors: Agent comes with the ability to use connectors (via API calls) to link to Gmail, Google Calendar, Google Drive, Outlook, Dropbox, and more. I did not test out the connectors because of how often Agent hallucinates or does something fairly boneheaded. I just didn't feel comfortable enough to give Skynet access to my accounts. At least, not yet.

Limits: I was unable to use Agent in the MacOS app. I also found that Agent stalled hard when I tried to run it in multiple Chrome tabs at once. For now, you launch an Agent process and wait. It's not like Codex, where you can launch a bunch of projects and come back later and harvest all the results. But since that capability exists in Codex, I'm sure it will show up soon in Agent.
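For readers who want to play with the query-budget arithmetic from the test count note above, here is a rough back-of-the-envelope sketch in Python. The function names are hypothetical, and the assumption that every project burns a fixed number of queries is a simplification. The only figures taken from the article are the 40 and 400 interaction quotas and the 20-25 project estimate.

```python
def projects_per_quota(quota: int, queries_each: int) -> int:
    """Whole projects that fit in a quota if each one uses a fixed number of queries."""
    if quota < 0 or queries_each < 1:
        raise ValueError("quota must be >= 0 and queries_each must be >= 1")
    return quota // queries_each


def average_queries_per_project(quota: int, projects: int) -> float:
    """Average queries available per project when a quota is spread over N projects."""
    if projects < 1:
        raise ValueError("projects must be >= 1")
    return quota / projects


# Assumption: one original prompt plus one follow-on per project (2 queries each).
plus_projects = projects_per_quota(40, 2)
pro_projects = projects_per_quota(400, 2)

# The optimistic end of the 20-25 estimate implies fewer follow-ons per project.
plus_average_at_25 = average_queries_per_project(40, 25)

print(plus_projects, pro_projects, plus_average_at_25)
```

With one follow-on per project, a Plus quota covers 20 projects. Stretching it to 25 projects leaves an average of 1.6 queries per project, so only some of them could get a second try.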

That should give you a pretty good overview. Let's get started looking at the 8 test results. For each result, I've included a link to the session recording, so you can see the prompts I used, the detailed results, and watch Agent reason its way through the problem.

Also, definitely read to the end. Some of the early results are fairly bad, but the last one knocks it out of the park. And with that, here we go.

1. Selecting products on Amazon

Understanding of the problem: Solid
Execution: Both good and bad
Hallucination: Weird religion reference, fake Amazon links
Processing time: 20 + 12 minutes

When OpenAI introduced ChatGPT Agent, the team demoed how they used the tool to shop for wedding apparel and a wedding gift. That seemed like a fairly uncommon and impractical application for a super-intelligence, especially since gift registries exist and are widely used.

Instead, I gave Agent a purchasing task I had actually extensively researched and completed a few months earlier. I'm running Power-over-Ethernet cables all across my yard to upgrade my security system. As such, I'm creating a lot of custom cables. I already know that doing so requires some key tools: a cutter to cut the cable, a cable end stripper, a crimper to attach the RJ-45 ends, and a tester to confirm that long cable runs work.

I gave Agent a prompt asking for 3 configurations: a budget toolset, a "money-is-no-object" solution, and a sweet spot solution. I asked for links, product descriptions, and product images.

Once you give Agent your prompt, it creates a virtual desktop. You can watch it conducting its activities, jumping between a desktop view, a text view, and code.

The budget solution turned out to be a win. Agent found a single $34 kit with everything I asked for. It presented a link, and even its reasoning for why it chose that solution. Unfortunately, the image it provided was nothing like the actual kit.

The mid-tier and top-tier solutions were less than perfect. None of the links worked. The mid-tier sweet spot solution did have a product-accurate image, but without a link, it wasn't really helpful.

Unfortunately, the model recommended doesn't actually exist on Amazon. In fact, none of the mid- or upper-tier products exist on Amazon. It looks like Agent did a bunch of web surfing to find the products, disregarding my instructions to search only on Amazon.

It also clearly visited other sites, probably gathering model names and descriptions.

Then, when it packaged up its final recommendations, it just assigned random Amazon links to the descriptions, even though those products and those links don't appear to exist on Amazon.

I did ask it to go back and try again. When it did, after 12 minutes, it presented most of the same products, though one of the links that had failed earlier did, in fact, point to a product on Amazon in the second run.

I can't leave this section without pointing out something just plain weird. As I was watching Agent work, it presented this in its desktop view. I don't even want to know.

You can watch a replay of the entire session here.

2. Comparing egg prices

Understanding of the problem: Solid
Execution: Did what I asked
Hallucination: My fault for imprecise prompting
Processing time: 14 minutes

In discussing ChatGPT Agent, OpenAI showed a slide that mentioned Instacart as one of the examples that the chatbot is comfortable working with. Since my family regularly uses Instacart, I decided to set Agent loose and see what it could tell me about egg prices at our local stores.

I didn't let Agent have access to my account, but I shared my ZIP code here in Salem, Oregon. I told it to "Please visit all the grocery stores on Instacart and compare egg prices."

It did exactly that. You've heard the phrase Garbage In, Garbage Out. Well, that's what happens when you ask an AI to look at "all the grocery stores." I should have asked it to look in a 5 or 10 mile radius only. But I didn't.

Agent came back with 21 stores, ranging from nearby to up to almost 47 miles away. It did do what I asked, comparing egg prices. Without prompting, it decided to rank the eggs by price. This was good. But when it chose the eggs to rank, it didn't always choose the least expensive product from each store.

For example, it recommended the Good & Gather eggs from Target at $2.99 a dozen, rather than the $1.99/dozen Market Pantry eggs, also from Target.
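To make the ranking mistake concrete, here is a minimal Python sketch of the step Agent appears to have skipped: pick the cheapest offer within each store first, then rank the stores. The `Offer` class and `cheapest_per_store` function are hypothetical names, and the only data is the pair of Target prices quoted above. This is not how Agent works internally, just an illustration of the expected logic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Offer:
    store: str
    product: str
    price_per_dozen: float


def cheapest_per_store(offers):
    """Keep the lowest-priced offer for each store, then rank stores by that price."""
    best = {}
    for offer in offers:
        current = best.get(offer.store)
        if current is None or offer.price_per_dozen < current.price_per_dozen:
            best[offer.store] = offer
    return sorted(best.values(), key=lambda o: o.price_per_dozen)


# The two Target offers mentioned in the article.
offers = [
    Offer("Target", "Good & Gather", 2.99),
    Offer("Target", "Market Pantry", 1.99),
]

ranked = cheapest_per_store(offers)
for offer in ranked:
    print(f"{offer.store}: {offer.product} at ${offer.price_per_dozen:.2f}/dozen")
```

Run against those two offers, the sketch keeps the Market Pantry eggs as Target's entry, which is the product a price ranking should have used.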

You can watch a replay of the entire session here.

3. Creating a PowerPoint slide

Understanding of the problem: Solid
Execution: Added the correct data point
Hallucination: Was unable to reproduce graphic quality
Processing time: 10 minutes

Next up is a task I did early last week. With Congress focusing on Bitcoin, my editor asked me to update my Bitcoin investment article, where I've been tracking the value of a $50 Bitcoin investment since 2022.

The value of my holdings went up, which means I needed to add a new slide. Each slide adds a date value on the X axis and a value point on the Y axis. From a PowerPoint fiddling standpoint, that meant moving over the graphics to make room for the new value and, in this case, adjusting the vertical scale to accommodate a significant rise in value.

When I did it, it took me about 45 minutes. Since OpenAI said that PowerPoint was one of ChatGPT Agent's strengths, I wanted to see if Agent could save me that time in the future.

I uploaded my existing slide deck minus the last slide I made for the article. Then I asked Agent to create that slide for me.

As it worked, the desktop view showed the terminal interface. You can see how Agent is putting together the code to make a graphic image.

Here's what that slide should have looked like (note: foreshadowing).

Here's what Agent gave me.

To be fair, Agent clearly understood the problem. It moved the existing data points over to the left to make room for the new node. It also placed the new Bitcoin point properly in relation to the existing ones, and added both value and percent change text blocks.

That means Agent read and understood the context of my PowerPoint deck's layout. That, in and of itself, is very impressive.

But it failed on adding more scale lines and new Y-axis values. It failed on reproducing the fonts. It failed on properly placing the text blocks. And it pushed the entire graphic up and to the left of the slide.

I'm guessing the graphics library that Agent uses isn't really up to the task of making fine graphic changes. That will undoubtedly improve over time.

You can watch a replay of the entire session here.

4. Article categorization (method II)

Understanding of the problem: Solid
Execution: Failed due to exceeding allowable session time
Hallucination: Gave me back partial results
Processing time: 8 minutes + 3 minutes + 21 minutes

Each week for the last 2 years, I've published a newsletter that shares with followers the articles I published here on ZDNET for the week. Each newsletter contains a title, link, and article description.

By pointing Agent to my back issue archive, it would have close to 300 article summaries to categorize.

Unfortunately, Agent ran into a number of problems of its own making. It was unable to successfully scroll through the article list using JavaScript. When I told it to use the web interface, it started to, but it reported, "Unfortunately, I've reached the end of the allotted browsing sessions for this task, which means I'm unable to explore further pages and collect the additional data at this time."

Remember, I'm paying $200 a month for OpenAI's best plan, and it still won't give me enough time to look up 300 articles. That's a gotcha, right there. It's also disappointing because a task like scrolling back through an article archive and doing some tabulating is exactly the kind of task you might give to an assistant. If the AI gives up because it takes too long, then we can't really rely on AI for all the assistant-type things. No one wants a fussy, picky assistant.

In any case, Agent did give me back a spreadsheet and a slide based on the limited data it was able to find before my little request exceeded the hourly power budget for the City of Las Vegas (or so I imagine).

You can watch a replay of the entire session here.

5. Extract remembered text from video

Understanding of the problem: Partial
Execution: Didn't return full transcript on first run, correct on second run
Hallucination: Decided to do what it wanted on first run
Processing time: 2 minutes

I watch a lot of YouTube videos to augment my learning and research. Plus, nothing beats a good relaxing video about how pavers are made. While it's fairly easy to get a transcript of a full video, whether directly from YouTube or using Apple Voice Memos, locating where in a video a section you want to explore can take time.

Here's an example. When OpenAI introduced Agent in a video, CEO Sam Altman discussed some of the cautions and warnings about using ChatGPT Agent mode. I did remember they were near the end of the video, but I didn't want to spend time sifting through to get the exact quotes.

Instead, I delegated that task to Agent. On its first run, it found the section easily enough, but instead of returning a word-for-word transcript, it returned some quotes, interspersed with its own analysis.

I clarified what I wanted and, on its second run, it gave me exactly what I needed. In this case, though, it wasn't that my prompt was unclear. I just had to insist a second time that I wanted a transcript for the AI to do what I asked.

Unfortunately, this extra review cycle diminished the time-saving value to me. I still think using Agent was faster than if I sifted through the video myself. But I had to craft a second prompt and wait for a second result, all of which took my time.

Still, this is a helpful tool.

You can watch a replay of the entire session here.

6. Creating a trend analysis presentation

Understanding of the problem: Solid
Execution: Good, except for slide visual quality
Hallucination: Too much data to confirm or deny assertions
Processing time: 32 minutes

As part of my job, it's important to be able to keep up with ongoing tech and business trends. As such, I often spend days in deep dives, coming up to speed on new topics.

I wanted to see if ChatGPT Agent could save me some time by preparing a report and a full presentation on remote work trends. I told it that the PowerPoint was destined for my management team, so it should be comprehensive and professional-looking.

It returned an analysis document very similar to the results we've been getting from ChatGPT deep research. The report contains a large number of assertions and statistical claims, most of which I don't have time to research for confirmation.

Most of the top-level conclusions are congruent with my understanding of current work-from-home trends. That said, we're familiar with the model's propensity for hallucination, so I'd be very concerned about using any of this data professionally without further vetting.

Agent did produce a 17-slide PowerPoint deck that was organized quite well. As with previous experiments, the graphic generation quality was a bit off. The first slide actually looks quite good.

But later in the deck, it doesn't look right. Notice how the following slide has graphics on top of text, and bullets in front of bullets on top of still more bullets.

In the following slide, not only is the text running off the end of the page, but there's no legend. As such, it's not clear what's represented by red and by blue.

Once again, you can see how Python is used to construct the deck.

Agent does a decent job, so I'm fairly confident that the AI will get better over time. Programmatic construction of slides based on templates is not a new technology. I just don't think OpenAI prioritized slide presentation aesthetics as part of this release.

You can watch a replay of the entire session here.

7. Vetting a presentation for accuracy

Understanding of the problem: Solid
Execution: Good
Hallucination: Seems complete, but it's still from an AI
Processing time: 11 minutes + 7 minutes

Well, this was just plain fun. I decided to give the presentation created in the previous test to a new, fresh ChatGPT Agent session and asked it to validate the claims.

Agent concluded, "Several quantitative claims—especially those concerning productivity/innovation impacts, the size and growth of the gig economy, rates of side‑gig participation, and the influence of politics and culture—could not be verified with accessible evidence during this review."

Agent provided a detailed analysis of each assertion. I've summarized the results below.

Adoption timeline: Mostly confirmed
Global comparison: Confirmed
Workforce composition: Confirmed
Migration: Confirmed
Mobility of remote workers: Confirmed
Housing & local economies: Confirmed
Office vacancy & environmental impacts: Mostly confirmed
Social connections & wellbeing: Partly confirmed
Employer attitudes & return‑to‑office mandates: Mostly confirmed
Employee preferences & pay cuts: Mostly confirmed
Productivity & innovation: Partly confirmed
Gig economy & freelancing: Unverified
Freelancing motivations & challenges: Not strictly factual claims
Side gigs & multiple jobs: Unverified
Demographics & equity: Partly confirmed / mixed
Political & cultural influences: Partly confirmed / mostly unverified
Other factors & policy landscape: Generally accurate but qualitative

As you can see, of the 17 data points, Agent considered only 5 to be fully confirmed. Contrast this with how GPT-4o analyzed the results. When GPT-4o was given the same PowerPoint deck, it considered all assertions to be confirmed. You can see GPT-4o's detailed results here.
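If you'd like to check that tally yourself, here is a small Python sketch that counts the verdicts from Agent's summary. The dictionary is simply a transcription of the 17 topics and verdicts listed in this section, with topic names shortened for readability. The variable names are illustrative.

```python
from collections import Counter

# Verdicts transcribed from Agent's summary (topic names shortened).
verdicts = {
    "Adoption timeline": "Mostly confirmed",
    "Global comparison": "Confirmed",
    "Workforce composition": "Confirmed",
    "Migration": "Confirmed",
    "Mobility of remote workers": "Confirmed",
    "Housing & local economies": "Confirmed",
    "Office vacancy & environmental impacts": "Mostly confirmed",
    "Social connections & wellbeing": "Partly confirmed",
    "Employer attitudes & return-to-office mandates": "Mostly confirmed",
    "Employee preferences & pay cuts": "Mostly confirmed",
    "Productivity & innovation": "Partly confirmed",
    "Gig economy & freelancing": "Unverified",
    "Freelancing motivations & challenges": "Not strictly factual claims",
    "Side gigs & multiple jobs": "Unverified",
    "Demographics & equity": "Partly confirmed / mixed",
    "Political & cultural influences": "Partly confirmed / mostly unverified",
    "Other factors & policy landscape": "Generally accurate but qualitative",
}

counts = Counter(verdicts.values())
fully_confirmed = [topic for topic, verdict in verdicts.items() if verdict == "Confirmed"]

print(f"{len(fully_confirmed)} of {len(verdicts)} fully confirmed")
for verdict, n in counts.most_common():
    print(f"{n:2d}  {verdict}")
```

Only an exact "Confirmed" verdict counts as fully confirmed here, which is why the "Mostly confirmed" topics are tallied separately.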

Even though I used the AI to validate the AI, I probably wouldn't be comfortable using any of the presumed facts in my work without personal, Mark I Eyeball confirmation. Still, it was a fun exercise, and fascinating to see how different the results were between ChatGPT Agent and ChatGPT 4o.

You can watch a replay of the entire session here.

8. Analyze building code for fence installation

Understanding of the problem: Solid
Execution: Pretty close to perfect
Hallucination: None. It got all but one graphic just right
Processing time: 4 minutes

Back when we lived in Palm Bay, Florida, we lived on a corner property. The house came with what could only charitably be called a fence. We needed to replace it, and since we wanted privacy, we wanted to see just how much fence we could legally install.

Over the course of a couple of years, I spent a ton of time going back and forth with the planning office in an attempt to understand both what I could do with a fence and what other alternatives might be available to me.

Since I have a lot of history with this project and am very familiar with Palm Bay codes (even years after moving away), I decided to point ChatGPT Agent at the problem.

It took all of 4 minutes to provide a detailed, accurate analysis. It even created working diagrams that illustrated the options. Based on my experience, I know the results to be accurate.

ChatGPT Agent produced output that could be used to take this project to the next step. Back when I lived in Palm Bay, the equivalent probably took me 20 calls, a ton of emails, and a few visits to City Hall to come up with options. The level of presentation and explanation I came up with wasn't even close.

If Agent can up its game elsewhere to be on a par with this test, then it will have some legs.

You can watch a replay of the entire session here.

What's it all mean?

Well, it sure as heck isn't sentient yet. At best, it's like that administrative assistant you hired because your mom said you had to hire her cousin's unemployable slacker kid. There are occasional flashes of brilliance, but mostly the output seems like the result of both aggressively following directions and purposely inventing alternative facts.

Is it worth $200/month for the Pro plan? Not for Agent. At least not yet. Agent is unreliable and mostly performs fairly poorly. In a year or so, I'm sure it will get better. But now? No. The only reason to spend $200 a month on it is to do what I'm doing: testing it to see where the technology is today.

Stay tuned, because despite all the inaccuracies and problem areas, this definitely shows where AI technology could go. Of course, if a web browsing AI Agent is the future, and all the content sites out there block it because AI is stealing our content, then we'll have a very interesting problem.

It's early days, folks. Whether this is a technology that will be a boon to all humanity or a technology that destroys the internet and kills us in our sleep remains to be seen.

But hey, in the meantime, I and the rest of the ZDNET team will be trying to make sense of it all for you. So keep coming back. We'll have more to tell you. I'll be tinkering with Agent and I'm sure I'll have more to say as well.

Have you tried ChatGPT Agent yet? If so, did it follow your instructions accurately or veer off into its own interpretation of the task? Did it hallucinate or hit the mark? How do you feel about giving AI tools access to your files, accounts, or browser? Are you seeing more value in this kind of automation, or are you still waiting for it to become useful? Let us know in the comments below.

You can follow my day-to-day project updates on social media. Be sure to subscribe to my weekly update newsletter, and follow me on Twitter/X at @DavidGewirtz, on Facebook at Facebook.com/DavidGewirtz, on Instagram at Instagram.com/DavidGewirtz, on Bluesky at @DavidGewirtz.com, and on YouTube at YouTube.com/DavidGewirtzTV.