ChatGPT Boasts a 91% Failure Rate for Office Tasks

Researchers have put various AI LLMs to the test to see how well they can handle office tasks. ChatGPT and others didn’t fare well.

One of the long-running themes I’ve seen throughout my news writing career is narratives running up against reality, and how often reality tends to win out over talking points. Things like Facebook having no choice but to pay for linking in Canada because it just can’t live without news links on its platforms. Then reality hits. Or the Digital Services Tax being totally in line with Canada’s international trade obligations, with nothing the US can do about it. Then reality hits. I could go on, but this has been a long-running theme I get to witness play out over and over again.

Another narrative I see from time to time is about AI. The narratives vary, but they include claims that AI is taking over essay writing, that AI is replacing engineers, or even that AI is going to cause humanity to go extinct. The overall impression is this idea that AI is exceeding human capabilities in every way and that we are on the cusp of humans running out of work any day now. Aaaaany day now.

Yet, whenever people practically put AI to the test, firmly believing these narratives, a very different reality emerges. Examples include fake legal citations in a legal brief, the CNET scandal, the Gannett scandal, bad journalism “predictions”, fake news stories, more fake news stories, Google recommending that people eat rocks, the 15% success rate story, bad chess tactics, the Chicago Sun-Times fiasco, a Canadian legal team getting in trouble with a judge over fake legal citations, and, more recently, another legal team discovering the hard way why AI-written legal briefs suck.

Some die-hard AI doomers or supporters will look at some of those examples and say, “Hey, some of those examples are years old. The technology has improved since then.” Well, we can fast forward to today and see how things are looking, but it appears that the story has not changed. A research paper was put together to find out how well AI is able to handle office tasks. The results? They weren’t good at all. From Futurism:

One of the flashier bits of tech attracting investors is “AI agents,” which are software products designed to complete multi-part tasks on behalf of their human taskmasters. Tech companies and big corporations have spilled tankers of ink hyping up these agents, insisting they will “replace knowledge work” and bring about a “fundamental shift in how businesses operate.”

But despite these lofty promises and the money behind them, there’s mounting evidence that AI agents are just the latest bit of empty tech industry promises.

In May, researchers at Carnegie Mellon University released a paper showing that even the best-performing AI agent, Google’s Gemini 2.5 Pro, failed to complete real-world office tasks 70 percent of the time. Factoring in partially completed tasks — which included work like responding to colleagues, web browsing, and coding — only brought Gemini’s failure rate down to 61.7 percent.

And the vast majority of its competing agents did substantially worse.

OpenAI’s GPT-4o, for example, had a failure rate of 91.4 percent, while Meta’s Llama-3.1-405b had a failure rate of 92.6 percent. Amazon’s Nova-Pro-v1 failed a ludicrous 98.3 percent of its office tasks.

Meanwhile, a recent report by Gartner, a tech consulting firm, predicts that over 40 percent of AI agent projects initiated by businesses will be cancelled by 2027 thanks to out-of-control costs, vague business value, and unpredictable security risks.

“Most agentic AI projects right now are early stage experiments or proof of concepts that are mostly driven by hype and are often misapplied,” said Anushree Verma, a senior director analyst at Gartner.

The report notes an epidemic of “agent washing,” where existing products are rebranded as AI agents to cash in on the current tech hype. Examples include Apple’s “Intelligence” feature on the iPhone 16, which it currently faces a class action lawsuit over, and investment firm Delphia’s fake “AI financial analyst,” for which it faced a $225,000 fine.

Yeah, the reality side of the story just stubbornly refuses to change – even after multiple years of flooding the zone with hype.

Ultimately, chat bots are designed for just that: chatting. If you want to talk to someone but don’t know anyone to talk to, you can type messages to a chat bot and the chat bot will respond. That is it. It wasn’t designed to be an all-knowing entity that delivers timely and accurate information at the push of a button. It wasn’t designed to take over writing tasks such as legal briefs or essays, much less replace actual journalists. What it was originally designed to be was someone to chat to. That’s it.

So, why do these AI LLMs (Large Language Models) fail so spectacularly? It’s simply because they were designed to write something that sounds like it was written by a human. Facts, sarcasm, and other nuances are just not things these models understand. If there is something missing in the data set, the model just makes something up (hence the term “hallucination”). As far as the program is concerned, if the output sounds like it was written by a human, it succeeded at its task. Giving you correct information is just not something it was specifically designed to do. If it happens to be accurate information, that’s a bonus, not an expectation.
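To make that concrete, here’s a minimal, purely illustrative Python sketch of what generation boils down to. The probability table below is invented for the example (a real LLM learns billions of weights from training text and conditions on far longer context), but the shape of the loop is the point: it picks whatever continuation looks most plausible, and no step ever checks whether the output is true.

```python
import random

# Toy next-token probabilities, invented for illustration only.
# A real LLM learns these from training data, but the generation
# loop looks essentially the same.
NEXT_TOKEN_PROBS = {
    ("the", "capital"): {"of": 0.9, "city": 0.1},
    ("capital", "of"): {"France": 0.5, "Atlantis": 0.5},  # fiction scores as "plausible" too
    ("of", "France"): {"is": 1.0},
    ("of", "Atlantis"): {"is": 1.0},
    ("France", "is"): {"Paris": 0.8, "Lyon": 0.2},
    ("Atlantis", "is"): {"Poseidonia": 1.0},  # confidently made up: a "hallucination"
}

def generate(prompt, max_tokens=5):
    tokens = prompt.split()
    for _ in range(max_tokens):
        context = tuple(tokens[-2:])      # condition on recent context
        dist = NEXT_TOKEN_PROBS.get(context)
        if dist is None:
            break                         # nothing more to say
        # Sample in proportion to plausibility. Note what is absent:
        # there is no lookup against a source of facts anywhere.
        choices, weights = zip(*dist.items())
        tokens.append(random.choices(choices, weights=weights)[0])
    return " ".join(tokens)

print(generate("the capital"))
# Sometimes prints "the capital of Atlantis is Poseidonia" -- fluent,
# human-sounding, and wrong, because "sounds right" is the only
# objective being scored.
```

Again, this is a toy bigram sampler, not how any production model is actually built. But nowhere in that loop does “true” or “false” appear, and that carries over to the real thing: accuracy falls out of the training data when you’re lucky, not out of the design.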

All of the above are reasons why I fully expect things to end badly for Rogers when they try to replace people with an AI chat bot, thinking that it’s going to save money in the long term. It won’t. Something is going to go wrong, and it’s going to increase liability for the company, making the cost-saving measures not worth it in the slightest.

Ultimately, this is not a surprising result. I know there are those out there who decry the criticism because the technology is “in its infancy”, but the reality is that this is a fundamental flaw that would require a fundamental reworking of the technology itself, right down to the foundation. What you’d need is an AI that can tell fact from fiction, sarcasm from honesty, and accurately handle the expectations of a business. That technology simply doesn’t exist yet, and expecting chat bots to handle the task is little more than foolish wishful thinking. We’re simply nowhere near that, and the hype we see in the media is just that: hype designed to sucker people into throwing money at these companies as they overpromise and underdeliver.

Drew Wilson on Mastodon, Twitter and Facebook.
