Alarm bells sound as report shows just how inaccurate AI currently is
One of the world’s biggest tech companies could see AI slowing its roll after a new report exposed just how careless and alarmingly inaccurate the technology in its current form can really be
READING LEVEL: RED
If you’re thinking AI is a fast track to getting your homework done, you might want to think again.
A bombshell report from Apple has called into question the reliability and true potential of AI, with the latest form of the cutting-edge technology unable to solve complex problems with a consistent level of accuracy.
Researchers from the tech giant put large reasoning* models – an advanced version of AI used in platforms like DeepSeek and Claude – through a series of puzzle challenges ranging from simple to complex. They also tested large language* models, which platforms like ChatGPT are built on.
Large language model AI systems fared better than large reasoning models with fairly standard tasks, but both performed poorly when facing more complex challenges, the paper revealed.
Researchers also found that large reasoning models began “reducing their reasoning effort*” as they struggled to perform. To put it in human terms, they put in less effort when tasks became too hard.
The paper described AI’s tendency to reduce effort when challenged as “particularly concerning.”
The advancement of AI, based on current approaches, might’ve reached its limit for now, the findings suggested.
But despite the company’s concerns, Apple’s latest iOS 26 update includes an AI system at its core, powering everything from live translation to personalised suggestions, smarter shortcuts, and even tools to clean up photos.
EPIC AI FAILS
Questions about the quality of large language and large reasoning models aren’t new.
For example, when OpenAI released its new o3 and o4-mini models in April, it described them as its “smartest and most capable” yet, trained to “think for longer before responding” and “setting a new standard in both intelligence and usefulness.”
But testing by the Massachusetts Institute of Technology (MIT) in the US revealed the o3 model was incorrect the majority of the time (51 per cent), while o4-mini performed even worse, with an error rate of 79 per cent.
Apple recently suspended its news alert feature on iPhones, powered by AI, after users reported significant accuracy errors.
Among the jaw-dropping mistakes was an alert with a false personal admission by tennis icon Rafael Nadal, and an announcement that a winner had been crowned at the World Darts Championship hours before competition began.
Research conducted by the BBC found a huge number of errors across other AI assistants providing information about news events, including Google’s Gemini, OpenAI’s ChatGPT and Microsoft’s Copilot.
It found 51 per cent of all AI-generated answers to queries about the news had “significant issues” of some form. When looking at how its own news coverage was being used, the BBC found 19 per cent of answers referring to its content were factually incorrect*.
And in 13 per cent of cases, quotes said to be contained within BBC stories had either been changed or entirely made up.
Meanwhile, a newspaper in Chicago was left red-faced recently after it published a summer reading guide featuring multiple books that didn’t exist, thanks to the list being produced by AI.
And last year, hundreds of people who lined the streets of Dublin were disappointed when it turned out the Halloween parade advertised on an events website had been invented.
Google was among the first of the tech giants to roll out AI summaries of search results, relying on a large language model – with some weird and possibly dangerous results.
Among them were suggestions to add glue to pizza and eat a rock a day to maintain health.
Dr Niusha Shafiabady, Australian Catholic University associate professor of computational intelligence* and director of the Women in AI for Social Good lab, said “expecting AI to be a magic wand” was a mistake.
“When AI models face countless interactions with the world, it is not possible to investigate and control every single problem that could happen. That is why things could get out of hand or out of control,” she said.
Dr Shafiabady said there were a few reasons for the problems facing AI.
“When dealing with highly complex problems, these types of complex AI models can’t give an accurate solution. One of the reasons why is the innate nature* of algorithms*,” she said.
“Models are built on mathematical computational iterative algorithms* that are coded into computers to be processed.
“When tasks get very complicated, these algorithms won’t necessarily follow the logical reasoning and will lose track of them.
“Sometimes when the problem gets harder, all the computing power and time in the world won’t enhance AI model’s performance. Sometimes when it hits very difficult tasks, it fails because it has learnt the example rather than the hidden patterns in the data.
“And sometimes the problem gets complicated, and a lot of computation resource* and time is wasted over-exploring the wrong solutions and there is not enough ‘energy’ left to reach the right solution.”
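Dr Shafiabady’s point about a model that “has learnt the example rather than the hidden patterns” is what researchers call overfitting. As a minimal sketch – a toy illustration only, far simpler than any real AI model and not taken from Apple’s report – the “memoriser” below gets a perfect score on the examples it has already seen but fails on anything new, while a model that learns the underlying rule keeps working:

```python
# Toy example of "learning the example rather than the hidden pattern".
# The hidden pattern here is simply whether a number is even or odd.

training_data = {2: "even", 7: "odd", 10: "even", 3: "odd"}

def memoriser(x):
    # Memorises the training examples; has no answer for unseen inputs.
    return training_data.get(x, "???")

def pattern_learner(x):
    # Learns the underlying rule instead of the individual examples.
    return "even" if x % 2 == 0 else "odd"

for x in [2, 7, 4, 9]:  # 4 and 9 were never seen during "training"
    print(x, "memoriser:", memoriser(x), "| pattern learner:", pattern_learner(x))
```

The memoriser answers “???” for 4 and 9 because it never truly learnt the pattern – the same kind of failure, on a much grander scale, that Dr Shafiabady describes in complex AI models.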
Many companies pushing AI insist the technology is rapidly improving, but a host of experts aren’t convinced.
Earlier this year, the Association for the Advancement of Artificial Intelligence surveyed two dozen AI specialists and 400 of the group’s members.
Sixty per cent of those surveyed didn’t believe problems with factuality* and trustworthiness “would soon be solved.”
Issues of accuracy and reliability are important, not just for growing public trust in AI, but for preventing unintended consequences* in the future, AAAI president Francesca Rossi wrote in a report about the survey.
“We all need to work together to advance AI in a responsible way, to make sure that technological progress supports the progress of humanity and is aligned to human values,” Ms Rossi said.
GLOSSARY
- large reasoning: large language models of AI that have been trained even further to solve tasks that require multiple steps of reasoning. They are said to perform better on mathematical or logical tasks than large language models and can backtrack in the process of coming up with an answer
- large language: AI systems that are capable of understanding and producing human language by processing huge amounts of text data
- reasoning effort: the amount of internal “thinking” the AI model does before producing an answer
- factually incorrect: containing information that is wrong or not true
- computational intelligence: concepts, algorithms and ways of designing systems so that they behave intelligently in complex and changing environments
- innate nature: built-in qualities
- algorithms: processes or sets of rules that a computer is programmed to follow
- computational iterative algorithms: problem-solving procedures that refine a process over and over again until a successful result is found
- computation resource: the computing power, memory and time a computer has available to complete a task
- factuality: how accurate and true to the facts something is
- unintended consequences: things that could happen as a result of AI that we don’t want to happen that could be bad for humanity
EXTRA READING
Teen rite of passage in AI’s path
Are AI sunnies clever or creepy?
AI imagines snow falling in Sydney
QUICK QUIZ
1. What are the two models of AI that were tested for accuracy?
2. How were they tested?
3. What’s an example of a large reasoning model of AI?
4. What are two AI fails that were mentioned in the article?
5. Why is addressing accuracy and reliability important, according to AAAI president Francesca Rossi?
CLASSROOM ACTIVITIES
1. Should we use AI?
“We should stop using AI!” Do you agree or disagree with this statement? Write paragraphs explaining your opinion.
Time: allow at least 30 minutes to complete this activity
Curriculum Links: English, Personal and Social Capability, Information Technology
2. Extension
Write a list of research guidelines or tips for kids to help them to make sure that any information that they find online is as accurate as possible.
Time: allow at least 30 minutes to complete this activity
Curriculum Links: English, Information Technology
VCOP ACTIVITY
Read this!
A headline on an article – or a title on your text – should capture the attention of the audience, telling them to read this now. So choosing the perfect words for a headline or title is very important.
Create three new headlines for the events that took place in this article. Remember, what you write and how you write it will set the pace for the whole text, so make sure it matches.
Read out your headlines to a partner and discuss what the article will be about based on the headline you created. Discuss the tone and mood you set in just your few, short words. Does it do the article justice? Will it capture the audience’s attention the way you hoped? Would you want to read more?
Consider how a headline or title is similar to using short, sharp sentences throughout your text. They can be just as important as complex ones. Go through the last text you wrote and highlight any short, sharp sentences that capture the audience’s attention.