Apple's new paper analyzes the accuracy collapse issue of DeepSeek-R1.
Those who have used the DeepSeek-R1 model are likely familiar with its thought process before providing answers, which is one of the reasons why large reasoning models (LRMs) like DeepSeek-R1 are highly regarded.
However, a team of six researchers at Apple has called this into question. By having the models solve various puzzles, the team found that the accuracy of the cutting-edge large reasoning models DeepSeek-R1, o3-mini, and Claude-3.7-Sonnet-Thinking collapses completely once a certain complexity threshold is exceeded.
Figure | The related paper
Notably, Samy Bengio, Apple's senior director of machine learning research, is a co-author of the paper. Not only is he the younger brother of Turing Award winner Yoshua Bengio, but he was also one of the first members of the Google Brain team.
Figure | The six authors of the paper; second from the right is Samy Bengio (Source: file photo)
A user on X quipped that Apple had turned into Gary Marcus; in fact, Gary Marcus himself posted on LinkedIn endorsing Apple's paper. He wrote: "Apple's latest paper on the ability of large language models to 'reason' is quite impressive. In a weekend long read, I explain why (and explore one possible objection) to show why you shouldn't actually be that surprised."
In his weekend long read, Gary Marcus writes: "This new paper from Apple further corroborates my critical view: even though the latest so-called 'reasoning models' have iterated beyond o1, they still cannot achieve reliable out-of-distribution reasoning on classic problems like the Tower of Hanoi. For researchers hoping that 'reasoning ability' or 'inference-time compute' will get large language models back on track, escaping the repeated failures of pure scaling (which has never produced a breakthrough worthy of the name 'GPT-5'), this is undoubtedly bad news."
Figure | Gary Marcus's weekend long read published on his personal website
So, is this "bad news" or "good news"? Let's start with the details of this paper from Apple.
Models can execute up to 100 correct moves in the Tower of Hanoi, yet fail to give more than 5 correct moves in the river-crossing puzzle
In the study, the research team from Apple discovered three different reasoning patterns: in low-complexity tasks, standard large language models outperform large reasoning models; in medium-complexity tasks, large reasoning models perform better; and in high-complexity tasks, neither type of model is able to effectively complete the tasks.
As problems approach a critical level of complexity, the models' reasoning effort counterintuitively decreases, suggesting an inherent limit to how far the reasoning of large reasoning models can scale.
The research team stated that these insights challenge mainstream assumptions regarding the capabilities of large reasoning models and suggest that current methods may have fundamental obstacles in achieving generalizable reasoning.
It is worth noting that the research team observed the limitations of large reasoning models in performing precise calculations. For example, when provided with the solving algorithm for the mathematical puzzle Tower of Hanoi, their performance on this problem did not improve.
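For context, the explicit solution procedure for the Tower of Hanoi is the textbook recursive algorithm; the sketch below is a minimal Python rendering of it under that assumption, not the exact algorithm text used in the paper's prompts.

```python
def hanoi_moves(n, source="A", target="C", auxiliary="B", moves=None):
    """Enumerate the optimal move sequence for an n-disk Tower of Hanoi."""
    if moves is None:
        moves = []
    if n == 1:
        moves.append((source, target))                    # move the smallest disk directly
        return moves
    hanoi_moves(n - 1, source, auxiliary, target, moves)  # park n-1 disks on the spare peg
    moves.append((source, target))                        # move the largest disk
    hanoi_moves(n - 1, auxiliary, target, source, moves)  # bring the n-1 disks back on top
    return moves

# The optimal solution takes 2^n - 1 moves, so difficulty grows exponentially with n.
print(len(hanoi_moves(7)))  # 127
```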
In addition, an in-depth analysis of the models' first error steps revealed surprising behavioral patterns. For example, a model could perform up to 100 correct moves in the Tower of Hanoi, yet could not provide more than 5 correct moves in the river-crossing logic puzzle.
Overall, the research team believes that this paper both highlights the strengths of existing large reasoning models and reveals their limitations. The main conclusions are the following five points:
First, the research team questions the current paradigm of evaluating large reasoning models on established mathematical benchmarks, and designed a controllable experimental testbed using algorithmic puzzle environments.
Second, the team's experiments show that even the most advanced large reasoning models (such as o3-mini, DeepSeek-R1, and Claude-3.7-Sonnet-Thinking) still fail to develop generalizable problem-solving abilities: across different environments, once problem complexity exceeds a certain threshold, their accuracy ultimately drops to zero.
Third, the team found that large reasoning models have a scaling limit in their reasoning capability tied to problem complexity, confirmed by the counterintuitive decline in the number of thinking tokens beyond a certain complexity point.
Fourth, the team questions the current evaluation paradigm based on final accuracy alone; their analysis shows that as problem complexity increases, correct solutions surface later in the reasoning process than incorrect ones.
Fifth, the team reveals surprising limitations of large reasoning models in performing exact computation, including their failure to benefit from explicit algorithms and their inconsistent reasoning across different types of puzzles.
The self-correction ability of large reasoning models is limited.
It is understood that large reasoning models are variants of large language models that have been specifically optimized for reasoning tasks.
These models belong to a new type of technological product, characterized by a unique "thinking" mechanism, such as the chain of thought (CoT) with self-reflective abilities, and have demonstrated outstanding performance in multiple reasoning benchmark tests.
The emergence of these models signifies a possible paradigm shift in the way large language models handle complex reasoning and problem-solving. Some researchers believe this represents a significant step towards more general artificial intelligence capabilities.
Despite these viewpoints and the performance gains, the fundamental strengths and limitations of large reasoning models are still not well understood. A key question remains unanswered: do these models possess generalizable reasoning abilities, or are they merely exploiting different forms of pattern matching?
How does their performance change as problem complexity increases? Given the same inference-time token budget, how do they compare with standard large language models that lack the "thinking" mechanism?
What are the inherent limitations of the current reasoning methods? What improvements might be needed to achieve more powerful reasoning capabilities?
The research team believes that the limitations of the current evaluation paradigm prevent these questions from being analyzed systematically. Existing evaluations focus mainly on established mathematical and coding benchmarks. While these benchmarks have value, they often suffer from data contamination and do not allow controlled experimental conditions across different settings and complexity levels.
In order to more rigorously understand the reasoning behavior of these models, the research team believes that an environment capable of conducting controlled experiments is needed.
To this end, instead of adopting standard benchmarks such as math problems, they used controllable puzzle environments: by adjusting puzzle elements while preserving the core logic, complexity can be varied systematically, and both the solution process and the internal reasoning can be examined.
These puzzles have the following characteristics (a minimal simulator sketch follows this list):
(1) they provide fine-grained control over complexity;
(2) they avoid the contamination that is common in existing benchmarks;
(3) they rely only on clearly defined rules, emphasizing algorithmic reasoning ability;
(4) they support rigorous, simulator-based evaluation, enabling precise solution verification and detailed failure analysis.
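As an illustration of these four properties, the sketch below shows what a controllable Tower of Hanoi environment with simulator-based verification might look like; the class and function names are illustrative assumptions, not the authors' actual code.

```python
class HanoiEnv:
    """Illustrative puzzle environment: complexity is controlled by the disk count n,
    and a simulator checks every proposed move against clearly defined rules."""

    def __init__(self, n):
        self.n = n  # single knob controlling problem complexity
        self.pegs = {"A": list(range(n, 0, -1)), "B": [], "C": []}

    def step(self, move):
        """Apply one (source, target) move; return False if it breaks the rules."""
        src, dst = move
        if not self.pegs[src]:
            return False  # cannot move from an empty peg
        disk = self.pegs[src][-1]
        if self.pegs[dst] and self.pegs[dst][-1] < disk:
            return False  # cannot place a larger disk on a smaller one
        self.pegs[dst].append(self.pegs[src].pop())
        return True

    def solved(self):
        return len(self.pegs["C"]) == self.n


def verify(n, moves):
    """Precise solution verification: replay a model's move list and locate the first error."""
    env = HanoiEnv(n)
    for i, move in enumerate(moves):
        if not env.step(move):
            return False, i  # index of the first invalid move, useful for failure analysis
    return env.solved(), None
```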
Through empirical research, they revealed several key findings about current large reasoning models:
First, although large reasoning models can learn complex self-reflection mechanisms through reinforcement learning, they fail to develop generalized problem-solving abilities for planning tasks, and their performance drops to zero beyond a certain complexity threshold.
Second, by comparing large reasoning models with standard large language models under equivalent inference compute, the research team identified three different reasoning mechanisms.
The first mechanism: for simpler, less compositional problems, standard large language models are more efficient and accurate.
The second mechanism: as problem complexity increases moderately, large reasoning models gain the advantage.
The third mechanism: when problems become complex as compositional depth grows, both types of models suffer a complete performance collapse.
It is worth noting that as models approach this failure threshold, they begin to reduce their reasoning effort (measured by the number of inference tokens) as problem complexity rises, even though they are still far from the generation-length limit.
This points to a fundamental limitation in the reasoning capability of large reasoning models: their reasoning effort scales with problem complexity only up to a point, beyond which it declines rather than continuing to grow.
In addition, by analyzing the intermediate reasoning trajectories, the research team found a regular pattern tied to problem complexity: on simpler problems, reasoning models often find the correct solution quickly but then inefficiently continue to explore incorrect alternatives, the phenomenon commonly called "overthinking."
On problems of medium complexity, the model arrives at the correct solution only after extensive exploration of many erroneous paths. Beyond a certain complexity threshold, the model fails to find any correct solution at all.
Bai Ting, an associate professor at Beijing University of Posts and Telecommunications, told DeepTech that, much like human thinking, a model facing a complex problem often does not know what the right answer is, but it frequently knows what is wrong. Specifically, this is related to the size of the solution space: for simple problems the solution space is small and feature matching is strong, so the correct solution naturally tends to sit near the front of the thinking path; for complex problems, the coupling of multi-dimensional variables and the nesting of logical levels expand the solution space exponentially, which objectively shows up as the correct solution appearing relatively late in the thinking sequence.
What happens inside a reasoning model's "thinking"?
In the research, most experiments are conducted on reasoning models and their corresponding non-reasoning models, such as Claude 3.7 Sonnet (with reasoning/without reasoning) and DeepSeek-R1/V3. The research team chose these models because, unlike models such as OpenAI's o series, they allow access to thinking tokens.
For each puzzle instance, the research team generated 25 samples and reported the average performance of each model.
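A rough sketch of how such per-instance averaging might be computed is shown below; the `model.generate` call, `parse_moves` helper, and `instance` fields are hypothetical stand-ins, and `verify` refers to the simulator check sketched earlier.

```python
# Hypothetical evaluation loop: average accuracy over 25 sampled generations per instance.
NUM_SAMPLES = 25

def accuracy_for_instance(model, instance, num_samples=NUM_SAMPLES):
    correct = 0
    for _ in range(num_samples):
        answer = model.generate(instance.prompt)          # assumed model API
        ok, _ = verify(instance.n, parse_moves(answer))   # simulator check sketched earlier
        correct += int(ok)
    return correct / num_samples                          # reported per-model average
```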
To gain a deeper understanding of the reasoning process of the models, the research team conducted a detailed analysis of their reasoning traces.
During this process, by constructing the puzzle experimental environments they were able to go beyond the models' final answers and observe and analyze the generated reasoning trajectories (i.e., the "thinking process") more precisely.
Specifically, they extracted and analyzed the intermediate solutions explored during the model's thinking process using a puzzle simulator.
Subsequently, they examined the patterns and characteristics of these intermediate solutions, their correctness at each sequential position in the reasoning process, and how these patterns evolve with increasing problem complexity.
In this analysis, the research team focused on the reasoning traces produced by the Claude 3.7 Sonnet reasoning model in the puzzle group experiment.
For each intermediate solution identified in the trace, the research team recorded the following: (1) its relative position in the reasoning trajectory (normalized by total thought length), (2) its correctness verified by the research team's puzzle simulator, and (3) the complexity of the corresponding problem.
This enables the research team to describe the progress and accuracy of solution formation throughout the reasoning process.
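The sketch below illustrates this kind of bookkeeping over a reasoning trace; the extraction of candidate solutions and the `check_solution` verifier stand in for the paper's trace parsing and puzzle simulator, and are assumptions rather than the authors' code.

```python
def annotate_trace(trace_tokens, candidates, complexity, check_solution):
    """For each candidate intermediate solution (token_index, solution) found in a trace,
    record its normalized position, simulator-verified correctness, and problem complexity."""
    total = len(trace_tokens)
    records = []
    for token_index, solution in candidates:
        records.append({
            "position": token_index / total,      # relative position in the thought
            "correct": check_solution(solution),  # verified by the puzzle simulator
            "complexity": complexity,             # e.g. number of disks N
        })
    return records
```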
The research team found that for simpler problems, reasoning models tend to find the correct solution early in the thinking process, but then continue to explore incorrect solutions.
Compared with correct solutions (green), the distribution of incorrect solutions (red) is clearly shifted toward the end of the chain of thought. As problem complexity increases moderately, this trend reverses: the model first explores incorrect solutions and mostly arrives at the correct solution only later in the thinking process; this time, the distribution of incorrect solutions (red) is shifted toward the beginning relative to correct solutions (green).
Finally, on more complex problems, collapse sets in: the model cannot generate any correct solution at any point in its thinking process.
The figure below presents a supplementary analysis of the accuracy of solutions within segments (intervals) of the thought sequence in the Tower of Hanoi environment.
It can be observed that for simpler problems (smaller N values), as the thinking progresses, the accuracy of the solutions often declines or fluctuates, providing further evidence for the phenomenon of overthinking.
However, for more complex problems this trend changes: solution accuracy increases as thinking progresses, up to a certain complexity threshold. Beyond that threshold, in the "collapse mode," the model's accuracy is zero.
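A minimal sketch of such a segment-wise (binned) accuracy computation is given below, reusing the record format from the earlier trace-annotation sketch; the bin count is an arbitrary choice, not the paper's.

```python
def binned_accuracy(records, num_bins=10):
    """Group intermediate solutions by their position in the thought and compute
    the fraction that are correct within each segment (None if a bin is empty)."""
    bins = [[] for _ in range(num_bins)]
    for r in records:
        idx = min(int(r["position"] * num_bins), num_bins - 1)
        bins[idx].append(r["correct"])
    return [sum(b) / len(b) if b else None for b in bins]
```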
Bai Ting told DeepTech that the model needs multiple reasoning attempts on complex problems. In the absence of a correct solution, its reasoning mechanism may adopt an efficiency-oriented optimization strategy across repeated iterations, possibly as a resource-protection measure against excessive iteration. The findings of this paper therefore need to be analyzed and verified in detail at the level of the models' implementation.
Bai Ting pointed out that it is also possible that the reasoning of large models is essentially the retrieval of memorized patterns. For models such as DeepSeek-R1 and o3-mini, performance depends heavily on how well the training data covers those patterns; when problem complexity exceeds the coverage of the memorized patterns (as in the controllable puzzle environments designed by the Apple research team), the model falls into a state of "zero accuracy."
Although the puzzle environments allow fine-grained control over problem complexity in controlled experiments, they represent only a narrow slice of reasoning tasks and may not capture the diversity of real-world or knowledge-intensive reasoning problems.
It should be noted that this research relies primarily on black-box API access to closed frontier large reasoning models, a limitation that prevented the research team from analyzing their internal states or architectural components.
In addition, in using a deterministic puzzle simulator, the research team assumed that reasoning can be perfectly verified step by step. In less structured domains, however, such precise verification is difficult to achieve, which limits the transfer of this analysis method to a broader range of reasoning scenarios.
Overall, the research team examined frontier large reasoning models through controllable puzzle environments from the perspective of problem complexity. The results reveal a limitation of current models: despite their sophisticated self-reflection mechanisms, these models are unable to develop generalizable reasoning skills beyond a certain complexity threshold. The research team believes this result may pave the way for future study of the reasoning capabilities of these models.