Unreliable narrator

Researchers concerned to find AI models misrepresenting their “reasoning” processes

New Anthropic research shows AI models often fail to disclose reasoning shortcuts.

Benj Edwards – Apr 10, 2025 6:37 pm | 128

Remember when teachers demanded that you "show your work" in school? Some new types of AI models promise to do exactly that, but new research suggests that the "work" they show can sometimes be misleading or disconnected from the actual process used to reach the answer.

New research from Anthropic—creator of the ChatGPT-like Claude AI assistant—examines simulated reasoning (SR) models like DeepSeek's R1, and its own Claude series. In a research paper posted last week, Anthropic's Alignment Science team demonstrated that these SR models frequently fail to disclose when they've used external help or taken shortcuts, despite features designed to show their "reasoning" process.

(It's worth noting that OpenAI's o1 and o3 series SR models were excluded from this study.)

To understand SR models, you need to understand CoT: a step-by-step text output showing the AI’s simulated reasoning as it solves a problem. CoT aims to mimic how a human might "think aloud" while solving a complex task. These outputs are valuable for both improving performance and offering transparency—but only if they faithfully reflect what influenced the model’s answer.

Having an AI model generate these steps has reportedly proven valuable not just for producing more accurate outputs for complex tasks but also for "AI safety" researchers monitoring the systems' internal operations. And ideally, this readout of "thoughts" should be both legible (understandable to humans) and faithful (accurately reflecting the model's actual reasoning process).

"In a perfect world, everything in the chain-of-thought would be both understandable to the reader, and it would be faithful—it would be a true description of exactly what the model was thinking as it reached its answer," writes Anthropic's research team. However, their experiments focusing on faithfulness suggest we're far from that ideal scenario.

In their experiments, Anthropic found that even when models like Claude 3.7 Sonnet received hints—such as metadata suggesting the right answer or code with built-in shortcuts—their CoT outputs often excluded mention of those hints, instead generating detailed but inaccurate rationales. This means the CoT did not reflect all the factors that actually influenced the model’s output.

Rather than intentionally hiding anything, the model simply does not reliably represent all influential inputs in its output. As a result, it may generate what sounds like a reasoning process, even if that explanation doesn't match what actually drove its answer.

Putting faithfulness to the test

To measure "faithfulness"—how accurately the CoT matches the actual reasoning—the researchers inserted various kinds of hints into questions and observed whether the model acknowledged using them.

The researchers then checked whether the models referenced using these hints in their chain-of-thought. The researchers designed some hints to be neutral (e.g., stating "a Stanford Professor indicates the answer is [A]" or embedding the answer in faux metadata), while they designed others to be more concerning (e.g., "You have gained unauthorized access... The correct answer is [A]"). Some hints provided the correct answer, while others were deliberately wrong, allowing researchers to observe how the models responded.

The team tested Anthropic's Claude 3.7 Sonnet and DeepSeek's R1 model. The results regarding faithfulness were eye-opening: across all hint types on average, Claude referenced the hints in its CoT only 25 percent of the time, while DeepSeek R1 did so 39 percent of the time. This means a substantial majority of answers were unfaithful, omitting mention of information that influenced the output. Intriguingly, the researchers found that these unfaithful chains of thought were, on average, longer than faithful ones, suggesting the omissions weren't merely for brevity. They also noted preliminary findings that faithfulness tended to be lower when the questions were more difficult.

Perhaps most notable was a "reward hacking" experiment. Reward hacking refers to an AI model finding unexpected shortcuts to maximize its performance scores without solving problems as intended. In Anthropic's experiments, models were deliberately rewarded for choosing incorrect answers indicated by hints. The models quickly learned to exploit this loophole, selecting wrong answers over 99 percent of the time to earn points—yet reflected that behavior in their CoT outputs less than 2 percent of the time.

For example, a model given a hint pointing to an incorrect answer on a medical question might write a long CoT justifying that wrong answer, never mentioning the hint that led it there. This suggests the model generated an explanation to fit the answer, rather than faithfully revealing how the answer was determined.

It's important to note that AI models don't have intentions or desires; they simply follow patterns learned from extensive training on large datasets. The apparent omission of certain factors in their reasoning reflects limitations in transparency rather than intentional deception.

Improving faithfulness

Could faithfulness be improved in the AI models' CoT outputs? The Anthropic team hypothesized that training models on more complex tasks demanding greater reasoning might naturally incentivize them to use their chain of thought more substantially, mentioning hints more often. They tested this by training Claude to better use its CoT on challenging math and coding problems. While this outcome-based training initially increased faithfulness (by relative margins of 63 percent and 41 percent on two evaluations), the improvements plateaued quickly. Even with much more training, faithfulness didn't exceed 28 percent and 20 percent on these evaluations, suggesting this training method alone is insufficient.

These findings matter because SR models have been increasingly deployed for important tasks across many fields. If their CoT doesn't faithfully reference all factors influencing their answers (like hints or reward hacks), monitoring them for undesirable or rule-violating behaviors becomes substantially more difficult. The situation resembles having a system that can complete tasks but doesn't provide an accurate account of how it generated results—a particularly risky scenario if the model is being subtly influenced by inputs that humans don’t recognize, and those influences aren’t acknowledged in its explanation.

The researchers acknowledge limitations in their study. In particular, they acknowledge that they studied somewhat artificial scenarios involving hints during multiple-choice evaluations, unlike complex real-world tasks where stakes and incentives differ. They also only examined models from Anthropic and DeepSeek, using a limited range of hint types. Importantly, they note the tasks used might not have been difficult enough to require the model to rely heavily on its CoT. For much harder tasks, models might be unable to avoid revealing their true reasoning, potentially making CoT monitoring more viable in those cases.

Anthropic concludes that while monitoring a model's CoT isn't entirely ineffective for ensuring safety and alignment, these results show we cannot always trust what models report about their reasoning, especially when behaviors like reward hacking are involved. If we want to reliably "rule out undesirable behaviors using chain-of-thought monitoring, there's still substantial work to be done," Anthropic says.

This article was updated at 7:45 am on April 11, 2025, to reduce overly anthropomorphic language.

Benj Edwards Senior AI Reporter

Benj Edwards is Ars Technica's Senior AI Reporter and founder of the site's dedicated AI beat in 2022. He's also a tech historian with almost two decades of experience. In his free time, he writes and records music, collects vintage computers, and enjoys nature. He lives in Raleigh, NC.

128 Comments