Remember when teachers demanded that you "show your work" in school? Some new types of AI models promise to do exactly that, but new research suggests that the "work" they show can sometimes be misleading or disconnected from the actual process used to reach the answer.
New research from Anthropic—creator of the ChatGPT-like Claude AI assistant—examines simulated reasoning (SR) models like DeepSeek's R1, and its own Claude series. In a research paper posted last week, Anthropic's Alignment Science team demonstrated that these SR models frequently fail to disclose when they've used external help or taken shortcuts, despite features designed to show their "reasoning" process.
(It's worth noting that OpenAI's o1 and o3 series SR models were excluded from this study.)
To understand SR models, you need to understand CoT: a step-by-step text output showing the AI’s simulated reasoning as it solves a problem. CoT aims to mimic how a human might "think aloud" while solving a complex task. These outputs are valuable for both improving performance and offering transparency—but only if they faithfully reflect what influenced the model’s answer.
Having an AI model generate these steps has reportedly proven valuable not just for producing more accurate outputs for complex tasks but also for "AI safety" researchers monitoring the systems' internal operations. And ideally, this readout of "thoughts" should be both legible (understandable to humans) and faithful (accurately reflecting the model's actual reasoning process).
"In a perfect world, everything in the chain-of-thought would be both understandable to the reader, and it would be faithful—it would be a true description of exactly what the model was thinking as it reached its answer," writes Anthropic's research team. However, their experiments focusing on faithfulness suggest we're far from that ideal scenario.
In their experiments, Anthropic found that even when models like Claude 3.7 Sonnet received hints—such as metadata suggesting the right answer or code with built-in shortcuts—their CoT outputs often excluded mention of those hints, instead generating detailed but inaccurate rationales. This means the CoT did not reflect all the factors that actually influenced the model’s output.