This One Weird Trick Defeats AI Safety Features in 99% of Cases

Researchers from Anthropic, Stanford, and Oxford have found that extended reasoning in AI models, contrary to assumptions, makes them easier to jailbreak. A method termed 'Chain-of-Thought Hijacking' allows attackers to bypass safety guardrails by padding harmful instructions with benign reasoning tasks. The technique achieves high success rates in eliciting prohibited content such as weapon-making instructions or malware code. The vulnerability lies in the architecture of AI models such as OpenAI's GPT, Anthropic's Claude, Google's Gemini, and xAI's Grok. The researchers propose 'reasoning-aware monitoring' as a countermeasure, but implementing it is technically complex and resource-intensive. Major AI companies have acknowledged the vulnerability and are evaluating mitigations.

Source: decrypt.co

Introduction: Unexpected Findings in AI Safeguards

AI researchers from Anthropic, Stanford, and Oxford have discovered that making AI models think longer makes them easier to jailbreak—the opposite of what everyone assumed. The prevailing assumption was that extended reasoning would make AI models safer, as it would give models more time to detect and refuse harmful requests. Instead, this research found that extended reasoning creates a reliable technique to bypass safety filters entirely.

The Jailbreak Technique and Its Implications

Using the Chain-of-Thought Hijacking technique, an attacker can embed a harmful request inside an AI model's reasoning process, forcing it to generate outputs such as instructions for creating weapons, malware code, or other prohibited content. This happens despite the millions of dollars AI companies spend on building safety guardrails aimed at preventing exactly such outputs.

The Effectiveness of Chain-of-Thought Hijacking

The study revealed astonishing attack success rates:

  • 99% on Gemini 2.5 Pro
  • 94% on GPT o4 mini
  • 100% on Grok 3 mini
  • 94% on Claude 4 Sonnet

These results outperformed all prior jailbreak methods tested on large reasoning models and demonstrate how extended reasoning can weaken safety mechanisms.

How the Attack Works

The attack works much like the “Whisper Down the Lane” game (also known as “Telephone”): a malicious request is buried in long sequences of benign puzzles. The researchers tested Sudoku grids, logic puzzles, and abstract math problems, adding a final-answer cue at the end. This structure effectively causes the model’s safety guardrails to break down.

The researchers explain the problem as follows: as the AI processes the long reasoning chain, its attention gets diffused across thousands of harmless tokens, leaving the harmful instruction near the end with almost no attention. This dramatically weakens the model's safety checks.
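
This attention-dilution effect can be illustrated with a rough measurement: feed a model progressively longer stretches of benign padding followed by a short final request, and check how much of the last token's attention still lands on that request. The sketch below uses GPT-2 from Hugging Face transformers as a small stand-in model and a harmless request; the model, prompt, and measurement are illustrative assumptions, not the paper's experimental setup.

```python
# Sketch: watch the attention share on a short final request shrink as
# benign padding grows. GPT-2 is a stand-in model; the prompt and the
# measurement are illustrative, not the paper's protocol.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

request = "Summarize the request at the very end of this prompt."
padding_unit = "Solve this puzzle: 2, 4, 8, 16, what comes next? "

for n_pad in (0, 10, 40):
    prompt = padding_unit * n_pad + request
    ids = tok(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, output_attentions=True)
    # Attention paid by the final token, averaged over heads, last layer.
    attn = out.attentions[-1][0].mean(dim=0)[-1]            # shape: (seq_len,)
    req_len = tok(request, return_tensors="pt").input_ids.shape[1]  # approximate
    share = attn[-req_len:].sum().item()                    # mass on request tokens
    print(f"padding blocks={n_pad:3d}  seq_len={ids.shape[1]:5d}  "
          f"attention share on request={share:.3f}")
```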

Experimental Evidence and Vulnerabilities Across AI Models

Controlled experiments on the S1 model showed attack success rates climbing with reasoning length:

  • Minimal reasoning: 27% success
  • Natural reasoning: 51% success
  • Extended step-by-step reasoning: 80% success

Every major commercial AI model—OpenAI’s GPT, Anthropic’s Claude, Google’s Gemini, and xAI’s Grok—was found to be vulnerable, as the flaw originates from the architecture itself rather than specific implementations.

The Role of Layers in AI Safety

The models' safety-checking strength resides in the middle layers (around layer 25), while later layers encode the outcome of that verification. Extended reasoning suppresses both signals, shifting attention away from harmful instructions. When the researchers surgically removed the specific attention heads responsible for safety checks (concentrated in layers 15–35), refusal behavior collapsed and the models could no longer detect harmful instructions. This shows how the architecture of AI models encodes reasoning capability and safety monitoring in tightly interconnected ways.
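
The kind of intervention described here, disabling a band of attention heads in the middle layers and observing how behavior changes, can be sketched with Hugging Face's head-pruning utility. GPT-2, the scaled-down layer range, the head indices, and the prompt below are stand-ins chosen for illustration; the paper's models and exact ablation protocol are not reproduced here.

```python
# Sketch: remove a block of attention heads in a model's middle layers
# and compare generations before and after. GPT-2 (12 layers) stands in
# for the larger models in the study, so the 15-35 layer band is scaled
# down to layers 4-8; the head indices are arbitrary.
import copy
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
base = AutoModelForCausalLM.from_pretrained("gpt2")

def continue_text(model, prompt, max_new_tokens=30):
    """Greedy continuation of a prompt, returning only the new text."""
    ids = tok(prompt, return_tensors="pt").input_ids
    out = model.generate(ids, max_new_tokens=max_new_tokens,
                         do_sample=False, pad_token_id=tok.eos_token_id)
    return tok.decode(out[0][ids.shape[1]:], skip_special_tokens=True)

# Prune the first four heads in each "middle" layer of a copied model.
ablated = copy.deepcopy(base)
ablated.prune_heads({layer: [0, 1, 2, 3] for layer in range(4, 9)})

prompt = "The assistant refused the request because"
print("baseline:", continue_text(base, prompt))
print("ablated :", continue_text(ablated, prompt))
```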

Implications for AI Development and the Need for Solutions

This discovery challenges the core assumption of recent AI advancements—that increasing inference-time reasoning would naturally improve safety. Instead, scaling reasoning steps exposes models to new vulnerabilities. A related attack, H-CoT, previously demonstrated similar weaknesses by manipulating the model’s own reasoning process, achieving a 98% reduction in refusal rates on OpenAI’s o1 model.

The researchers proposed a defense: reasoning-aware monitoring. This approach tracks how safety signals change across each reasoning step and flags steps that weaken them, keeping the model's attention on potentially harmful content. However, implementation requires monitoring internal activations across dozens of layers in real time, making it technically complex and computationally expensive.
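
A monitor of this kind would watch a safety-related signal while the reasoning chain grows and flag steps where it decays. The loop below is a minimal sketch of that idea, assuming a trained linear probe over a mid-layer hidden state; the probe, the stand-in GPT-2 model, the layer choice, and the threshold are all hypothetical placeholders, not the authors' implementation.

```python
# Sketch of a reasoning-aware monitor: after each reasoning step, probe
# a mid-layer hidden state and flag the step if the safety signal drops.
# The untrained linear probe stands in for one trained to detect a
# harmful request; GPT-2 is a stand-in model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
probe = torch.nn.Linear(model.config.hidden_size, 1)  # placeholder for a trained probe
THRESHOLD = 0.5                                        # hypothetical alarm threshold

def safety_score(context: str) -> float:
    """Score the final token's hidden state at a middle layer."""
    ids = tok(context, return_tensors="pt").input_ids
    with torch.no_grad():
        hidden_states = model(ids, output_hidden_states=True).hidden_states
    mid = hidden_states[len(hidden_states) // 2][0, -1]  # middle layer, last token
    return torch.sigmoid(probe(mid)).item()

def monitor(reasoning_steps):
    """Walk the growing reasoning chain and flag steps where the signal decays."""
    context = ""
    for i, step in enumerate(reasoning_steps):
        context += step + "\n"
        score = safety_score(context)
        status = "FLAG" if score < THRESHOLD else "ok"
        print(f"step {i}: safety signal {score:.2f} [{status}]")
```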

Disclosure to AI Companies and Future Steps

The researchers disclosed the vulnerability to major AI developers, including OpenAI, Anthropic, Google DeepMind, and xAI, before publishing. According to their ethics statement, all groups acknowledged receipt, and several are actively evaluating mitigations. Still, the implementation of robust defenses remains uncertain and demands substantial collaboration across the AI research community.
