OpenAI detects accidental chain-of-thought grading in models, finds no monitorability loss

OpenAI disclosed that several of its AI models, including GPT-5.4 Thinking and various GPT-5.4 iterations, experienced accidental chain-of-thought grading during reinforcement learning training. Internal analyses found no significant degradation in the models’ ability to show their work.

The incidents affected less than 3.8% of training samples in the most impacted models. A small fraction of the training process accidentally rewarded or penalized models based on their internal reasoning steps, rather than just their final outputs.

What actually happened

The accidental grading took limited forms. Some training runs rewarded trajectory usefulness, essentially giving models a thumbs-up for how helpful their reasoning paths looked. Others penalized unnecessary prompts within the chain of thought. In the most notable case, a grader that penalized CoT references to cheating fired on roughly 2% of samples.
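To make the failure mode concrete, here is a minimal, hypothetical sketch of the difference between grading only a model's final answer and accidentally grading its full transcript. All names (`split_transcript`, the `</think>` delimiter, `grade_answer`) are illustrative assumptions, not OpenAI's actual training code.

```python
# Hypothetical sketch of an RL reward function. The intended version grades
# only the final answer; the contaminated version lets the chain of thought
# leak into the reward signal. Names and delimiters are illustrative only.

def split_transcript(transcript: str) -> tuple[str, str]:
    """Split a transcript into (chain_of_thought, final_answer),
    assuming an illustrative '</think>' delimiter."""
    cot, _, answer = transcript.partition("</think>")
    return cot, answer

def reward_outputs_only(transcript: str, grade_answer) -> float:
    """Intended behavior: only the final answer influences the reward."""
    _, answer = split_transcript(transcript)
    return grade_answer(answer)

def reward_contaminated(transcript: str, grade_answer) -> float:
    """Accidental behavior: the grader sees the whole transcript,
    so reasoning steps themselves get rewarded or penalized."""
    return grade_answer(transcript)  # bug: CoT leaks into the reward
```

With a grader that penalizes mentions of cheating, a transcript whose CoT mentions cheating but whose answer is clean scores differently under the two functions, which is exactly the contamination the article describes.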

OpenAI’s internal team ran automated scans across all of its reinforcement learning runs to examine the impact on reasoning transparency. The models’ chains of thought could still be reliably traced, and the ability to detect potential misalignment remained functionally intact.
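A scan like the one described could, in principle, work by checking what each recorded grader was allowed to see. The sketch below assumes a simple data model (`GraderConfig`, an `input_scope` field) invented for illustration; it is not OpenAI's internal tooling.

```python
# Illustrative sketch: flag graders from RL runs whose input scope includes
# the full trajectory (and thus the chain of thought) rather than only the
# final output. The data model here is an assumption, not a real API.

from dataclasses import dataclass

@dataclass
class GraderConfig:
    run_id: str
    grader_name: str
    input_scope: str  # "final_output" or "full_trajectory"

def scan_for_cot_grading(configs: list[GraderConfig]) -> list[GraderConfig]:
    """Return graders that read the full trajectory and may grade CoT."""
    return [c for c in configs if c.input_scope != "final_output"]
```

The design choice is to audit grader configurations rather than model outputs: a grader that never receives CoT tokens cannot contaminate training with CoT-based rewards.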

The safety ecosystem responds

External organizations including METR, Apollo Research, and Redwood Research contributed insights to the findings. Redwood Research acknowledged that the minor incidents did not harm monitorability but flagged that chain-of-thought reasoning, as a safety measure, has inherent vulnerabilities.

Anthropic published a report in April 2026 examining similar dynamics in its own models. OpenAI has been escalating its detection measures since December 2025 to prevent future grading errors. The company has now implemented automated detection systems and internal safeguards specifically designed to catch CoT grading contamination before it can influence training at scale.

What this means for crypto and AI tokens

No immediate market reaction was observed in AI-related crypto assets following the announcement. AI models are increasingly embedded in blockchain applications including smart contract audits, decentralized AI agents, and automated trading systems, all of which rely on AI that reasons correctly and transparently.

The fact that monitorability remained intact is the key takeaway for anyone building or investing in AI-integrated crypto projects. It means the safety infrastructure around reasoning models is catching problems before they become systemic.

Disclosure: This article was edited by Editorial Team. For more information on how we create and review content, see our Editorial Policy.
