
When the Agent Harness Fails — and the Standard Controls Miss It
On 23 April 2026, Anthropic published an engineering postmortem tracing six weeks of quality degradation in Claude Code to three separate product-layer changes. All three were resolved by 20 April, with the final fix shipping in version 2.1.116, and Anthropic reset usage limits for all subscribers on 23 April.
The document matters to every enterprise running AI-assisted development — not because a frontier vendor made mistakes, but because it records precisely how AI code quality regressions can survive every standard quality gate an organisation deploys. Internal code review passed. Unit tests passed. End-to-end tests passed. Automated verification passed. Internal dogfooding passed. The regressions still reached production users for six weeks. That failure pathway is the central finding, and it is the argument at the heart of AI Code Governance as a discipline.
Three Regressions, One Compound Effect
The three regressions were distinct in cause, timing, and affected surface area. Taken together they resembled broad, inconsistent degradation — each change affected a different slice of traffic on a different schedule, making the aggregate effect difficult to separate from normal variation in user feedback.
On 4 March 2026, Anthropic changed Claude Code's default reasoning effort from high to medium. The intent was to reduce latency: Opus 4.6 in high-effort mode occasionally made the Claude Code interface appear frozen. Internal evaluation indicated medium effort delivered slightly lower intelligence with significantly less latency on most tasks. Users reported immediately that Claude Code felt less capable. Anthropic described the decision in its postmortem as the wrong tradeoff and reverted it on 7 April. All models now default to high or xhigh effort.
On 26 March, a caching optimisation shipped that was intended to clear older reasoning history from sessions idle for more than an hour. A bug caused the pruning logic to fire on every subsequent turn for the rest of each session, rather than once. Claude progressively lost memory of its own reasoning: it appeared forgetful, repeated steps, and made inconsistent tool choices. Each pruned request also produced a cache miss, which Anthropic identifies as the likely driver of separate reports about usage limits draining faster than expected. The fix shipped 10 April in v2.1.101.
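The postmortem does not publish the code, but the failure pattern is a familiar one. The sketch below is a hypothetical reconstruction in Python, not Anthropic's implementation, and every name in it is an assumption: it shows how a one-shot prune that latches a flag without clearing it degrades into pruning on every turn, with each rewritten context also missing the prompt cache.

```python
import time
from dataclasses import dataclass, field

IDLE_THRESHOLD_SECS = 3600  # intended: clear older reasoning after an hour idle

@dataclass
class Session:
    reasoning_history: list[str] = field(default_factory=list)
    last_active: float = field(default_factory=time.monotonic)
    needs_prune: bool = False  # latched when the idle threshold is crossed

def prune_reasoning_history(session: Session) -> None:
    # Keep only the newer half of the stored reasoning turns.
    keep = len(session.reasoning_history) // 2
    session.reasoning_history = session.reasoning_history[-keep:] if keep else []

def handle_turn(session: Session, user_turn: str) -> None:
    now = time.monotonic()
    if now - session.last_active > IDLE_THRESHOLD_SECS:
        session.needs_prune = True  # meant to trigger a single prune on resume

    if session.needs_prune:
        prune_reasoning_history(session)
        # BUG: the flag is never cleared, so pruning repeats on every later
        # turn, progressively erasing the session's reasoning; each rewritten
        # context is also a prompt-cache miss, draining usage limits faster.
        # FIX: session.needs_prune = False after the one-time prune.

    session.last_active = now
    session.reasoning_history.append(user_turn)
```

A test that drives a single resume turn passes; only a long multi-turn session that crosses the idle window exposes the repetition, which is consistent with how the bug survived unit and end-to-end suites.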
On 16 April, a system prompt instruction was added to reduce the verbosity of Opus 4.7, capping text between tool calls at 25 words and final responses at 100 words. Multi-week internal testing showed no regression. A broader ablation study run during the investigation revealed a 3% drop in coding quality evaluations for both Opus 4.6 and Opus 4.7. The instruction was reverted on 20 April.
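The detection method is worth noting: the ablation isolates one variable, the instruction itself, and compares scores on identical tasks. A minimal harness of that shape might look like the sketch below; the grading callback and task list are placeholders, not Anthropic's evaluation stack, and the instruction text paraphrases the word caps described above.

```python
from statistics import mean
from typing import Callable

# Paraphrase of the verbosity caps described above; hypothetical wording.
CAP_INSTRUCTION = (
    "Keep text between tool calls under 25 words; "
    "keep final responses under 100 words."
)

def eval_score(system_prompt: str, tasks: list[str],
               grade: Callable[[str, str], float]) -> float:
    # `grade(system_prompt, task)` runs one task against the model under test
    # and returns a quality score in [0, 1]; it stands in for a real harness.
    return mean(grade(system_prompt, task) for task in tasks)

def ablate_instruction(base_prompt: str, tasks: list[str],
                       grade: Callable[[str, str], float]) -> float:
    # Score the same suite with and without the suspect instruction; a
    # positive delta means the instruction degrades quality (here, 3%).
    with_cap = eval_score(base_prompt + "\n" + CAP_INSTRUCTION, tasks, grade)
    without_cap = eval_score(base_prompt, tasks, grade)
    return without_cap - with_cap
```

Weeks of ordinary usage could not isolate the cause because everything else varied at the same time; a controlled comparison over the same task set could.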
Why the Regressions Escaped Detection
Anthropic was explicit about the caching bug: the change made it past multiple human and automated code reviews, as well as unit tests, end-to-end tests, automated verification, and internal dogfooding. A separate internal experiment that changed how thinking was displayed suppressed the bug in most CLI sessions Anthropic's own engineers used, so the team that built the product could not reproduce what production users were experiencing.
This is a structural problem, not a competence problem. When an AI coding agent behaves differently across account types, session states, and infrastructure configurations, no internal testing regime — however thorough — closes every failure mode. The vendor's environment shares too many assumptions with its own product layer. An independent governance platform, operating outside that environment and evaluating the artefacts users actually receive, closes this gap by design.
The scale makes this urgent. According to analysis by SemiAnalysis, Claude Code now authors approximately 4% of all public GitHub commits. A six-week quality regression at the agent harness level touches millions of commits in real codebases, many of which will have passed standard review and entered production before the vendor acknowledged the problem.
The Opus 4.7 Back-Test: A Finding About Context and Review Quality
As part of its investigation, Anthropic back-tested Code Review against the offending pull requests using Opus 4.7. When provided with the code repositories necessary for complete context, Opus 4.7 found the bug. Opus 4.6 did not.
The same class of tool, AI-assisted code review, produced materially different results depending on model generation and on the completeness of the context it was given: the same pull requests were cleared by one configuration and caught by the other.
The governance implication is direct: AI code review must operate with full codebase context, not diff-level context alone. Quality Gates that process only the changed lines cannot surface cross-file and cross-module defects of the kind this caching bug represented. Context completeness is a governance requirement, not a convenience feature.
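A contrived two-file example, not taken from the postmortem, makes the limitation concrete: the pull request touches only the rate function, and in isolation the hunk is internally consistent, so a diff-scoped reviewer approves it.

```python
# rates.py: the only file in the pull request, and all a diff-level gate sees.
RATES_IN_CENTS = {"basic": 900, "pro": 2900}

def monthly_rate(plan: str) -> int:
    # Changed in this PR: now returns cents; previously returned dollars.
    return RATES_IN_CENTS[plan]

# invoice.py: unchanged, outside the diff, visible only with full repo context.
def invoice_total(plan: str, months: int) -> int:
    # Still written against the old dollar contract; a reviewer with
    # repository-wide context can see this caller and flag the 100x error.
    return monthly_rate(plan) * months

if __name__ == "__main__":
    print(invoice_total("pro", 12))  # prints 34800: cents read as dollars
```

Nothing in the changed lines is wrong on its own; the defect exists only in the interaction with an unchanged caller, which is why repository-level context is a requirement rather than an optimisation.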
What This Means for Regulated Enterprises
The postmortem confirms something enterprises should state explicitly in their AI development policies: the quality controls embedded inside an AI coding agent are not a substitute for independent governance. Anthropic's engineers used Claude Code daily. Their internal usage did not reproduce the issues users experienced. The regression ran in production for six weeks across Claude Code, the Claude Agent SDK, and Claude Cowork simultaneously.
For enterprises regulated under DORA, the FCA's operational resilience rules, or SOC 2, this independence is not a preference. Change management controls require that code quality validation be demonstrably independent of the production tool that introduced the change. A governance record that points entirely to the vendor's own internal testing does not satisfy that requirement. The postmortem illustrates precisely why.
Enterprises using Claude Code to author changes in ServiceNow, Salesforce, or any other regulated platform carry the same obligation. Agent-generated changes to production platform configurations must pass quality validation that operates outside the agent's own product layer, before deployment.
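In pipeline terms, the requirement reduces to a blocking step that runs outside the agent's process and fails closed. A minimal sketch of that shape follows; the function names and validator interface are assumptions for illustration, not any vendor's API.

```python
import sys
from typing import Callable, Iterable

# A finding is a human-readable description of a blocking issue.
Finding = str

def gate_deployment(repo_path: str,
                    validators: Iterable[Callable[[str], list[Finding]]]) -> None:
    # Fail closed: each validator runs outside the coding agent's own product
    # layer, and any finding at all blocks the deployment.
    findings: list[Finding] = []
    for validate in validators:
        findings.extend(validate(repo_path))
    if findings:
        for finding in findings:
            print(f"BLOCKING: {finding}", file=sys.stderr)
        sys.exit(1)
    print(f"Independent gate passed for {repo_path}; deployment may proceed.")
```

The design choice that matters is independence: the validators are injected from outside, so no change to the agent's own harness can silently weaken the gate.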
The Role of Quality Clouds
Quality Clouds provides the AI Code Governance layer between AI-generated code and production deployment. LivecheckAI monitors agent behaviour continuously, surfacing regression effects before they reach enterprise systems — regardless of whether a regression originates in model weights, reasoning defaults, context management, or system prompt changes.
Full Scan analyses entire platform repositories against Quality Gates that Quality Clouds' AI Rule Builder maintains independently of any vendor's model updates or product-layer configuration. When Claude Code changes its reasoning defaults, ships a caching defect, or introduces a system prompt that degrades output quality, Quality Clouds catches the effects in the artefacts those changes produce, before those artefacts reach ServiceNow, Salesforce, or any other enterprise platform.
Conclusion
Anthropic's postmortem states it plainly: this was not the experience users should expect from Claude Code. It is also a precise technical record of why AI Code Governance exists as a discipline. Six weeks. Three regressions. Every vendor-operated quality gate bypassed. The model weights never changed — the harness changed, and the standard controls failed to catch it. Enterprises that treat AI coding agents as self-governing production tools are accepting the same risk.
Production-Ready AI Code requires an independent governance layer. That is what Quality Clouds provides.
Albert Franquesa
Co-Founder & CSO, Quality Clouds