Anthropic Releases Version 3.0 of Its Responsible Scaling Policy

Anthropic has released the third version of its Responsible Scaling Policy, restructuring the framework to separate unilateral company commitments from industry-wide recommendations, while introducing Frontier Safety Roadmaps, periodic Risk Reports, and external review mechanisms to improve transparency and accountability.

Anthropic, Feb 24, 2026

Anthropic has published the third iteration of its Responsible Scaling Policy (RSP), the voluntary framework the company uses to mitigate catastrophic risks posed by AI systems.

After more than two years of operating under an RSP, Anthropic has gained significant insight into what works and what doesn't. The updated policy reinforces successful elements, addresses shortcomings, and introduces new measures to boost transparency and accountability.

The Original RSP and Anthropic's Theory of Change

The RSP was designed to tackle AI risks that don't yet exist when a policy is drafted but could emerge rapidly due to exponential technological progress. When Anthropic wrote the first RSP in September 2023, large language models were essentially chat interfaces. Today they can browse the web, write and execute code, operate computers, and take autonomous multi-step actions. New capabilities have consistently brought new risks, and this pattern is expected to continue.

The RSP centered on a conditional or if-then principle: if a model surpassed certain capability thresholds (e.g., biological science capabilities that could help create dangerous weapons), then stricter safeguards would need to be introduced (e.g., protections against misuse and model weight theft).

Each safeguard set corresponded to an "AI Safety Level" (ASL). ASL-2 described one tier of required protections, while ASL-3 described more stringent safeguards for more capable models. Early ASLs were defined in detail, but later levels (ASL-4 and beyond) were intentionally left vague, with the plan to flesh them out once higher capability levels became clearer.
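
As a rough illustration of the if-then structure, here is a minimal sketch in Python. The ASL names come from the policy, but the evaluation flag and safeguard lists are hypothetical placeholders; the actual RSP defines all of this in prose, not code.

```python
# Illustrative sketch of the RSP's if-then principle. The ASL names are
# real, but the evaluation flag and safeguard lists are invented
# placeholders, not quotes from the policy.

from dataclasses import dataclass


@dataclass
class SafetyLevel:
    name: str
    required_safeguards: list[str]


ASL_2 = SafetyLevel("ASL-2", ["baseline misuse filtering", "standard security"])
ASL_3 = SafetyLevel("ASL-3", ["input/output classifiers", "hardened weight security"])


def required_level(eval_results: dict[str, bool]) -> SafetyLevel:
    """If a model crosses a capability threshold, then the stricter
    safeguard tier becomes a prerequisite for further training and release."""
    if eval_results.get("meaningful_bioweapons_uplift", False):
        return ASL_3
    return ASL_2
```

The point of the sketch is only the shape of the rule: crossing a pre-specified capability threshold changes which safeguards must be in place before a model can be trained further or deployed.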

Anthropic's theory of change had several components:

  • An internal forcing function. The RSP was intended to make important safeguards a prerequisite for training and launching new models, clarifying their importance across the growing organization and accelerating progress.
  • A race to the top. Announcing the RSP was meant to encourage other AI companies to adopt similar policies, creating a dynamic where industry players compete to strengthen rather than weaken safety measures. Over time, Anthropic hoped these voluntary standards would become industry norms or inform AI legislation.
  • Building consensus about risks. Capability thresholds were envisioned as pivotal moments for the industry. Upon reaching a threshold, Anthropic would implement safeguards and use the evidence gathered to advocate for broader action by other companies and governments, moving from unilateral to multilateral measures.
  • Looking to the future. Anthropic recognized that at higher capability levels, the necessary countermeasures (such as robust defenses against state-level actors) would likely be impossible for a single company to achieve alone. The hope was that by the time those capabilities arrived, the dangers would be widely recognized and coordinated government action would be feasible.

Assessing the Theory of Change

Two and a half years later, Anthropic's honest assessment is that some elements succeeded while others fell short.

What worked:

  • The RSP did drive the development of stronger safeguards. To meet the ASL-3 deployment standard (focused primarily on chemical and biological weapon risks from modestly resourced threat actors), Anthropic developed increasingly sophisticated methods, specifically input and output classifiers, to block concerning content; a minimal sketch of this pattern appears after this list.
  • Implementing the ASL-3 standard did prove feasible. Anthropic activated ASL-3 protections for relevant models in May 2025 and has continued refining them.
  • The RSP did spur other AI companies to adopt similar frameworks. Within months, both OpenAI and Google DeepMind introduced broadly comparable policies. Some companies also deployed bioweapon-related classifiers similar to Anthropic's ASL-3 defenses. These voluntary standards have helped shape early AI policy, with governments worldwide (e.g., California's SB 53, New York's RAISE Act, and the EU AI Act's Codes of Practice) beginning to require frontier AI developers to publish frameworks for assessing and managing catastrophic risks.
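
To make the classifier pattern concrete, here is a minimal sketch. The scoring function is a toy keyword stand-in and the threshold is invented; Anthropic's production safeguards are trained classifiers, and none of the terms or values below come from the company.

```python
# Minimal sketch of an input/output classifier gate. The scorer below is
# a toy keyword check standing in for a trained classifier; the blocklist
# and threshold are invented for illustration.

BLOCK_THRESHOLD = 0.9


def risk_score(text: str) -> float:
    """Toy stand-in: score 1.0 if the text matches a hypothetical blocklist."""
    flagged_terms = ("synthesis route", "enrichment protocol")  # invented
    return 1.0 if any(term in text.lower() for term in flagged_terms) else 0.0


def guarded_generate(prompt: str, generate) -> str:
    # Screen the user's input before the model ever sees it...
    if risk_score(prompt) >= BLOCK_THRESHOLD:
        return "Request declined by input safeguard."
    completion = generate(prompt)
    # ...and screen the model's output before the user sees it.
    if risk_score(completion) >= BLOCK_THRESHOLD:
        return "Response withheld by output safeguard."
    return completion
```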

What didn't work as hoped:

  • Using RSP thresholds to build consensus on AI risks did not play out as expected. Pre-set capability levels turned out to be far more ambiguous than anticipated. In some cases, model capabilities clearly approached thresholds, but there was substantial uncertainty about whether they definitively crossed them. The science of model evaluation isn't mature enough to provide clear-cut answers. Anthropic took a precautionary approach and deployed relevant safeguards, but internal uncertainty translated into a weak external case for industry-wide multilateral action.
    • Biological risks illustrate this "zone of ambiguity." Anthropic's models now demonstrate enough biological knowledge to pass most quick tests, making it hard to argue risks are low. But those tests alone don't prove risks are high either. Additional evidence, including an extensive wet-lab trial, has yielded ambiguous results, partly because more powerful models emerge by the time such studies conclude.
  • Despite rapid AI capability advances, government action on AI safety has been slow. The policy environment has shifted toward prioritizing competitiveness and economic growth, while safety-focused discussions have gained little traction at the federal level. Anthropic remains convinced that effective government engagement on AI safety is both necessary and achievable, and it aims to continue advancing evidence-based dialogue grounded in national security, economic competitiveness, and public trust. This is proving to be a long-term endeavor, however, rather than something that happens organically as AI crosses capability thresholds.

While ASL-3 safeguards were implementable unilaterally at reasonable cost, this may not hold for higher capability levels. The robust mitigations outlined in the prior RSP for higher ASLs might prove impossible without collective action. A RAND report on model weight security notes that its "SL5" security standard, aimed at stopping top-priority operations by the most cyber-capable institutions, is "currently not possible" and "will likely require assistance from the national security community."

The combination of ambiguous risk evidence, an anti-regulatory political climate, and higher-level RSP requirements that are extremely difficult to meet unilaterally creates a structural challenge. Rather than weakening ASL-4 and ASL-5 definitions to make compliance easy (which would undermine the RSP's purpose), Anthropic chose to acknowledge these challenges transparently and restructure the policy before reaching those higher levels. The revised RSP adopts more realistic unilateral commitments that are difficult but achievable, while continuing to comprehensively map risks that the broader industry needs to address collectively.

Key Changes in the Updated RSP

The new version has three key elements.

1. Separating Company Plans from Industry Recommendations

Anthropic's RSP now distinguishes between two sets of mitigations: those the company plans to pursue regardless of what others do, and an ambitious capabilities-to-mitigations map that Anthropic believes would adequately manage advanced AI risks if implemented across the entire industry.
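
One way to picture this restructuring is as a two-tier map per risk area, as in the hypothetical sketch below. Every entry is an invented placeholder for illustration, not a quote from the policy.

```python
# Hypothetical sketch of the two-tier structure: for each risk area, the
# mitigations Anthropic plans to pursue unilaterally versus those it
# recommends the whole industry adopt. All entries are invented.

CAPABILITIES_TO_MITIGATIONS = {
    "advanced-bio-uplift": {
        "unilateral_commitments": [
            "input/output classifiers",
            "hardened internal security",
        ],
        "industry_recommendations": [
            "state-actor-resistant weight security",
            "coordinated pre-deployment evaluations",
        ],
    },
}
```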

2. Frontier Safety Roadmap

The updated RSP introduces a requirement to develop and publish a Frontier Safety Roadmap describing concrete risk mitigation plans across Security, Alignment, Safeguards, and Policy. Goals in the Roadmap are intended to be ambitious yet achievable, providing the kind of forcing function that proved successful in earlier RSP versions.

Rather than being hard commitments, these are public goals against which Anthropic will openly grade its progress. This "nonbinding but publicly declared" approach draws on the transparency model Anthropic has championed for frontier AI legislation (while providing far more detail than existing laws require) and on successes from previous RSP versions.

Example goals from the current Frontier Safety Roadmap include:

  • Launching "moonshot R&D" projects exploring ambitious, possibly unconventional approaches to achieving unprecedented levels of information security
  • Developing a red-teaming method (likely involving significant automation) that surpasses the collective output of the hundreds of participants in Anthropic's bug bounty program
  • Implementing systematic measures to ensure Claude behaves according to its constitution
  • Establishing comprehensive, centralized records of all critical AI development activities and using AI to analyze them for issues including concerning insider behavior (both human and AI) and security threats
  • Publishing a policy roadmap with concrete proposals for a "regulatory ladder" (policies that scale with increasing risk) to help guide government AI policy

3. Risk Reports and External Review

Risk Reports build on what worked well in the previous RSP. Anthropic found that producing a proto-Risk Report (the Safeguards Report from May 2025) was valuable for internal understanding and public communication of risks. Risk Reports extend this into a more systematic, comprehensive practice.

These reports will provide detailed information about the safety profile of Anthropic's models at the time of publication. They go beyond describing model capabilities to explain how capabilities, threat models (the specific ways models might pose threats), and active risk mitigations fit together, offering an overall risk assessment. Risk Reports will be published online (with some redactions[^1]) every 3–6 months.

The new RSP also requires external review of Risk Reports under certain circumstances. Anthropic will appoint expert third-party reviewers who are deeply familiar with AI safety research, incentivized to be candid about the company's safety position, and free of major conflicts of interest. These reviewers will have unredacted or minimally redacted access to Risk Reports and will subject Anthropic's reasoning, analysis, and decision-making to comprehensive public review. Although current models do not yet require external review, Anthropic is already running pilots and working toward this goal.

Risk Reports will also address gaps between Anthropic's current safety and security measures and the company's more ambitious industry-wide recommendations. Anthropic hopes that describing and publicizing such gaps will contribute to public awareness and beneficial policy change.

Conclusion

The Responsible Scaling Policy was always intended to be a living document with the flexibility to evolve as AI models become more capable. This third revision amplifies what worked in previous versions, commits Anthropic to greater transparency about its plans and risk considerations, and separates industry-wide recommendations from what the company can achieve on its own. In that same spirit of pragmatism, Anthropic will continue revising and refining the RSP and its methods for evaluating and mitigating risks as the technology evolves.

[^1]: As discussed in the RSP, Anthropic aims to minimize redactions in the public version of Risk Reports. Reasons for potential redactions include legal compliance, intellectual property protection, public safety, and privacy.