Anthropic has released a major revision to their Responsible Scaling Policy (RSP), the risk governance system they employ to address potential catastrophic risks from cutting-edge AI systems. The revised policy maintains Anthropic's pledge to avoid training or deploying models without appropriate safety measures while introducing a more adaptive and sophisticated method for evaluating and handling AI risks.
Core Framework Components
The updated framework centers on two primary elements:
- Capability Thresholds: Specific AI abilities that would necessitate enhanced safeguards beyond current standards
- Required Safeguards: The AI Safety Level Standards (ASL Standards) that must be in place to address risks once a threshold is crossed
Critical Capability Thresholds
The policy establishes two significant thresholds requiring elevated safeguards:
Autonomous AI Research and Development
If models can independently perform complex AI research tasks that typically require human expertise, potentially accelerating AI development in unpredictable ways, Anthropic requires elevated security standards (potentially ASL-4 or higher) along with additional safety measures.
CBRN Weapons Capabilities
If models can substantially assist individuals with basic technical knowledge in developing or deploying chemical, biological, radiological, or nuclear weapons, ASL-3 standards with enhanced security and deployment controls become mandatory.
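The relationship between thresholds and safeguards described above can be sketched as a simple lookup table. This is only an illustrative sketch: the class, field names, dictionary keys, and helper function are hypothetical and do not come from the policy, and the entries merely paraphrase the two thresholds named in this section.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class CapabilityThreshold:
    """Hypothetical record pairing a capability threshold with its safeguards."""
    name: str
    trigger: str                  # capability that crosses the threshold
    required_standard: str        # ASL Standard named in the text above
    safeguards: tuple[str, ...]   # safeguard categories named in the text above


# Entries paraphrase the two thresholds in this section; keys are assumptions.
THRESHOLDS = {
    "ai_rnd": CapabilityThreshold(
        name="Autonomous AI Research and Development",
        trigger="independently performs complex AI research tasks",
        required_standard="ASL-4 or higher",  # exact level is not fixed in the text
        safeguards=("elevated security", "additional safety measures"),
    ),
    "cbrn": CapabilityThreshold(
        name="CBRN Weapons Capabilities",
        trigger="substantially assists CBRN weapons development or deployment",
        required_standard="ASL-3",
        safeguards=("enhanced security", "deployment controls"),
    ),
}


def required_safeguards(crossed: list[str]) -> dict[str, str]:
    """Map each crossed threshold key to the ASL Standard it calls for."""
    return {key: THRESHOLDS[key].required_standard for key in crossed}
```

For example, `required_safeguards(["cbrn"])` returns a mapping from the CBRN key to the ASL-3 standard. A real assessment process would involve far more than a table lookup; the sketch only shows the shape of the threshold-to-standard mapping.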
ASL-3 Safeguard Specifications
ASL-3 protections encompass strengthened security protocols and deployment controls, including:
- Internal access restrictions and improved model weight protection
- Multi-tiered misuse prevention systems
- Real-time and asynchronous monitoring capabilities
- Rapid response procedures
- Comprehensive pre-deployment red team testing
Implementation Structure
The policy establishes several implementation components:
- Capability assessments: Regular model evaluations against Capability Thresholds
- Safeguard assessments: Systematic evaluation of security and deployment safety effectiveness
- Documentation processes: Procedures modeled after safety case methodologies from high-reliability sectors
- Governance measures: Internal stress-testing protocols and external expert consultation
Lessons from Initial Implementation
After one year of operating under the original RSP, Anthropic identified several areas for improvement through their compliance review. They discovered minor procedural issues, including:
- A three-day delay in completing scheduled evaluations
- Insufficient clarity regarding documentation of evaluation methodology changes
- Cases where standard techniques could have elicited stronger model performance during evaluations
These findings led to incorporating greater flexibility into the policy and improving compliance tracking processes.
Leadership and Organizational Changes
Jared Kaplan, Co-Founder and Chief Science Officer, has assumed the role of Responsible Scaling Officer, taking over from Sam McCandlish. Additionally, Anthropic is recruiting a Head of Responsible Scaling to coordinate implementation efforts across teams.
Contributing Teams
Multiple teams contribute to risk management through the RSP:
- Frontier Red Team (threat modeling and capability assessments)
- Trust & Safety (deployment safeguards)
- Security and Compliance (security safeguards and risk management)
- Alignment Science (ASL-3+ safety measures and misalignment evaluations)
- RSP Team (policy drafting and cross-company execution)
External Collaboration
Anthropic has shared their assessment methodology with AI Safety Institutes and selected independent experts for feedback. Sharing it does not constitute an endorsement from any of these organizations or experts.
Complementary Risk Management
While the RSP focuses on catastrophic risks, Anthropic maintains additional risk management approaches:
- Usage Policy establishing product use standards
- Technical measures for enforcing trust and safety at scale
- Research on broader societal impacts of AI models
Future Direction
Anthropic acknowledges that rapid AI advancement makes predicting appropriate future safety measures challenging. They commit to continuous evolution of their policies, evaluation methodologies, safeguards, and research into potential risks and mitigations.
The complete updated policy is available at anthropic.com/rsp, with supplementary information at anthropic.com/rsp-updates.