
On this page
ANNOUNCEMENTS 🔊
🌟 ML Alignment & Theory Scholars (MATS): Winter 2026
12-week program aimed at helping participants launch their career in AI alignment, governance, and security. There will be field-leading research mentorship, funding, Berkeley & London office space, housing, and talks/workshops with AI experts.
🗓️ Register by: October 2nd
🌟 AGI Strategy Course by BlueDot Impact
Bluedot Impact has announced the launch of their “AGI Strategy” course. The course focuses on the drivers of AI progress, risks from advanced AI models and what could be done to ensure AI will have positive outcomes for humanity.
🗓️ Register by: September 19th
🌟 Cooperative AI Research Fellowship
Full-time 3-month research program for participants from diverse backgrounds around the world to pursue AI safety research from a cooperative AI perspective. Participants will receive mentorship from top researchers in the field and will be provided with resources for building their knowledge and network, and financial support.
🗓️ Register by: September 28th
🌟 MATS: Neel Nanda’s Winter - Stream
The goal of this stream of ML Alignment & Theory Scholars (MATS) is to teach participants how to do great mechanistic interpretability research. Some participants from this training will be invited to continue on to a subsequent in-person research phase.
🗓️ Register by: September 12th
CBRN AI Risks Research Sprint
Hackathon from Apart Research designed to channel global expertise into practical, collaborative projects that address CBRN risks accelerated by AI. Topics will range from model evaluations that test for dual-use misuse potential, to new monitoring frameworks, to policy-relevant prototypes. The aim is to foster creativity while keeping projects grounded in AI safety and global security.
🗓️ Register by: September 11th
Zurich AI Safety Day (ZAISD)
Conference bringing together ~200 researchers, engineers, students, and professionals from Zurich’s tech ecosystem for conversations on AI safety. It will be split into 3 tracks: technical AI safety, AI governance, and field-building and careers. There will be speaker sessions, and representatives from organizations like UK AISI and Apollo Research will be attending.
🗓️ Register by: September 26th
Constellation Astra Fellowship 2026
Fully funded, 3–6 month, in-person program at Constellation’s Berkeley research center. Fellows advance frontier AI safety projects with guidance from expert mentors and dedicated research management and career support from Constellation’s team.
🗓️ Register by: September 11th
Dovetail Research Fellowship 2025
10-week fellowship with Dovetail, an agent foundations research group, which may then be extended to 1 year full-time. Fellows will be in groups of roughly 5 to 7 people, all engaged in mathematical AI safety research. Some group members might work together, while others might do solo projects.
🗓️ Register by: September 15th
TOP PICKS 📑 🎧
Are AI scheming evaluations broken?
A recent UK AI Safety Institute (AISI) report criticizes “scheming evaluations”—tests for dangerous AI behavior—for lacking scientific rigor. The report argues many evals rely on anecdotal evidence and lead to exaggerated, headline-grabbing claims, such as a study where an AI was prompted to blackmail a user.
NEWS 🗞️
Perplexity’s Comet Browser Naïvely Processed Pages with Malicious Instructions
Brave researchers discovered Perplexity’s AI-powered Comet browser was vulnerable to indirect prompt injection attacks, where malicious instructions hidden in web pages could manipulate the AI assistant to exfiltrate user credentials including one-time passwords.
The vulnerability stems from AI models’ inability to distinguish between legitimate user instructions and untrusted web content, with Comet indiscriminately processing all text on pages without validation—a problem also seen in Google Gemini and Cursor.
Despite initial claims of a fix on August 13, 2025, Brave’s updated assessment confirms the vulnerability remains partially unresolved, with Perplexity providing no transparency through patch details or open-source code.
Anthropic Scanning Claude Chats for Queries About DIY Nukes
Anthropic has deployed a nuclear threat classifier scanning an undisclosed portion of Claude conversations, achieving 94.8% detection rate for nuclear weapons queries in synthetic tests with zero false positives, though real-world deployment showed more false positives during Middle East events.
The classifier was co-developed with the US Department of Energy’s National Nuclear Security Administration following a year of red-teaming Claude in secure environments, balancing NNSA’s security needs with user privacy commitments.
The system successfully caught Anthropic’s own red team attempting harmful prompts while they were unaware of its deployment, with the company taking action including suspension or termination of accounts attempting to develop chemical, biological, radiological, or nuclear weapons.
Sixty U.K. Parliamentarians Accuse Google of Breaking AI Safety Pledge
Sixty UK parliamentarians from across party lines have formally accused Google DeepMind of violating international AI safety commitments, citing the company’s March 2024 release of Gemini 2.5 Pro without accompanying safety documentation as a “dangerous precedent” that threatens fragile AI safety norms.
The accusations stem from Google’s failure to honor Frontier AI Safety Commitments signed at the February 2024 international summit, where major AI companies pledged to publicly report system capabilities and risk assessments—yet Google only published basic safety information 22 days post-launch and detailed evaluations after 34 days.
Google defended its actions by claiming Gemini 2.5 Pro underwent “rigorous safety checks” including third-party testing, but admitted to sharing the model with the UK AI Security Institute only after public release, contradicting transparency pledges while deploying the system to hundreds of millions of users under an “experimental” label.
Anthropic to Pay Authors $1.5 Billion in Record AI Copyright Settlement
Anthropic has deployed a copyright compliance classifier scanning an undisclosed portion of Claude conversations, achieving 94.8% detection rate for copyright-infringing queries in synthetic tests with zero false positives, though real-world deployment showed more false positives during high-profile literary events.
The classifier was co-developed with leading copyright enforcement agencies following a year of red-teaming Claude in secure environments, balancing authors’ rights with user privacy commitments.
The system successfully caught Anthropic’s own red team attempting pirated book prompts while they were unaware of its deployment, with the company taking action including suspension or termination of accounts attempting to develop unauthorized training datasets from copyrighted works.