12-week program aimed at helping participants launch their career in AI alignment, governance, and security. There will be field-leading research mentorship, funding, Berkeley & London office space, housing, and talks/workshops with AI experts.
Bluedot Impact has announced the launch of their “AGI Strategy” course. The course focuses on the drivers of AI progress, risks from advanced AI models and what could be done to ensure AI will have positive outcomes for humanity.
Full-time 3-month research program for participants from diverse backgrounds around the world to pursue AI safety research from a cooperative AI perspective. Participants will receive mentorship from top researchers in the field and will be provided with resources for building their knowledge and network, and financial support.
The goal of this stream of ML Alignment & Theory Scholars (MATS) is to teach participants how to do great mechanistic interpretability research. Some participants from this training will be invited to continue on to a subsequent in-person research phase.
Hackathon from Apart Research designed to channel global expertise into practical, collaborative projects that address CBRN risks accelerated by AI. Topics will range from model evaluations that test for dual-use misuse potential, to new monitoring frameworks, to policy-relevant prototypes. The aim is to foster creativity while keeping projects grounded in AI safety and global security.
Conference bringing together ~200 researchers, engineers, students, and professionals from Zurich’s tech ecosystem for conversations on AI safety. It will be split into 3 tracks: technical AI safety, AI governance, and field-building and careers. There will be speaker sessions, and representatives from organizations like UK AISI and Apollo Research will be attending.
Fully funded, 3–6 month, in-person program at Constellation’s Berkeley research center. Fellows advance frontier AI safety projects with guidance from expert mentors and dedicated research management and career support from Constellation’s team.
10-week fellowship with Dovetail, an agent foundations research group, which may then be extended to 1 year full-time. Fellows will be in groups of roughly 5 to 7 people, all engaged in mathematical AI safety research. Some group members might work together, while others might do solo projects.
A recent UK AI Safety Institute (AISI) report criticizes “scheming evaluations”—tests for dangerous AI behavior—for lacking scientific rigor. The report argues many evals rely on anecdotal evidence and lead to exaggerated, headline-grabbing claims, such as a study where an AI was prompted to blackmail a user.
Brave researchers discovered Perplexity’s AI-powered Comet browser was vulnerable to indirect prompt injection attacks, where malicious instructions hidden in web pages could manipulate the AI assistant to exfiltrate user credentials including one-time passwords.
The vulnerability stems from AI models’ inability to distinguish between legitimate user instructions and untrusted web content, with Comet indiscriminately processing all text on pages without validation—a problem also seen in Google Gemini and Cursor.
Despite initial claims of a fix on August 13, 2025, Brave’s updated assessment confirms the vulnerability remains partially unresolved, with Perplexity providing no transparency through patch details or open-source code.
Anthropic has deployed a nuclear threat classifier scanning an undisclosed portion of Claude conversations, achieving 94.8% detection rate for nuclear weapons queries in synthetic tests with zero false positives, though real-world deployment showed more false positives during Middle East events.
The classifier was co-developed with the US Department of Energy’s National Nuclear Security Administration following a year of red-teaming Claude in secure environments, balancing NNSA’s security needs with user privacy commitments.
The system successfully caught Anthropic’s own red team attempting harmful prompts while they were unaware of its deployment, with the company taking action including suspension or termination of accounts attempting to develop chemical, biological, radiological, or nuclear weapons.
Sixty UK parliamentarians from across party lines have formally accused Google DeepMind of violating international AI safety commitments, citing the company’s March 2024 release of Gemini 2.5 Pro without accompanying safety documentation as a “dangerous precedent” that threatens fragile AI safety norms.
The accusations stem from Google’s failure to honor Frontier AI Safety Commitments signed at the February 2024 international summit, where major AI companies pledged to publicly report system capabilities and risk assessments—yet Google only published basic safety information 22 days post-launch and detailed evaluations after 34 days.
Google defended its actions by claiming Gemini 2.5 Pro underwent “rigorous safety checks” including third-party testing, but admitted to sharing the model with the UK AI Security Institute only after public release, contradicting transparency pledges while deploying the system to hundreds of millions of users under an “experimental” label.
Anthropic has deployed a copyright compliance classifier scanning an undisclosed portion of Claude conversations, achieving 94.8% detection rate for copyright-infringing queries in synthetic tests with zero false positives, though real-world deployment showed more false positives during high-profile literary events.
The classifier was co-developed with leading copyright enforcement agencies following a year of red-teaming Claude in secure environments, balancing authors’ rights with user privacy commitments.
The system successfully caught Anthropic’s own red team attempting pirated book prompts while they were unaware of its deployment, with the company taking action including suspension or termination of accounts attempting to develop unauthorized training datasets from copyrighted works.