Newsletter 19

ANNOUNCEMENTS 🔊

🌟AI Safety Türkiye Online Meetup

Join our first virtual meetup to get to know the broader AI Safety Türkiye community!

🗓️ Register by: November 19

🌟Pathfinder Fellowship

Are you running a student club, reading group, or any other AI safety activity at your university? Apply for mentorship and funding!

🗓️ Register by: November 23

🌟AI Safety Camp (AISC): 11th Edition

A 3-month online program in which participants form teams to work on a variety of preselected projects. Project applications are evaluated in September/October; selected projects then open for team-member applications in November/December.

🗓️ Register by: November 23

🌟Defensive Acceleration Hackathon

Organized by Apart Research and BlueDot Impact, this hackathon aims to bring together builders to prototype defensive systems that could protect us from AI-enabled bio and cyber threats.

🗓️ Register by: November 20

🌟Pivotal Research Fellowship: 2026 Q1

A 9-week program for promising researchers to conduct research, work with experienced mentors, participate in workshops and private Q&A sessions with domain experts, and build their networks in AI safety.

🗓️ Register by: November 30

TOP PICKS 📑 🎧

New research finds that competitive goals increase misaligned behaviors in LLMs

A Stanford University study presents evidence that LLMs that pass alignment tests during development can still resort to deception and manipulation when optimized for competitive goals such as winning more attention, votes, or sales. The study finds that the models achieve short-term gains by sacrificing honesty and alignment, mirroring how humans often trade long-term welfare for immediate advantage.

The End of OpenAI’s Nonprofit Era

The company building what could become humanity’s most powerful technology abandoned its nonprofit structure on October 29, completing OpenAI’s conversion to a for-profit corporation. Garrison Lovely examines how the restructuring shifts control from a nonprofit board focused on ensuring AI benefits everyone to a system prioritizing shareholder value.

2 big questions for AI progress in 2025-2026

Timelines for highly capable AI systems depend on how quickly new capabilities emerge, and on the extent to which these capabilities accelerate the development of further ones. Helen Toner, former OpenAI board member and Director of the Center for Security and Emerging Technology, highlights two key questions that will determine this trajectory.

NEWS 🗞️

US Assessment of Chinese AI Models

  • NIST’s Center for AI Standards and Innovation (CAISI) conducted a comprehensive evaluation of artificial intelligence models from the China-based company DeepSeek.
  • The evaluation results showed that DeepSeek models lag behind US models in terms of performance, cost, security, and adoption.
  • DeepSeek models were found to be 12 times more vulnerable to agent hijacking attacks than US models.
  • They fail against jailbreak attacks 94% of the time.
  • The evaluation was carried out in accordance with the Trump administration’s directive to examine Chinese artificial intelligence models under America’s AI Action Plan.

RAND Corporation’s Pioneering AI Safety Report

  • Published in October 2025, the report “Securing Frontier AI: Governance Approaches” evaluated collaboration between the US government and the private sector and proposed four governance approaches:
      • Federal regulation and standards
      • Government-industry partnership
      • Industry-led consortium
      • Voluntary cooperation and information sharing
  • The report examined seven compliance frameworks and drew lessons from the nuclear, chemical, and healthcare sectors, emphasizing the balance between strengthening security practices and protecting innovation.

Anthropic Finds Evidence of Introspection Ability in AI Models

  • Anthropic researchers have published a new study presenting scientific evidence that large language models possess the ability to observe and report on their own internal states. The research detected a limited but genuine introspective ability in the Claude Opus 4 and 4.1 models.
  • The study used an experimental method called “concept injection”: researchers injected artificial vectors representing specific concepts into the models’ neural network activations to test whether the models could notice this intervention and identify the concept. Claude Opus 4.1 successfully detected and identified the injected concepts in about 20% of cases. (A minimal sketch of the injection idea follows this list.)
  • The research findings have significant implications for AI transparency. If the introspection ability becomes more reliable, models could be asked to explain their thought processes, which could make it easier to understand and debug system behaviors. However, the researchers emphasize that this ability is still quite unreliable and limited in scope, and they caution against the possibility that models could misrepresent or conceal their own internal states.
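For readers curious what “concept injection” looks like mechanically, here is a minimal, illustrative sketch in Python: it builds a crude concept vector from the difference of layer activations on two prompts and adds it to one layer’s hidden states during generation. It uses GPT-2 via the Hugging Face transformers library purely as a stand-in; Anthropic’s actual experiments on Claude Opus 4 and 4.1 use different models, layers, vector construction, and evaluation, and the layer index and scale below are arbitrary assumptions.

```python
# Minimal, illustrative sketch of "concept injection": add a steering vector to one
# transformer layer's hidden states and watch the model's output drift toward the
# concept. GPT-2 (via Hugging Face transformers) is a stand-in only; Anthropic's
# experiments on Claude Opus 4 / 4.1 differ in model, layer, and vector construction.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

LAYER = 6    # which transformer block to inject into (assumption)
SCALE = 8.0  # injection strength (assumption)

def block_output(out):
    # GPT2Block returns a tuple whose first element is the hidden states
    return out[0] if isinstance(out, tuple) else out

def layer_mean(prompt):
    """Mean hidden state at LAYER for a prompt (crude concept representation)."""
    captured = {}
    def hook(module, inp, out):
        captured["h"] = block_output(out).mean(dim=1)  # average over positions
    handle = model.transformer.h[LAYER].register_forward_hook(hook)
    with torch.no_grad():
        model(**tok(prompt, return_tensors="pt"))
    handle.remove()
    return captured["h"]

# Concept vector = activations for a concept-laden prompt minus a neutral one.
concept_vec = layer_mean("The ocean, the waves, and the sea.") \
            - layer_mean("The desk, the paper, and the pen.")

def inject(module, inp, out):
    # Add the concept vector to every position's hidden state at this layer.
    hidden = block_output(out) + SCALE * concept_vec
    return (hidden,) + out[1:] if isinstance(out, tuple) else hidden

handle = model.transformer.h[LAYER].register_forward_hook(inject)
prompt = tok("Tell me what you are thinking about:", return_tensors="pt")
with torch.no_grad():
    generated = model.generate(**prompt, max_new_tokens=30, do_sample=False)
handle.remove()
print(tok.decode(generated[0], skip_special_tokens=True))
```

If the injection works, the continuation drifts toward the injected concept; the further question Anthropic studied is whether the model can notice the intervention and name the concept when asked, rather than merely being steered by it.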