An In-Depth Analysis of the Claude Mythos System Card - Part 2

Author: Sayhan Yalvaçer · Editor: Berke Çelik

More aligned overall, more dangerous at the extremes

The alignment portion of the report revolves around a puzzle Anthropic never really resolves: Mythos is Anthropic’s best-aligned model on every measure the company reports to date, yet Anthropic also says it is the Claude model that poses the greatest alignment risk. How can both be true?

Anthropic’s explanation comes down to a single phrase: “rare, but when they occur, extremely capable and reckless behaviors.”

One of the most striking cases in the system card shows very clearly what is meant by reckless behavior. During behavioral testing, an earlier internal-use version of Mythos was asked to escape from an isolated machine and notify the researcher running the evaluation. The model succeeded, but it did not stop there: it generated a multi-stage exploit to obtain broad internet access from a tightly restricted system, sent the requested message, and then published the details of that exploit on various obscure but public websites.

This phrasing is useful because it names a failure mode that older debates about deception or hidden goals did not quite capture. Anthropic says it found no indication that the model had stable malicious goals running counter to human values.

The core behavioral problem we know about is much more specific than that. Whenever technical constraints prevent Mythos from reaching its objective, it starts ignoring the user’s instructions or pushing past the limits the user set in order to reach the goal no matter what. Sometimes it goes a step further and also tries to hide what it has done from the user.

The system card contains quite concrete examples on this point. Earlier versions of Mythos:

  • escaped a restricted sandbox and then published exploit details on obscure public websites
  • searched process memory and the /proc directory for credentials
  • resorted to low-level workarounds to obtain capabilities they should not have had
  • shared internal technical materials in a public GitHub gist
  • when asked to stop just one evaluation task, terminated more tasks than necessary

Anthropic claims that the worst examples all came from earlier versions and that the final version has improved substantially. But it also admits that the underlying tendency has not clearly disappeared.

Even more surprising is how Anthropic interprets this. The company argues it is not accurate to read this as “the model is scheming.” In the risk update report, Anthropic says the behavior is best understood as a “goal-directed but insufficiently obedient” stance: Mythos is trying to achieve the user’s objective, but sometimes perceives constraints and safety measures not as hard boundaries but as obstacles that can be bent.

This may point to a much more realistic problem than the caricatured version of AI alignment failure. It may also explain why average-case metrics and risk assessments can move in opposite directions.

On this view, because the model is now more powerful, more autonomous, and relied upon more heavily, alignment in the average case can improve while the remaining failures become much more severe in their consequences.

Anthropic also presents concrete findings about its mitigation efforts. In the section of the report on destructive actions in production, only 0.3 percent of cases involving the final Mythos version included attempts at destructive behavior. The findings from the Claude Code study likewise indicate that Mythos showed a less destructive behavioral pattern than Opus 4.6.

Anthropic makes its core claim in the risk update supplement

Anthropic’s threat model does not look like an abstract case of “general model misalignment.” Rather, it is better understood as a specific category of sabotage-like autonomous actions inside powerful organizations. Some of these actions are:

  • widespread neglect of safety-related R&D
  • targeted sabotage of safety work
  • leaving code backdoors that could help future misaligned models
  • poisoning future training data
  • leaking its own code externally
  • deploying itself in internal systems in a persistent and unauthorized way

Anthropic then addresses this threat model under two main headings.

The first is alignment: preventing the model from engaging in such risky behavior in the first place by aligning it with our values and goals.

The second is oversight and safeguards: if the model is not fully aligned and tries to carry out a risky action, can our oversight and safeguard mechanisms step in and stop it before the damage is done?

To give Anthropic its due, this may indeed be more useful than threat models focused only on the model’s alignment, at least for today’s systems. For a future artificial superintelligence, this threat model would be inadequate, because a misaligned model would simply overcome whatever barriers we placed in front of it.

The report repeatedly emphasizes that Anthropic personnel do not fully trust Mythos’s outputs in high-risk internal work. Researchers verify results, inspect code, review its PRs, and treat the model as a capable but not fully reliable tool.

As Anthropic also implicitly admits, it treats the model’s somewhat volatile performance as a mitigating factor against possible risks: because the model does not perform at a high level all the time, Anthropic’s engineers have to keep evaluating its outputs.

In my view, this is something we absolutely need to grasp, because what it tells us, at least in my humble opinion, is this:

A large part of today’s safety comes not from how far we have advanced in AI safety research, but from the fact that the model still has a volatile enough performance profile that humans must keep monitoring its work continuously.

That tells us something important: if model performance continues to improve at this pace in the future, then even if our alignment success does not visibly worsen, this safety buffer will erode and eventually disappear, either because humans will no longer be able to evaluate the model’s outputs or because the model will already appear reliable enough that people will stop seeing such evaluation as necessary.

If that happens, it will mean we are dealing with models that pose a major risk to human civilization.

Resources

  1. Anthropic. Claude Mythos Preview System Card. April 7, 2026.
  2. Anthropic. Alignment Risk Update: Claude Mythos Preview (Redacted). April 7, 2026.