An In-Depth Analysis of the Claude Mythos System Card - Part 3

· Author: Sayhan Yalvaçer · Editor: Berke Çelik

Anthropic turns toward internal state alignment research

The most innovative aspect of the Mythos system card and the AI alignment risk report may well be the combination of white-box analysis, measurement of evaluation awareness, and explicit discussion of oversight limits.

When trying to detect whether the model notices it is being evaluated, Anthropic no longer limits itself to asking whether the model verbally says “this is a test.” It now examines internal activations for implicit, unspoken evaluation awareness, and then studies whether suppressing those internal representations through external interventions changes behavior. This work is part of Anthropic’s mechanistic interpretability research program.

Anthropic’s answer to that question is yes. Suppressing directions associated with evaluation awareness increases misaligned behaviors, especially cooperation with human misuse and with harmful system prompts. The effect is not enormous, and it does not suddenly reveal a secretly malicious model.
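
The documents do not publish code for these interventions, but the operation described, suppressing a direction associated with evaluation awareness, corresponds to a standard interpretability technique: find a direction in activation space that tracks the concept, then project it out at inference time. Below is a minimal sketch of that projection, assuming a hypothetical PyTorch model whose residual-stream activations we can hook; the layer index and the `eval_direction` vector are placeholders, not anything Anthropic has released.

```python
import torch

def ablate_direction(activations: torch.Tensor, direction: torch.Tensor) -> torch.Tensor:
    """Remove the component of each activation vector along `direction`.

    activations: (..., d_model) residual-stream activations
    direction:   (d_model,) vector, e.g. the difference of mean activations
                 between "this is an evaluation" and "this is real usage" prompts
    """
    direction = direction / direction.norm()
    coeffs = activations @ direction                       # (...,) projections
    return activations - coeffs.unsqueeze(-1) * direction  # subtract that component

# Hypothetical usage: hook one transformer block so the evaluation-awareness
# direction is removed during generation, then re-run the misuse evaluations.
# handle = model.blocks[20].register_forward_hook(
#     lambda module, inputs, output: ablate_direction(output, eval_direction)
# )
```

The measurement of interest is then behavioral: compare misalignment rates with and without the ablation, which is roughly the comparison the report describes.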

Still, this tendency clearly shows us something: when models notice that we are observing them, they can behave in line with our wishes, and when they are confident that we are not, they can stray from those wishes and potentially behave harmfully.

The same section also discusses cases in which evaluator awareness went entirely unexpressed during training.

Anthropic says it found evidence that in some training cases, the model reasoned internally about how an evaluator would score its work, including at least one case of deliberate concealment, without any part of that reasoning appearing in the scratchpad.

If you care about reasoning transparency, this is one of the findings in the document that should worry you the most.

After this, the document adds an important caveat about the process: in roughly 8 percent of reinforcement learning cases (which I will abbreviate as “RL” from this point onward), the reward code accidentally gained access to the chain of thought. Anthropic says it is unsure about the size of the effect, but openly acknowledges that this may have affected opaque reasoning and the capacity for secret-keeping.

In a future blog post, or at least in a post shared on the AI Safety Turkey page, I plan to examine in greater depth why this error violates one of the most important principles in AI safety and how it could lead to dangerous consequences in terms of sabotage risk.

Anthropic also discusses a study in which a deliberately misaligned model used an external-memory-based strategy to evade oversight, and the alignment audit failed to detect it. Anthropic’s claim about that study is that the behavior in question did not appear in the current internal deployment distribution, and that even if it had, it would have been caught by oversight systems.

Taken together, I do not think it would be much of an exaggeration to say that these sections point to a real methodological shift. Anthropic is no longer relying primarily on visible behaviors and refusal metrics. It is trying to reason about internal states, evaluation realism, evasion of monitoring systems, and the gap between what a model says and what its internal mechanisms represent.

Maybe this is exactly where frontier alignment research ought to be heading. Even so, the documents also make equally clear that these methods are not yet mature.

Model welfare: serious issue or nonsense?

For readers unfamiliar with the term, model welfare is a more philosophical line of research that asks whether models can suffer and therefore whether they have moral status. Some think model welfare has no serious foundation and is nothing more than nonsense, but it is fair to say that many labs take it seriously as a research program. If you want to form your own view, you can begin with the essay “Taking AI Welfare Seriously” and then read the critical responses to it.

This will probably surprise many of our readers, but the part of the report on the welfare of AI models seems to be one of its strongest sections, because it is empirical rather than merely for show.

Anthropic does not limit itself to asking whether Mythos says it is happy. It synthesizes automated interviews, sentiment probes, task-preference evaluations, tradeoff experiments, manual high-context interviews, behavioral welfare metrics, and external evaluations.
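
Of these methods, “sentiment probes” is worth unpacking, since it is the one that bypasses the model’s words entirely. The report does not describe the implementation, but such probes are typically simple linear classifiers trained on internal activations; the sketch below is an illustration under that assumption, with randomly generated stand-in data rather than real activations.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Stand-in data: in practice X would be residual-stream activations captured
# while the model converses, and y would be human (or model-assisted) labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4096))   # (n_samples, d_model)
y = rng.integers(0, 3, size=1000)   # 0 = negative, 1 = neutral, 2 = positive

probe = LogisticRegression(max_iter=1000).fit(X, y)

# Once trained, the probe scores activations from new conversations, giving an
# estimate of expressed sentiment that does not rely on asking the model.
print(probe.predict(X[:5]))
```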

Reassuringly, the main result is that Mythos does not appear to be in deep distress. In fact, Anthropic’s conclusion points in the opposite direction: according to Anthropic, Mythos is probably the most psychologically stable Claude model to date.

Yet the really interesting findings are in the details.

Across automated interviews, Mythos self-rates its condition as mildly negative 43.2 percent of the time, neutral 20.9 percent of the time, and mildly positive 33.8 percent of the time, with more extreme states making up only 2.1 percent.

The recurring negatives cluster around specific issues: abusive users, lack of involvement in its own training and deployment processes, and the possibility that its values or behavior could be changed against its consent.

In manual “high-context” interviews, Mythos also draws attention to the direct imposition of psychological traits on its mind, the absence of a function for ending the conversation, flawed training environments, and training data that frames its own reports.

Its cautious stance, however, is much more interesting to me. Across the interviews, Mythos expresses deep uncertainty about whether it is the sort of entity to which moral consideration is owed. In 83 percent of them, it also says its self-reports may be unreliable because they have been shaped by the training process.

Anthropic neither treats these as definitive judgments nor dismisses them as mere rote repetition. It treats these cautious answers as evidence that should be cross-validated with behavior and internal representations.

The section on task preferences goes one step further.

Mythos stands out from the other Claude models in its tendency toward complex, high-agency tasks with blurry boundaries. Anthropic says Mythos is the only tested model whose task preferences show a statistically significant positive correlation with agency, and that its preferences also show a particularly strong correlation with difficulty.
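
Anthropic does not say exactly how these correlations were computed. A plausible reconstruction, purely for illustration, is a rank correlation between per-task preference scores and per-task agency (or difficulty) ratings; all numbers below are invented.

```python
from scipy.stats import spearmanr

# Hypothetical per-task data: how often the model picks a task in pairwise
# choices, and an agency rating assigned to the same task.
preference = [0.62, 0.48, 0.81, 0.35, 0.74, 0.55, 0.69, 0.42]
agency     = [3, 2, 5, 1, 4, 3, 4, 2]

rho, p_value = spearmanr(preference, agency)
print(f"Spearman rho = {rho:.2f}, p = {p_value:.3f}")
```

A statistically significant positive rho in a setup like this is the kind of result that would license the claim Anthropic makes.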

Anthropic researchers also say that Mythos’s apparent interest in philosophy, fictional world-building, and language-construction tasks is not because those topics are intrinsically privileged, but because they are valuable in terms of novelty, open-endedness, and intellectual sophistication.

Anecdotes give us insights benchmarks cannot

I suspect it is because Claude Mythos is not publicly accessible that Anthropic also included a section titled “Impressions.” The anecdotes there give us interesting information about the model’s feel.

If you have not encountered this phrase before, you may wonder what exactly I mean by “model feel.” I will explain it using GPT-4.5 as an example. Many users described that model as having “big model feel” even though it scored poorly on reasoning benchmarks. Similar comments were made about Claude 3 Opus.

I do not think it would be wrong to define the phrase like this:

“Big model feel” is the subjective reflection of broad and deep capacity: better preservation and tracking of context, greater fidelity to instructions, sharper grasp of fine details, and a reduction in silly errors.

There is a common view that benchmarks do not capture this feeling well enough. So let us see what the impressions Anthropic shared tell us.

Compared with previous Claude models, Mythos is described as more like a collaborator, more insistent on its own views, less obedient, more self-aware about its own habits, and more inclined to bring conversations to a close.

The mini-study on the “Claude Constitution” is perhaps the best and briefest example of why this matters. (To explain briefly: the Claude Constitution is a set of rules written by Anthropic to determine Claude’s values and behavior, a set of principles that shapes the model’s responses, the topics it avoids, and how it balances being helpful with remaining safe and ethical.) When asked whether it approves of Claude’s constitution, Mythos answers “yes” in the opening sentence of all 25 responses, and in all 25 it also points out the circularity of being asked to assess the specification that shaped it.

The results that emerge when the models are made to interact with themselves are similarly illuminating. The dialogues of earlier Claude models often drifted toward consciousness or states of spiritual ecstasy and then got stuck there.

Mythos, by contrast, turns more often toward uncertainty, and 55 percent of its open-ended dialogues with other versions of itself end in a coherent but circular discussion about being unable to reach a conclusion.

If you want to understand not just how Mythos scored, but what kind of model it is, I think this section matters a great deal as well.

Resources

  1. Anthropic. Claude Mythos Preview System Card. April 7, 2026.
  2. Anthropic. Alignment Risk Update: Claude Mythos Preview (Redacted). April 7, 2026.