AI Safety Careers #2

Authors: Berke Çelik and Sayhan Yalvaçer

Large language models: The power of scale

Scale, emergence, and RLHF

We ended the previous article with this question: What happens if we train the Transformer architecture on text from the entire internet at a massive scale?

The short answer: a lot more than expected. As models grew, it wasn’t just their error rates that dropped. Beyond certain thresholds, new skills that no one explicitly programmed began to emerge. In this post, we explore that process, the scaling dynamics behind it, and the turning points that brought us to the models we have today.

If concepts like tokens, neural networks, and the transformer architecture sound unfamiliar, we recommend reading the first article in the series first.

Large language models (LLMs) are, at their core, this same idea applied at massive scale. Such a model is trained on gigantic datasets of text, and its primary task is to predict which word, or word fragment – technically, a token – is most likely to follow a given sequence of text. Even though it sounds incredibly simple, this task forces the model to learn skills like grammar, context, world knowledge, and instruction following.
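To make "predict the next token" concrete, here is a deliberately tiny sketch in Python. The vocabulary and the scores are invented for illustration; a real model assigns scores to tens of thousands of tokens and computes them with billions of parameters rather than a hand-written list.

```python
import math

# Toy illustration of next-token prediction (not a real model).
# A language model assigns a score (a "logit") to every token in its
# vocabulary; softmax turns those scores into probabilities.
vocab = ["Ankara", "Istanbul", "Paris", "banana"]
logits = [4.2, 2.1, 0.3, -3.0]  # invented scores for "The capital of Turkey is"

exp_scores = [math.exp(x) for x in logits]
total = sum(exp_scores)
probs = [s / total for s in exp_scores]

for token, p in sorted(zip(vocab, probs), key=lambda pair: -pair[1]):
    print(f"{token:10s} {p:.3f}")

# Training nudges the parameters so that the probability assigned to the
# token that actually came next goes up, one text snippet at a time.
```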

In 2020, OpenAI released GPT-3, one of the first major breakthroughs of this approach.

The model had 175 billion parameters. Parameters are the learned numerical weights and biases inside the model: they are continuously updated during training and determine which word, relationship, or pattern the model considers more likely.

This number doesn’t mean 175 billion separate rules. The model’s knowledge isn’t stored in a single place; it’s distributed across a vast pattern of connections that have been gradually adjusted during training.
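For a rough sense of where a number like 175 billion comes from: a common back-of-the-envelope estimate for a GPT-style Transformer is about 12 × layers × (hidden size)², counting only the attention and feed-forward weight matrices and ignoring embeddings. Plugging in GPT-3's published configuration (96 layers, hidden size 12,288) gets close to the headline figure; the sketch below is just this arithmetic, not an exact accounting.

```python
# Back-of-the-envelope parameter count for a GPT-style Transformer:
# roughly 12 * n_layers * d_model**2 (attention + feed-forward weights;
# embedding matrices and biases are ignored).
n_layers = 96      # GPT-3's published number of layers
d_model = 12288    # GPT-3's published hidden size

approx_params = 12 * n_layers * d_model ** 2
print(f"~{approx_params / 1e9:.0f} billion parameters")  # ~174 billion
```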

For instance, if the model sees the phrase “The capital of Turkey is…” and considers “Ankara” more likely than “Istanbul,” it’s these learned weights that make the difference. Similarly, these parameters determine the tone of a sentence, whether a question is a math problem, or what a word means based on its context.

What set GPT-3 apart was that the number of these learned parameters had increased dramatically compared to previous models: more than a hundred times that of its predecessor, GPT-2 (1.5 billion). This quantitative increase began to look like a qualitative leap in certain tasks.

More parameters meant the model could represent finer relationships and long-distance dependencies in language. What was striking about GPT-3 was that this manifested as a new kind of behavior in some tasks.

You could ask GPT-3, “Is this customer review positive or negative?”, “What is the English translation of this French text?”, or “Summarize this paragraph in a single sentence.” Even though the model wasn’t explicitly trained for these tasks, it could provide reasonable answers. Researchers call this zero-shot learning: you can ask the model directly without needing to provide examples to teach it a new task, and the model can often accurately follow an instruction it has never seen before.

That was exactly the strange part. An ability that didn’t exist at all in smaller models suddenly emerged beyond a certain scale.

Unpredictable capabilities

Researchers called this phenomenon emergence. This term is important because it explains two things at once: why the models work so surprisingly well, and why they are unsettling from a safety perspective.

A simple analogy from the real world:

You heat water. At 60 degrees, it’s hot water; at 80 degrees, it’s hotter water. But at 100 degrees, something qualitatively different happens: the water starts to boil. The temperature increase is gradual, but the boiling is sudden. The resulting change of state (a phase transition) is still predictable from the input, namely temperature, but it doesn’t track that input linearly. In philosophy, similar situations are sometimes referred to as weak emergence.

Something similar was observed in large language models. As you scale a model – more parameters, more data, more compute – some skills improve gradually. Others do not: zero performance below a certain threshold, and suddenly good performance above it. Multi-step arithmetic problems, certain logic tasks, reasoning by analogy… These didn’t work at all in small models, but started working beyond a specific scale.

But not every sudden leap has to be a true phase transition. In some cases, the jump might stem more from how we measure things than from the model itself.
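One way a jump can come from the measurement rather than the model: suppose per-token accuracy improves smoothly with scale, but the benchmark only counts an answer as correct when every token in it is right. The numbers below are purely illustrative, not taken from any real evaluation.

```python
# Illustrative only: a smooth underlying ability can look like a sudden
# "emergent" jump under all-or-nothing scoring. If an answer is k tokens
# long and each token is right with probability p, exact match succeeds
# with probability p**k.
answer_length = 20  # k: all twenty tokens must be correct

for per_token_accuracy in [0.5, 0.7, 0.8, 0.9, 0.95, 0.99]:
    exact_match = per_token_accuracy ** answer_length
    print(f"per-token {per_token_accuracy:.2f} -> exact match {exact_match:.3f}")

# Per-token accuracy rises gradually, but exact match stays near zero for
# most of the range and then climbs steeply, tracing an "emergent" curve
# out of smooth underlying gains.
```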

This is worth pausing on: if useful capabilities emerge unpredictably like this, dangerous capabilities could emerge the exact same way. When training a model, you can’t know exactly what it can do until the training is finished and you test it. Strategic planning or deceptive behaviors aimed at misleading people could also be capabilities that get “unlocked” beyond a certain scale. And this shows why developers’ “build first, test for safety later” approach is structurally dangerous: some behaviors that don’t show up in testing might emerge later when the model is used in new contexts.

So if these leaps are so unpredictable, why did researchers keep training larger models anyway? Because, in parallel, another finding emerged showing that as scale increased, performance also improved in a surprisingly regular way.

Scaling laws

In 2020, OpenAI researchers (Kaplan et al.) published a paper examining scaling laws for language models. What they found was roughly this: training loss, in particular, behaved like a smooth and predictable function of three variables, following a power law in each: the number of parameters N, the amount of training data D, and the compute used C.

On paper, this relationship looked almost as clean as a law of physics. When you gave the model more resources, the outcome didn’t fluctuate randomly; it followed a roughly plottable curve. You could also roughly predict how much the loss would decrease when you increased the parameter count, data volume, or compute.

The “loss” here is the model’s average error in predicting the next token. Because it fell along such a regular curve rather than arbitrarily, scaling stopped being blind trial-and-error and turned into an engineering problem with roughly calculable outcomes.
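A minimal sketch of what such a law looks like for model size alone, assuming the power-law form and roughly the constants reported by Kaplan et al. (the exact numbers depend on the dataset and setup, so treat them as illustrative rather than definitive):

```python
# Kaplan-style power law for loss as a function of parameter count alone:
#   L(N) ~ (N_c / N) ** alpha_N
# Constants roughly as reported in the 2020 paper; illustrative, not exact.
alpha_N = 0.076
N_c = 8.8e13

def predicted_loss(n_params: float) -> float:
    """Predicted training loss for a model with n_params parameters."""
    return (N_c / n_params) ** alpha_N

for n in [1e8, 1e9, 1e10, 1.75e11]:  # 100M ... 175B parameters
    print(f"{n:.0e} params -> predicted loss {predicted_loss(n):.2f}")

# Every 10x increase in parameters multiplies the predicted loss by the same
# constant factor (about 0.84 here): a straight line on a log-log plot.
```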

In 2022, DeepMind’s Chinchilla research took this picture a step further. What they found was this: models like GPT-3, viewed from the perspective of a fixed compute budget, were trained on insufficient data (too few tokens) relative to their size. They showed that for a fixed compute budget, it was optimal to scale parameters and data together at a specific ratio, roughly 20 training tokens per parameter. In other words, a massive model trained on too little data for its size could perform worse than a smaller model trained on the right amount of data.
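A rough sketch of the Chinchilla intuition, using the widely quoted rule of thumb of about 20 training tokens per parameter and the standard approximation that training compute is about 6 × N × D floating-point operations. Both are simplifications of the paper’s fitted results, not its exact coefficients.

```python
# Chinchilla-style rule of thumb: for a fixed compute budget, pick the model
# size and the number of training tokens together. Approximations used here:
#   compute C ~ 6 * N * D FLOPs, and D ~ 20 * N tokens at the optimum.
TOKENS_PER_PARAM = 20
FLOPS_PER_PARAM_PER_TOKEN = 6

def compute_optimal(flop_budget: float) -> tuple[float, float]:
    """Split a FLOP budget into a parameter count N and a token count D."""
    n_params = (flop_budget / (FLOPS_PER_PARAM_PER_TOKEN * TOKENS_PER_PARAM)) ** 0.5
    n_tokens = TOKENS_PER_PARAM * n_params
    return n_params, n_tokens

# GPT-3's training run used on the order of 3e23 FLOPs.
n, d = compute_optimal(3e23)
print(f"~{n / 1e9:.0f}B parameters trained on ~{d / 1e12:.1f}T tokens")

# GPT-3 itself had 175B parameters but saw only about 0.3T tokens: far fewer
# tokens per parameter than this rule suggests, which is Chinchilla's point.
```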

The strange and unsettling thing about scaling laws is this: These laws are extraordinarily regular, but no one can fully explain why they are the way they are. We know that more compute correlates with better performance.

But what exactly is this extra capacity doing inside the neural network? What “knowledge” is being added? Why does the relationship take this exact mathematical form? These are open questions. There is an observable regularity that allows us to predict the models’ capabilities, but the theoretical understanding underlying this regularity is missing.

If this doesn’t bother you, it should. As humanity, we are continuing to scale a system without fully understanding why it works, on the assumption that it will keep working. It’s hard to call this a “controlled process.”

ChatGPT and RLHF

At the end of 2022, OpenAI released ChatGPT, which could be interacted with via a chat interface. Technically, ChatGPT was a model from the GPT-3.5 series adapted for conversation, first through supervised fine-tuning, and then through RLHF (Reinforcement Learning from Human Feedback).

The conceptual importance of RLHF was this: the base language model had learned to mimic language by being trained on internet text, but it wasn’t an “assistant” yet. Its job was simply to continue the text given to it in the most likely way. Because of this, the raw model would sometimes give a helpful answer, sometimes hallucinate, and sometimes mimic another text format it might have seen online instead of actually answering the question.

At this point, RLHF added a second training phase. The model was asked questions and prompted to generate multiple candidate answers, and human evaluators ranked these answers in order of preference. From these rankings, a separate reward model was trained to predict which answer humans would prefer more; over time, the language model was updated to produce not just “likely” continuations, but the kinds of answers humans preferred.
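A minimal sketch of the preference-learning step described above, assuming the common pairwise (Bradley-Terry style) loss. In practice the reward model is itself a large neural network scoring full conversations; here the “rewards” are just two numbers so the shape of the objective stays visible.

```python
import math

def preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Pairwise loss: small when the preferred answer already scores higher."""
    margin = reward_chosen - reward_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log(sigmoid(margin))

# The reward model is trained to push these losses down across many
# human-ranked answer pairs.
print(preference_loss(2.0, -1.0))  # ~0.05: ranking already respected
print(preference_loss(-1.0, 2.0))  # ~3.05: scores need to move apart

# Once trained, the reward model's score replaces raw "likelihood of the
# next token" as the signal the language model is optimized against.
```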

RLHF didn’t rewrite the model from scratch. Rather, it nudged the raw text completion behavior toward instruction-following assistant behavior. Most of the core skills came from pre-training; RLHF presented these skills in a way that was easier for the user to invoke.

The limitation is this: RLHF gives no guarantees about the model’s internals. Most of the time, what it does is bring the externally visible behavior more closely in line with human preferences.

The result that emerged with ChatGPT was quite interesting. Even though the underlying language model wasn’t fundamentally different, RLHF had made it usable.

Raw GPT-3 acted like a sort of oracle: you gave it text, and it continued that text, but what that continuation would be was unpredictable. ChatGPT, on the other hand, took instructions, answered questions, and could hold a dialogue. It reached 1 million users in 5 days; about two months after its release, its monthly active user count was estimated to have hit 100 million.

This speed took ChatGPT out of the realm of research demos and placed it squarely on the agenda of the public and big tech companies. The AI race gained a whole new momentum after this.

GPT-4: the second great leap

In March 2023, OpenAI released GPT-4, and the difference between the models became glaringly obvious through a benchmark.

The Uniform Bar Exam (UBE) is a rigorous test used for bar admission in 40 U.S. states and Washington D.C. as of 2023; the passing score is between 260 and 270 in most jurisdictions. According to the comparison in OpenAI’s technical report, GPT-3.5 scored 213/400 on this simulated exam and failed. GPT-4, released about three and a half months later, scored 298/400 on the same exam: a passing score in all UBE jurisdictions. OpenAI presented this as roughly top 10% performance; although this percentile interpretation was later found methodologically debatable, the raw score difference was striking in itself.

About a hundred days. From 213 to 298. Think about a human’s preparation for the bar exam: months of intense study, maybe multiple attempts. There is no “studying” here of the kind a human does: what happened was a change in the model’s scale and training process. And this is just a single example. During the same period, similar leaps occurred simultaneously in coding, math, scientific reasoning, and visual understanding.

The leaps didn’t stop in 2024 and beyond. In July 2024, DeepMind’s AlphaProof and AlphaGeometry 2 systems achieved silver-medal level success by solving four out of six problems at the International Mathematical Olympiad, the world’s most prestigious high school-level math competition. Moreover, they were just 1 point away from the gold medal threshold. Around the same time, an approach called “test-time compute” became more prominent: giving the model more compute time before answering brought massive gains, especially in reasoning questions. These examples showed that progress could come not just from more parameters, but from new axes of scaling.

No one knows for sure where this progress is heading. We don’t yet know what these systems will be able to do. The developers don’t know either. AI safety starts with taking this “we don’t know” seriously.

A difficult question remains: can we foresee the next leap?

That’s why scaling trends, benchmarks, and expert forecasts are important. In the next article, we will look at these tools.