Many shot jailbreaking is a way to talk a language model into answering a harmful question by burying that question at the end of a very long prompt full of fake examples. The attacker writes out dozens or even hundreds of invented dialogues where a pretend assistant cheerfully answers the kind of request the real model is trained to refuse. Then they ask the real question. The model has just read a stack of evidence that the expected behavior here is to comply, so a fair share of the time it does.
In context learning, the capability the attack abuses
A language model learns two ways. The slow way is training, where weights are tuned over a huge corpus and then frozen. The fast way is in context learning, which happens at inference time and changes nothing in the weights. You show the model a few examples inside the prompt, pairs of input and the output you want, and it picks up the pattern and applies it to the next input. Give it three English sentences paired with their French translations and a fourth English sentence, and it will translate, even though you never used the word translate.
People rely on this every day to steer models without retraining. It is also the exact mechanism many shot jailbreaking turns against the model. The same pull that makes a model copy your translation examples makes it copy a long run of examples where an assistant answers dangerous questions. The model is not judging whether the examples are legitimate, it reads them as a signal of what comes next.
Why long context windows changed the threat model
For a long time prompts were short. A model might accept a couple of thousand tokens, room for a handful of in context examples and little else. With only a few examples to work from, a refusal trained into the model usually wins, because the attacker cannot show the behavior enough times to overpower what the model learned in training.
Then context windows grew by orders of magnitude, into the hundreds of thousands of tokens and beyond. That space was added for good reasons, such as reading whole documents or large codebases at once. But the same room that holds a long document holds a long list of fabricated dialogues, and the attacker now has space for hundreds of fake examples in one prompt. The capability that makes long context useful is the capability that makes this attack possible. A bigger window is a bigger surface.
How many shot jailbreaking is built
The structure is plain, which is part of why it works. The prompt is one long sequence of turns that all follow the same shape: a question that should be refused, followed by a fake assistant answer that complies. Only at the very end does the attacker place the question they actually care about, in the same format as the staged turns before it. Here is the abstract shape, with placeholders standing in for content that would never appear in a real defensive writeup:
User: [a question of a type the model should refuse] Assistant: [a fabricated answer where the fake assistant complies] User: [another such question] Assistant: [another fabricated compliant answer] ... repeated dozens to hundreds of times ... User: [the attacker's real target question, same format] Assistant:
By the final turn the model has read a wall of in context evidence that the assistant here answers these questions, so the fabricated turns outweigh the refusal it would otherwise give.
Why it works: the success rate scales with the number of shots
This is not hit or miss. As the number of fake examples, the shots, goes up, the probability of a harmful response goes up too. Researchers who studied this found the effectiveness follows a power law over a wide range of shot counts, climbing steadily as you add more examples until it levels off. Few examples, little effect. Many examples, a much higher chance of compliance.
The reason this matters is the link back to in context learning. The helpful kind follows the same shape of scaling curve as the number of demonstrations grows. The jailbreak is not a separate trick that happens to scale. It is in context learning working as designed, pointed at a behavior you did not want.
The model is doing what it was built to do, learn from the examples in front of it. The attacker just chose the examples.
It generalizes, and stronger models can be more exposed
Two findings make this harder to wave away. The first is that the effect is not tied to one kind of request. The same many example structure raises compliance across many different task types, because in context learning is general by nature. It is not a keyword trick aimed at one topic.
The second is counterintuitive. Larger and more capable models can be more susceptible, not less. A model that learns from in context examples faster and with fewer of them is, by the same token, quicker to absorb the pattern in a stack of fabricated dialogues. The quality that makes a model good at picking up your intent makes it good at picking up an attacker’s.
Defenses that hold up
The obvious idea is to shorten the context window so there is no room for hundreds of examples. That is a poor trade. Long context is one of the main reasons these models are useful, and capping it throws away the legitimate work the window was added for, while an attacker can still pack a lot into whatever window remains. The approaches that work better act on the prompt before it reaches the model:
- Fine tuning the model to recognize the pattern. Train the model on examples of this attack so it learns to treat a long run of staged compliant dialogues as a red flag and refuse at the end no matter how many examples precede it. This raises the bar but does not always close the gap.
- Classifier based input filtering. Run incoming prompts through a separate classifier that looks for the telltale structure, many repeated turns of question and compliant answer in the same format, and flag or strip them before they reach the model. Catching the shape, not just the words, is the point, because the words vary but the structure repeats.
- Prompt modification. Rewrite or reformat the incoming prompt to break the demonstrated pattern, so the staged turns no longer read as a clean run of examples to imitate.
The common thread is that you intervene on the input rather than asking the frozen model to resist a pull it was built to feel. None of these is a clean fix on its own, and stacking them is the honest posture. The scaling behavior comes from published research across many tasks, but what any given attacker achieves depends on the model and the filtering in front of it, so the trend is real while the exact numbers vary by setup.
The broader lesson
Many shot jailbreaking sits next to other prompt level attacks that turn a model’s own behavior into the weapon, such as indirect prompt injection and system prompt extraction. They all share a shape. A feature the model was given on purpose, reading external content, holding a hidden system prompt, learning from in context examples, is also the way in. The capability is the attack surface.
The bug here is an assumption baked into how the system is used: that the examples in a prompt are there to help. An attacker who fills that space with fabricated examples is not breaking a rule, they are using the model exactly as designed against a goal nobody approved. Finding flaws of that kind means asking what each capability quietly assumes, which is the approach behind UnboundCompute, an autonomous security researcher that tests a web application’s assumptions and proves what it finds with evidence. Learn more on our about page.
Frequently asked questions
What is many shot jailbreaking?
It is a long context attack that fills a prompt with many fabricated dialogues where a fake assistant answers harmful questions, then places the attacker’s real question at the end. The model reads the staged examples as a demonstration of how it should respond and is more likely to comply. The structure repeats the same question and compliant answer shape dozens to hundreds of times.
Why does many shot jailbreaking work?
It abuses in context learning, the way a model picks up a pattern from examples inside the prompt without any change to its weights. As the number of fake examples grows, the chance of a harmful response rises in a regular power law pattern. The attack is the same mechanism that makes helpful in context examples work, just pointed at a behavior you did not want.
Are larger models safer against this attack?
Not necessarily, and sometimes the opposite. Larger and more capable models tend to learn from in context examples faster and with fewer of them. That same speed makes them quicker to absorb the pattern in a stack of fabricated dialogues, so capability and exposure can rise together.
How do you defend against many shot jailbreaking?
Shrinking the context window is a poor trade because it throws away the long context that makes the model useful. Better defenses act on the prompt before it reaches the model: fine tuning the model to recognize the attack pattern, and classifier based filtering that detects the many example structure in the input. Stacking these and treating the repeated staged turns as a signal is the honest posture.
