How the eBPF verifier works, and where its proof has broken

The eBPF verifier is the piece of the Linux kernel that lets an ordinary, unprivileged program run code inside ring 0 and tries to prove, before that code ever executes, that it cannot crash, hang, or read memory it should not touch. That is an unusual bargain. Normally the kernel keeps user code at arm’s length behind a system call boundary. eBPF erases that wall on purpose, then rebuilds it as a static proof: a program is loaded as bytecode, the verifier walks every path through it, and only a program it can prove safe is allowed to run. This post takes the verifier apart from the inside, how it models registers and bounds, how it walks the program as a graph, where the proof is sound, and the real bugs where a flaw in that proof turned attacker bytecode into kernel read and write and a root shell.

Why the eBPF verifier is a security boundary

Start with what eBPF actually is, because the danger only makes sense once you see what it replaces. eBPF lets a user attach a small program to a hook inside the kernel: a network packet arriving, a system call entering, a tracepoint firing. The program runs in kernel context, with kernel speed, on kernel data. There is no context switch and no copy across a boundary. That is the whole point. It is also the whole problem.

On many distributions, loading some classes of eBPF program does not require root. An ordinary local user can hand the kernel a blob of bytecode and ask it to run that blob in the most privileged context the machine has. Nothing else in Linux works like this. A normal process that wants kernel work to happen makes a system call and waits; the kernel does the work and hands back a result. eBPF instead accepts the code itself. So the kernel cannot trust the program, and it cannot sandbox it the cheap way with a separate address space, because the entire value of eBPF is that the program runs with no isolation at all.

That leaves exactly one option. Prove the program safe before running it. The verifier is that proof engine. It performs a static analysis of the bytecode and rejects anything it cannot show is safe. If the analysis is correct, an unprivileged user can run code in ring 0 and the worst they can do is whatever the verifier permits. If the analysis is wrong, the same user runs arbitrary code in ring 0, which is the textbook definition of privilege escalation. The verifier is not a performance feature or a linter. It is the only thing standing between an unprivileged process and the kernel’s memory.

What the verifier has to prove

The verifier’s job is narrow to state and hard to do. For every instruction on every reachable path, it must show a short list of things hold:

Every memory load and store lands inside a region the program is allowed to touch, with the right size and alignment.
Every register that gets read was written first, so the program cannot leak uninitialized kernel stack.
The program always terminates, so it cannot hang the kernel in an unbounded loop.
Pointers are never leaked to user space as raw numbers, and pointer arithmetic never wanders a pointer out of its object.
Helper functions are called with arguments of the type and range they expect.

The hard one is memory access. A store like *(u64 *)(r1 + r2) = r3 is safe only if the kernel can be certain, at verification time, that r1 + r2 points somewhere legal for all values r2 could take at run time. The verifier does not get to run the program to find out. It has to reason about every possible value of r2 using nothing but the bytecode. To do that it builds an abstract model of what each register could hold.

How the proof works: registers, tnums, and bounds

The verifier runs an abstract interpretation. Instead of tracking the concrete value in each register, which it cannot know, it tracks a set of possible values, and it updates that set as it simulates each instruction. The kernel keeps a struct bpf_reg_state for all eleven registers plus the stack slots. Two parts of that state matter most.

tnum: which bits are known

The first is the tnum, short for tracked number. A tnum is a pair of 64 bit fields, a mask and a value. The kernel docs put it plainly: ones in the mask are bits whose value is unknown, and ones in the value are bits known to be one. So a register the verifier knows nothing about has an all ones mask. A register known to be exactly 8 has a zero mask and a value of 8. After an instruction like r0 &= 0xff, the verifier can mark the top 56 bits as known zero, because anding with a constant clears them no matter what was there before. The tnum is how the verifier reasons about bitwise operations and alignment without ever knowing the concrete number.

min and max bounds

The second part is a set of range bounds. For each register the verifier tracks a minimum and maximum read as unsigned, umin_value and umax_value, and a minimum and maximum read as signed, smin_value and smax_value. A conditional branch refines these. If the program does if (r2 > 8) goto ..., then on the path where the branch is taken the verifier sets r2‘s umin_value to 9, and on the fall through path it caps umax_value at 8. The branch teaches the verifier something true about the register on each side, and the verifier records it.

The tnum and the bounds describe the same register from two angles, and the verifier keeps them in sync. A known bit pattern can tighten a numeric range, and a numeric range can reveal that certain high bits must be zero. That cross talk between the two representations is where the proof gets its strength, and, as we will see, where it has repeatedly gone wrong.

Put it together with an example. The program loads an attacker controlled value into r2, then masks it: r2 &= 0x7. Now the verifier knows, from the tnum, that r2 is between 0 and 7. The program uses r2 as an index into a map value that is 8 bytes long. Because 0 through 7 are all in bounds, the verifier proves the access is safe and lets it through. The attacker never controlled the verifier’s belief, only the run time value, and the belief was true for every value. That is the proof working.

Walking the program as a graph

A proof about one instruction is easy. The verifier has to prove the whole program, and a program has branches, so the values reaching any instruction depend on the path taken to get there. The verifier handles this in two passes.

First it does a check on the control flow graph. It treats the program as a directed graph and rejects anything with an unbounded back edge, which is how it forbids loops the old way. Bounded loops are allowed in newer kernels, but the verifier still has to prove they terminate. No loop it cannot bound gets to run, because a kernel program that never returns is a kernel that never returns.

Second it walks the graph. Starting at the first instruction, it descends every reachable path, simulating each instruction and updating the register and stack state as it goes. At a branch it explores both sides, each with its own refined bounds. This is a path sensitive analysis, and it is exactly as expensive as it sounds. A program with many branches has a number of paths that grows toward exponential, and the verifier walks them.

The complexity limit and state pruning

Two mechanisms keep that walk from running forever. The first is a hard ceiling: the verifier will examine at most one million instructions across all paths before it gives up and rejects the program. This is a real number in the kernel and it is a security control, not just a resource guard. A program complex enough to exhaust the analysis is refused rather than trusted.

The second is state pruning, and it is the clever part. When the verifier reaches an instruction it has visited before on another path, it compares the current register and stack state to the states it recorded earlier. If a previous state was at least as general as the current one, meaning everything safe then is still safe now, the verifier stops walking this path. It already proved the rest. The functions states_equal and regsafe decide whether one state is covered by another. Pruning is what makes the verifier fast enough to be usable. It is also a place where a wrong judgment about whether two states are equivalent can skip the analysis of a path that was not actually safe.

The verifier does not check what a program does. It proves what a program could do, over every value and every path, using an abstract model. The dangerous bugs all live in the gap between that model and the silicon it stands in for.

Where the proof has broken: bounds tracking CVEs

The verifier is sound only if its abstract model never claims a register is more constrained than it really is. The instant the model believes a register is bounded when the true run time value is not, the proof certifies an out of bounds access as safe, and the attacker gets to read or write kernel memory. Several of the worst Linux local privilege escalations of recent years are exactly this failure. Finding them is the same discipline we describe in how hackers find vulnerabilities: understand what the system assumes, then look for the case where the assumption is false.

CVE-2020-8835: 32 bit bounds and a false belief

CVE-2020-8835, found by Manfred Paul, lived in how the verifier handled bounds for 32 bit operations. All bounds were tracked on the full 64 bit register, and the logic that tried to learn something about the lower 32 bits from a 32 bit jump made a wrong inference. The flaw, in plain terms: the verifier saw that a register’s unsigned minimum and unsigned maximum both ended in the same low bits and concluded that every value in between shared those low bits too. That does not follow. If a register ranges from 1 to 2 to the 32nd plus 1, the endpoints share a low bit pattern, but a value like 2 sits between them with completely different low bits.

An attacker built a register the verifier believed was pinned to a single safe value, usually zero, while the real value was attacker controlled. The program loaded a mystery number from a map, so its true value was hidden from static analysis, then used crafted 32 bit comparisons to trigger the faulty deduction. The verifier now trusted a bound that was a lie. Every pointer arithmetic step looked individually within limits to the verifier’s sanitation logic, but the combined offset walked the pointer clean out of the map. The result was an out of bounds read and write in kernel memory, and from there a path to administrative privileges. The fix corrected the 32 bit bounds deduction. The mitigation, the same one that applies to this whole class, was setting kernel.unprivileged_bpf_disabled to stop unprivileged users from loading programs at all.

CVE-2021-3490: ALU32 bitwise operations

A year later, CVE-2021-3490, also credited to Manfred Paul, hit the same soft spot from a different angle. The kernel had added explicit 32 bit, or ALU32, bounds tracking in 5.7. The bug was that the routines updating those 32 bit bounds for the bitwise operations AND, OR, and XOR did not always update them correctly. After one of these operations the 32 bit bounds could be left wider, or in the XOR case stale, compared to the truth the verifier should have derived from the operands.

The shape of the exploit is the same as before because the underlying failure is the same. Produce a register whose tracked bounds are tighter than the real value, walk a pointer past the end of a map using offsets the verifier believes are safe, and you have an out of bounds primitive in the kernel. The advisory states the consequence directly: the mishandled 32 bit bounds could be turned into out of bounds reads and writes, and therefore arbitrary code execution. The fix corrected the bound updates for the bitwise ops. The pattern across both CVEs is hard to miss. The 32 bit side of bounds tracking, where a value has to be reasoned about as both a 64 bit and a 32 bit quantity, is where the abstract model keeps drifting away from reality.

The speculative twist: when the model is right and the CPU still cheats

There is a second family of verifier problem that is more unsettling, because here the verifier’s logic is correct and the hardware still betrays it. Spectre style attacks exploit speculative execution: a CPU runs past a branch before it knows the branch outcome, and a load done in that speculative window can pull data into the cache even though the result is later thrown away. A bounds check that the verifier proved sufficient does nothing during speculation, because the processor speculates straight past it.

So an eBPF program the verifier honestly proved safe could still leak kernel memory through a cache side channel, by getting the CPU to speculatively read out of bounds and then measuring the cache. The verifier’s response was to grow new responsibilities. It now simulates speculative paths, the ones a mispredicted branch would take, and where it cannot rule out a speculative bounds bypass it inserts a speculation barrier, an internal nospec instruction not available to user space, to stop the CPU from running past the check. The proof had to expand from what the program does to what the silicon might speculatively do on its behalf. That is a much larger thing to prove, and it is still being hardened.

Why this class of bug keeps coming back

Look at the three failures together and a shape appears. In every case the verifier did not crash or obviously malfunction. It produced a confident, wrong answer. It proved a program safe that was not, because its model of a register disagreed, in one specific corner, with what the register would really hold. The attacker did not break the verifier. The attacker found the gap between the proof and the truth and lived in it.

That is hard to stamp out for a structural reason. The verifier is doing abstract interpretation over a model with several representations of a value, full register bounds, 32 bit bounds, signed bounds, unsigned bounds, and the tnum, and it has to keep all of them consistent with each other through every arithmetic, bitwise, and comparison instruction. Each of those update routines is a small piece of mathematics that has to be exactly right for every input. One off by a corner case and the model says bounded where the truth says free. The 32 bit bounds CVEs were precisely that, twice, in the seam where 64 bit and 32 bit reasoning meet.

Researchers have started attacking the verifier the way you would attack any safety proof, by checking the proof itself. Work like the range analysis verification effort takes the kernel’s bounds tracking functions and checks them against a reference using an automated solver, looking for any input where the verifier’s claimed bounds do not contain the real result. That is a sound way to find this bug class, because it targets the exact property that has to hold and that the CVEs violated: the abstract bounds must always be a superset of the concrete value, never a subset.

The boundary that runs through a proof

Strip away the registers and the tnums and the graph walk and one assumption is left holding the whole thing up. The kernel assumes that if the verifier accepted a program, the program is safe, and it then runs that program with full kernel privilege. Everything rides on the verifier’s answer being not just usually right but right for every value on every path, including paths the CPU only takes speculatively. The interesting bugs are never in the part of the proof that works. They live in the narrow place where the model and the machine disagree, a 32 bit bound that does not follow from a 64 bit one, a bitwise update that forgot a case, a check the silicon speculates past.

That is the whole lesson, and it generalizes well past the kernel. Any system that decides to trust input because it proved the input safe is only as strong as the gap between what it proved and what is true. Finding that gap means understanding what the system assumes and then hunting for the case where the assumption quietly fails, which is exactly the kind of work an autonomous researcher built to test assumptions, rather than match known payloads, is meant to do. The verifier is one of the most carefully built proof engines in Linux, and it has still been wrong in ways that handed out the kernel. That is not a knock on the verifier. It is the nature of proving an untrusted program safe to run in ring 0.

Frequently asked questions

What does the eBPF verifier actually do?

It is a static analysis engine inside the Linux kernel that inspects eBPF bytecode before it runs and tries to prove it is safe. It walks every reachable path, models what each register could hold using bit level tracking and numeric bounds, and rejects any program where it cannot show that all memory accesses are in bounds, the program terminates, and no uninitialized or pointer data leaks. Only a program it can prove safe is allowed to run in kernel context. The kernel documents the design at kernel.org.

Why is the verifier a security boundary?

On many systems an unprivileged local user can load some eBPF programs, and those programs run in ring 0 with full kernel privilege and no address space isolation. There is no system call wall to hide behind, so the only thing keeping that code from touching kernel memory is the verifier’s proof. If the proof is correct the user is contained. If the proof is wrong, the same user runs arbitrary code in the kernel, which is privilege escalation.

How did bounds tracking bugs like CVE-2021-3490 lead to privilege escalation?

The verifier proves a memory access is safe by tracking the range a register can hold. In CVE-2021-3490 the 32 bit bounds for the bitwise operations AND, OR and XOR were not updated correctly, so the verifier believed a register was more constrained than its real run time value. The attacker used that false belief to walk a pointer past the end of a map, giving an out of bounds read and write in kernel memory and a path to code execution. Details are in the NVD entry for CVE-2021-3490.

Can the verifier stop Spectre style speculative attacks?

Not by bounds checking alone. A CPU can speculatively run past a bounds check the verifier proved sufficient, do an out of bounds load, and leak the data through a cache side channel even though the result is discarded. To handle this the verifier now simulates speculative paths and, where it cannot rule out a speculative bounds bypass, inserts an internal nospec speculation barrier so the processor cannot run past the check. The proof had to grow from what the program does to what the hardware might speculatively do.