Category: Deep Dives

Long form technical deep dives into one mechanism at a time: cloud, kernel, IoT, and privacy internals.

  • What Actually Happens In A Kernel Use After Free

    A kernel use after free is one of the few bugs that can turn an ordinary local user into root without ever touching a password file. The shape of the bug is simple to state. Some piece of kernel code frees an object, then keeps using a pointer to it. The allocator, meanwhile, hands that same memory to a different object the attacker controls. From the moment of reuse the kernel is reading and writing through a pointer that no longer means what it thinks it means. This post goes to the metal: how the kernel heap is laid out, what a freed object actually looks like in memory, the exact instant a freed slot gets reused by an attacker chosen object, and why that single overlap becomes a privilege escalation primitive rather than just a crash.

    The kernel heap is not one big pool

    Userspace programmers picture the heap as a single arena that malloc carves up. The kernel works differently, and the difference is the whole reason these bugs are exploitable in the way they are. The kernel allocates small objects through the SLUB allocator, which does not manage one pool. It manages many small pools, each one dedicated to objects of a particular size.

    When kernel code calls kmalloc(200, GFP_KERNEL), the request is rounded up to the next size class and served from a cache named for that class. There is a kmalloc-256 cache, a kmalloc-512, a kmalloc-1024, and so on. Each cache owns a set of slabs, where a slab is one or more contiguous pages of memory sliced into equal sized object slots. A kmalloc-256 slab built from a single 4096 byte page holds sixteen slots of 256 bytes each. Every object that the kernel allocates at that size lands in one of those slots.
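
    The rounding and the slot arithmetic above can be sketched in a few lines. This is a simplified model, not kernel code: the list of size classes is an assumption (the real set depends on kernel version and configuration), and size_class and slots_per_slab are hypothetical helper names.

```python
# Simplified model of kmalloc size classes. The exact list varies by kernel
# version and configuration, so treat it as an assumption.
KMALLOC_CLASSES = [8, 16, 32, 64, 96, 128, 192, 256, 512, 1024, 2048, 4096, 8192]
PAGE_SIZE = 4096


def size_class(request: int) -> int:
    """Return the smallest size class that can hold the request."""
    for cls in KMALLOC_CLASSES:
        if request <= cls:
            return cls
    raise ValueError("request too large for this simplified model")


def slots_per_slab(cls: int, pages: int = 1) -> int:
    """Number of equal sized slots in a slab built from `pages` pages."""
    return (pages * PAGE_SIZE) // cls


cls = size_class(200)
print(f"kmalloc(200) -> kmalloc-{cls}, {slots_per_slab(cls)} slots in one page")
```

    Any two requests that map to the same class compete for the same slots, whatever the objects are used for.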

    This matters because objects of the same size share a cache. A network buffer, a filesystem structure, and a credential record can all be 256 bytes, and if so they compete for slots in the same kmalloc-256 slab. That shared residency is the soil every use after free grows in. To reuse a freed object as something dangerous, an attacker needs the kernel to place the dangerous object in the slot that was just vacated. Same size, same cache, same slab. The allocator is doing exactly its job. The attacker is just choosing what fills the hole.

    What a freed object actually looks like

    Here is the detail most explanations skip. When SLUB frees an object, it does not zero it and it does not hand it back to the page allocator. It threads the slot onto a free list, and the free list lives inside the freed objects themselves. SLUB writes the address of the next free object into the slot being freed: historically into its first bytes, as in the sketch below, although kernels since 5.7 place the pointer near the middle of the object instead. The freed memory becomes a node in a singly linked list of holes.

    kmem_cache_cpu.freelist  -->  slot A
    slot A: [ next = &slot C ][ stale leftover bytes ... ]
    slot C: [ next = &slot D ][ stale leftover bytes ... ]
    slot D: [ next = NULL    ][ stale leftover bytes ... ]

    Two facts fall out of this layout. First, a freed object still contains its old contents past the embedded free pointer, so a dangling pointer can often still read meaningful stale data. Second, allocation is a pop from the head of this list. The per cpu structure kmem_cache_cpu holds a freelist field pointing at the first free slot. To allocate, SLUB reads the next pointer out of that slot, sets the free list head to it, and returns the slot. To free, it writes the current head into the slot and points the head at the slot. Allocation is last in, first out. The most recently freed object of a given size is the very next one handed out.
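
    A toy model makes the push and pop concrete. It is a sketch under stated assumptions: one slab, the next pointer stored in the first 8 bytes of each free slot, and slot offsets standing in for addresses. ToyCache and the NULL sentinel are hypothetical names, not kernel identifiers.

```python
import struct

SLOT_SIZE = 256
NULL = 0xFFFFFFFFFFFFFFFF  # sentinel meaning "no next slot" in this model


class ToyCache:
    """One slab of fixed size slots, with the free list stored in the slots."""

    def __init__(self, nslots: int):
        self.mem = bytearray(nslots * SLOT_SIZE)
        self.freelist = NULL
        # Thread every slot onto the list, highest offset first,
        # so that slot 0 ends up at the head.
        for off in range((nslots - 1) * SLOT_SIZE, -1, -SLOT_SIZE):
            self.free(off)

    def alloc(self) -> int:
        """Pop the head: read its embedded next pointer and make that the head."""
        slot = self.freelist
        if slot == NULL:
            raise MemoryError("slab is full")
        (self.freelist,) = struct.unpack_from("<Q", self.mem, slot)
        return slot

    def free(self, slot: int) -> None:
        """Push: write the current head into the slot, then point the head at it."""
        struct.pack_into("<Q", self.mem, slot, self.freelist)
        self.freelist = slot


cache = ToyCache(4)
a, b = cache.alloc(), cache.alloc()
cache.mem[a + 8 : a + 14] = b"secret"  # object contents past the free pointer
cache.free(a)
stale = bytes(cache.mem[a + 8 : a + 14])  # still readable after the free
reused = cache.alloc()  # last in, first out: the slot just freed comes back
```

    In this model the freed slot keeps its old bytes past the pointer, and the very next allocation returns it.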

    That ordering is a gift to an attacker. Free the victim, then immediately allocate an object of the same size, and you get the victim’s slot back with high reliability. No guessing, no spray needed in the simplest case. The allocator’s own efficiency hands the freed slot straight back.

    The exact moment of reuse in a kernel use after free

    Now we can describe a kernel use after free with precision instead of hand waving. Walk the timeline of a single slot.

    • At time one the kernel allocates object X into slot S and stores a pointer to it somewhere, say a field in a longer lived structure. The pointer is the reference.
    • At time two some code path frees X. SLUB threads slot S onto the free list. The reference the kernel kept is now dangling. It still points at slot S, but slot S is officially free memory.
    • At time three the attacker triggers an allocation of an object Y of the same size class. SLUB pops slot S off the free list and returns it. Object Y now lives in slot S, and crucially the attacker controls the bytes written into Y.
    • At time four the kernel uses the dangling reference, believing it still points at object X. It reads or writes through that pointer. But the bytes there are now object Y, filled by the attacker.

    The reuse at time three is the hinge. Before it, the dangling pointer points at junk and the worst case is a crash. After it, the dangling pointer points at a structure whose contents the attacker chose. The kernel is about to interpret attacker data as a trusted object. Everything that makes this a privilege escalation rather than a denial of service happens in the gap between the kernel’s mental model, which says slot S is still object X, and the physical reality, which says slot S is now object Y.
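
    The four steps of the timeline can be replayed in a small simulation. This is only an illustration: kmalloc_toy, kfree_toy, and the byte strings are hypothetical, and a Python list used as a stack stands in for the free list.

```python
SLOT_SIZE = 32
memory = bytearray(4 * SLOT_SIZE)
# Free slots as a stack; pop() takes the last entry, so slot 0 goes out first.
free_slots = [3 * SLOT_SIZE, 2 * SLOT_SIZE, SLOT_SIZE, 0]


def kmalloc_toy(data: bytes) -> int:
    """Take the most recently freed slot and fill it with the caller's bytes."""
    assert len(data) <= SLOT_SIZE
    slot = free_slots.pop()
    memory[slot : slot + len(data)] = data
    return slot


def kfree_toy(slot: int) -> None:
    """Return the slot to the free list; the bytes are left in place."""
    free_slots.append(slot)


def read(slot: int, n: int) -> bytes:
    return bytes(memory[slot : slot + n])


ref_x = kmalloc_toy(b"X:uid=1000")  # time one: object X, reference kept
kfree_toy(ref_x)                    # time two: X freed, ref_x now dangles
stale = read(ref_x, 10)             # before reuse: only stale bytes of X
y = kmalloc_toy(b"Y:uid=0000")      # time three: attacker object Y reclaims S
seen = read(ref_x, 10)              # time four: the old reference now sees Y
```

    Nothing about ref_x changed between time two and time four. Only the owner of the bytes behind it did.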

    A use after free is not a memory error in the usual sense. It is a disagreement about ownership. Two objects believe they own the same bytes, and the attacker controls which belief the CPU acts on.

    Heap grooming: making the right object land in the hole

    In a real bug the freed slot and the reuse rarely line up by luck, so attackers shape the heap first. This is heap grooming, sometimes called heap feng shui. The goal is to arrange the free list so the slot you are about to free, and then reclaim, is predictable.

    A common move is to allocate a run of filler objects to fill partially used slabs, free a few at chosen positions to open known holes, then trigger the bug so the vulnerable object lands next to or inside a slot you understand. After the free, the attacker sprays many copies of the replacement object so that even with some noise from other kernel activity, one of the sprayed copies almost certainly captures the freed slot. Message queue objects, socket buffers, and extended attribute buffers are popular spray vehicles because their size is attacker controlled and their contents are largely attacker controlled too. You pick a spray object whose size rounds into the same kmalloc cache as the victim, because reuse only works inside one cache.
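
    Why a spray tolerates noise can be shown with the same stack model. The sketch assumes the only noise is unrelated frees that land on top of the victim slot; competing allocations, which could steal the slot outright, are ignored. reclaim_by_spray is a hypothetical helper.

```python
def reclaim_by_spray(victim, noise_frees, spray_count):
    """Free the victim, let unrelated frees pile on top of it, then spray.

    Returns the index of the sprayed object that captured the victim's
    slot, or None if the spray was too small to reach it.
    """
    freelist = [victim]           # the bug frees the victim object
    freelist.extend(noise_frees)  # other activity frees same size objects
    for i in range(spray_count):
        if not freelist:
            break
        if freelist.pop() == victim:  # each sprayed object pops one slot
            return i
    return None


noise = [0x200, 0x300, 0x400]
single_shot = reclaim_by_spray(0x100, noise, spray_count=1)
sprayed = reclaim_by_spray(0x100, noise, spray_count=8)
```

    With three noisy frees in the way, a single allocation misses, while a spray of eight captures the slot on its fourth object (index 3).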

    There is a second reason grooming is necessary, and it comes from the per cpu free list. SLUB keeps a hot free list per CPU core. If the free and the reclaiming allocation run on different cores, they touch different free lists and the reclaim can miss. Exploits often pin themselves to one CPU with sched_setaffinity so the free and the spray hit the same per cpu list, restoring the clean last in, first out behavior the attack depends on. They also keep the spray objects in their own size band when they want the freed slot to come from a fresh slab rather than a busy one. These are small operational details, but they are the difference between a use after free that reclaims on the first try and one that reclaims one time in fifty.
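
    The effect of CPU pinning can be modeled with one free list per core. This is a deliberate simplification of what the paragraph describes: real SLUB has more layers than a plain per CPU stack, and PerCpuCache is a hypothetical class.

```python
from collections import defaultdict


class PerCpuCache:
    """Toy cache with an independent LIFO free list for each CPU."""

    def __init__(self):
        self.freelists = defaultdict(list)  # cpu number -> stack of free slots
        self.next_fresh = 0x1000            # next never-used slot address

    def alloc(self, cpu: int) -> int:
        if self.freelists[cpu]:
            return self.freelists[cpu].pop()
        slot = self.next_fresh              # nothing free here: take a fresh slot
        self.next_fresh += 0x100
        return slot

    def free(self, slot: int, cpu: int) -> None:
        self.freelists[cpu].append(slot)


cache = PerCpuCache()
victim = cache.alloc(cpu=0)
cache.free(victim, cpu=0)       # the bug frees the victim on CPU 0
unpinned = cache.alloc(cpu=1)   # reclaim attempt from another core misses
pinned = cache.alloc(cpu=0)     # reclaim attempt on the same core hits
```

    The unpinned attempt gets a fresh slot and leaves the victim's slot sitting on CPU 0's list, which is the miss that pinning is meant to avoid.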

    Cache merging widens the field

    SLUB also merges caches to save memory. Two caches that ask for the same object size and compatible flags can be folded into one shared cache at boot. The practical effect for an attacker is that an object you would expect to be isolated may in fact share a slab with general kmalloc allocations of the same size, because the kernel merged them. That expands the set of objects you can use to reclaim a freed slot. It also explains why a defense as simple as giving a sensitive structure a dedicated, non mergeable cache closes a whole class of reuse. If the victim cannot share a slab with anything you can spray, you cannot reclaim its freed slot with a chosen object, and the use after free loses its teeth.

    Why reuse becomes power: choosing the victim object

    Reuse alone is not escalation. What makes a use after free a root shell is the choice of which object reclaims the freed slot. The attacker wants an object that, once it overlaps the dangling reference, gives control over something the kernel trusts. Three classic targets show the range.

    A function pointer you can aim

    Some kernel objects hold a pointer to an operations table, a struct full of function pointers the kernel calls to do work. struct pipe_buffer is the textbook example. It carries a field ops that points at a static table such as anon_pipe_buf_ops, and the kernel calls through that table when a pipe is read, released, or confirmed. If an attacker reclaims a freed slot with a pipe_buffer whose ops field they control, the next pipe operation calls a function pointer of the attacker’s choosing. That is control flow hijack, the path toward running a chosen sequence of kernel instructions.

    A length or pointer field you can lie about

    Other victims do not need a function pointer at all. If the reclaiming object exposes a length field or a data pointer that the kernel later trusts for a copy, overwriting it turns a bounded operation into an arbitrary read or write. A message object whose size field has been inflated lets the kernel copy far more than the original allocation, reading neighboring kernel memory back to the attacker. This is the data only road, and it does not care about code at all.
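
    A minimal sketch of the inflated length idea, using a toy message layout (an 8 byte length followed by the payload) that is an assumption for illustration and not the real msg_msg layout. write_msg and read_msg are hypothetical helpers.

```python
import struct

HDR = struct.Struct("<Q")  # toy header: one 8 byte length field
heap = bytearray(64)       # two adjacent 32 byte slots


def write_msg(slot: int, payload: bytes) -> None:
    HDR.pack_into(heap, slot, len(payload))
    heap[slot + 8 : slot + 8 + len(payload)] = payload


def read_msg(slot: int) -> bytes:
    """Copy out as many bytes as the header claims, with no other check."""
    (length,) = HDR.unpack_from(heap, slot)
    return bytes(heap[slot + 8 : slot + 8 + length])


write_msg(0, b"hello")               # slot 0: the message
heap[32:48] = b"NEIGHBOUR-SECRET"    # slot 1: an unrelated neighbouring object

honest = read_msg(0)
HDR.pack_into(heap, 0, 40)           # attacker inflates the trusted length
leaked = read_msg(0)                 # the copy now runs into slot 1
```

    The read path never changed. It simply trusted a field that the attacker was able to rewrite.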

    A credential you can swap

    The cleanest escalation skips memory corruption entirely. Every process points at a struct cred that records its uid and gid. A uid of zero is root. The DirtyCred technique, presented at Black Hat USA 2022, builds on exactly this. Rather than forging bytes, it frees a credential or file object the process relies on, then races to allocate a privileged object of the same type into the freed slot. The kernel keeps using its dangling reference, except the reference now resolves to a privileged credential. The process is root because it is pointing at root’s credentials, and no kernel address ever needed to leak. The free list did the swap.

    The file flavor of the same idea is worth seeing because it shows how little corruption a strong technique needs. An attacker opens a writable file, which the kernel checks and approves, then begins a write. Between the permission check and the actual write the attacker frees the file object through the bug and reallocates the slot with a file object opened against a read only target. The write the kernel already approved now lands on the read only file, because the reference it followed points at the swapped object. There is no forged pointer and no leaked address. The whole exploit is a well timed free and a reclaim, which is why these data only techniques survive across kernel versions and architectures that break pointer based exploits. They depend only on the allocator doing what it always does: hand a freed slot to the next request of the right size.
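
    The credential flavor of the swap reduces to a few lines once the allocator is modeled as a stack. This is an illustration only: Cred and CredCache are hypothetical stand-ins for struct cred and its dedicated cache, and the race that real exploits must win is collapsed into straight line code.

```python
from dataclasses import dataclass


@dataclass
class Cred:
    uid: int
    gid: int


class CredCache:
    """Toy dedicated cache: slot address -> object, with LIFO reuse."""

    def __init__(self):
        self.slots = {}
        self.free = []
        self.next_addr = 0x1000

    def alloc(self, cred: Cred) -> int:
        if self.free:
            addr = self.free.pop()
        else:
            addr = self.next_addr
            self.next_addr += 0x100
        self.slots[addr] = cred
        return addr

    def release(self, addr: int) -> None:
        self.free.append(addr)  # the stale object is left in place


cache = CredCache()
task_cred = cache.alloc(Cred(uid=1000, gid=1000))  # unprivileged task's cred
cache.release(task_cred)                # the bug frees it; task_cred dangles
root_cred = cache.alloc(Cred(uid=0, gid=0))  # a privileged cred is allocated
effective = cache.slots[task_cred]      # what the task's reference resolves to
```

    No byte of any object was forged here. The task's reference is unchanged, and the allocator placed a legitimate privileged object behind it.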

    A real kernel use after free walked end to end

    Concrete beats abstract, so anchor this in a documented bug. CVE-2021-22555 is a heap out of bounds write in the netfilter subsystem that had been present since Linux 2.6.19 in 2006, reachable by an unprivileged user through a user namespace. It is not itself a use after free, but the public writeup turns it into one, and the steps map onto everything above.

    The flaw is a small overflow. When the kernel translates 32 bit iptables rules into 64 bit form, a memset writes a short run of zero bytes just past the end of an allocation. A few zero bytes does not sound like much. The exploit makes it enough.

    The groom uses System V message queues, whose struct msg_msg headers begin with list pointers that link each message to the next one in its queue, and which live in a controllable kmalloc cache. The attacker lays out primary and secondary messages so that two of the overflowing zero bytes land on the m_list.next pointer of a primary message header, clearing its low bytes and bending it to point at a secondary message that another primary message already references. Now two message references point at one underlying object. Reading the message through one path frees the shared object while the other path keeps a stale reference. That stale reference is the use after free, manufactured out of a tiny overflow.

    From there the pattern is the one we built. The attacker sprays struct pipe_buffer objects to reclaim the freed slot, reads back through the dangling reference to leak the address of a static kernel table and defeat KASLR, then reclaims again with a pipe_buffer whose ops pointer is forged. Closing the pipe calls through the forged table, redirecting kernel control flow into a chain that runs commit_creds(prepare_kernel_cred(NULL)), which installs root credentials on the current process. One overflow of two zero bytes, groomed into a use after free, reclaimed by a chosen victim, escalated to root. Every link is a piece described above. The MITRE record for the bug is CVE-2021-22555.

    Why the kernel cannot just notice

    A fair question is why the kernel does not simply detect that an object was freed and refuse to use it. The answer is that at the machine level there is nothing to detect. A pointer is a number. A freed slot is the same bytes it was a microsecond ago, minus the embedded free pointer SLUB wrote at the front. The CPU dereferencing a dangling pointer sees a valid mapped address with plausible contents. Nothing faults. The type system that would have caught this lived in the source code and was compiled away.

    Defenses therefore attack the mechanics rather than the intent. Freelist pointer hardening, enabled by CONFIG_SLAB_FREELIST_HARDENED, stores the embedded next pointer obfuscated rather than raw. Instead of writing the next address plainly, SLUB stores it as the address XORed with a per cache random secret and with the slot’s own location, so a value computed roughly as ptr ^ slab_secret ^ slot_address. An attacker who overwrites a freed slot can no longer forge a valid free pointer without knowing the secret, which blocks the trick of pointing the free list at an arbitrary address. Cache separation moves sensitive objects out of the general kmalloc caches so they cannot share a slab with attacker controlled sprays. Credentials, for example, were given their own dedicated cache with account flags so they no longer merge with general allocations, which is why straightforward credential overwrites stopped working and attackers moved to cross cache techniques. Allocator quarantine and randomization delay and shuffle reuse so that the clean last in, first out reclaim is no longer a sure thing.
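
    The obfuscation can be sketched with plain XOR, following the rough formula in the paragraph above. The kernel's actual encoding differs in detail, the addresses are made up, and a fixed constant stands in for the per cache random secret so that the example is reproducible. encode_next and decode_next are hypothetical names.

```python
SECRET = 0x9E3779B97F4A7C15  # stand-in for the per cache random secret


def encode_next(next_ptr: int, slot_addr: int, secret: int) -> int:
    """Value actually stored inside the freed slot."""
    return next_ptr ^ secret ^ slot_addr


def decode_next(stored: int, slot_addr: int, secret: int) -> int:
    """What the allocator recovers when it pops the slot."""
    return stored ^ secret ^ slot_addr


slot = 0xFFFF888012345600   # hypothetical address of the freed slot
nxt = 0xFFFF888012345700    # hypothetical address of the next free slot

stored = encode_next(nxt, slot, SECRET)
recovered = decode_next(stored, slot, SECRET)

# An attacker who overwrites the slot with a raw target address,
# without knowing the secret, does not get that address back out.
target = 0xFFFFFFFF81000000
forged_result = decode_next(target, slot, SECRET)
```

    The honest pointer round trips, while the forged one decodes to an address the attacker did not choose.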

    None of these make the underlying bug disappear. They raise the cost of the step between free and reuse. That is the honest framing: the dangling pointer is still wrong, the hardening only makes the wrongness harder to convert into control. Spotting the dangling pointer in the first place is a reasoning problem, the same kind of assumption testing covered in our piece on how vulnerabilities are actually found, and the escalation that follows is the classic privilege escalation story told at the level of slab slots.

    The assumption that outlived its reference

    Strip away the slabs and the spray and the forged tables and one assumption is left standing. The allocator assumes that when an object is freed, every reference to it is gone. Freeing is a promise the rest of the kernel makes: I am done with this, you may give the bytes to someone else. A use after free is that promise broken. A reference survived the free, and it kept pointing at the slot after the allocator handed those bytes to another owner.

    Everything dangerous follows from that single broken promise. The size class sharing, the last in first out reclaim, the choice of a credential or a function pointer as the new tenant, all of it is just leverage applied to a reference that outlived its assumption. The allocator is not buggy and the victim object is not buggy. The bug is a pointer that should have been forgotten and was not. Finding that surviving reference, the one the code assumed could never still be live, is the whole game, and it is exactly the kind of assumption an autonomous researcher built to question what each component trusts is meant to surface before an attacker does.

    Frequently asked questions

    What is a kernel use after free in simple terms?

    It is a bug where the kernel frees an object but keeps a pointer to it, then the allocator hands that same memory to a different object. When the kernel uses the old pointer it reads or writes a structure that someone else now owns. If an attacker controls the contents of that new object, the kernel ends up trusting attacker chosen bytes as if they were a legitimate object.

    Why does the SLUB allocator make use after free bugs exploitable?

    SLUB serves objects from per size caches like kmalloc-256, and objects of the same size share slabs. It threads freed slots onto a free list stored inside the freed objects, and allocation pops from the head, so the most recently freed slot is the next one returned. An attacker frees the victim then immediately allocates a same size object to reclaim that exact slot with reliable timing.

    How does a use after free turn into root access?

    The freed slot is reclaimed by a victim object that gives control over something trusted. That can be a function pointer table like the ops field of a struct pipe_buffer, a length field that enables an arbitrary read or write, or a struct cred whose uid the attacker swaps for zero. The DirtyCred technique uses the credential swap path. A documented end to end example is CVE-2021-22555 in netfilter.

    Can the kernel detect a dangling pointer on its own?

    Not at runtime. A pointer is just a number and a freed slot still holds plausible bytes, so dereferencing it does not fault. Mitigations such as CONFIG_SLAB_FREELIST_HARDENED, dedicated caches for sensitive objects, and reuse randomization raise the cost of converting the bug into control, but they do not remove the surviving reference. The kernel.org Kernel Self Protection documentation describes the hardening option.

  • How the eBPF verifier works, and where its proof has broken

    The eBPF verifier is the piece of the Linux kernel that lets an ordinary, unprivileged program run code inside ring 0 and tries to prove, before that code ever executes, that it cannot crash, hang, or read memory it should not touch. That is an unusual bargain. Normally the kernel keeps user code at arm’s length behind a system call boundary. eBPF erases that wall on purpose, then rebuilds it as a static proof: a program is loaded as bytecode, the verifier walks every path through it, and only a program it can prove safe is allowed to run. This post takes the verifier apart from the inside, how it models registers and bounds, how it walks the program as a graph, where the proof is sound, and the real bugs where a flaw in that proof turned attacker bytecode into kernel read and write and a root shell.

    Why the eBPF verifier is a security boundary

    Start with what eBPF actually is, because the danger only makes sense once you see what it replaces. eBPF lets a user attach a small program to a hook inside the kernel: a network packet arriving, a system call entering, a tracepoint firing. The program runs in kernel context, with kernel speed, on kernel data. There is no context switch and no copy across a boundary. That is the whole point. It is also the whole problem.

    On many distributions, loading some classes of eBPF program does not require root. An ordinary local user can hand the kernel a blob of bytecode and ask it to run that blob in the most privileged context the machine has. Nothing else in Linux works like this. A normal process that wants kernel work to happen makes a system call and waits; the kernel does the work and hands back a result. eBPF instead accepts the code itself. So the kernel cannot trust the program, and it cannot sandbox it the cheap way with a separate address space, because the entire value of eBPF is that the program runs with no isolation at all.

    That leaves exactly one option. Prove the program safe before running it. The verifier is that proof engine. It performs a static analysis of the bytecode and rejects anything it cannot show is safe. If the analysis is correct, an unprivileged user can run code in ring 0 and the worst they can do is whatever the verifier permits. If the analysis is wrong, the same user runs arbitrary code in ring 0, which is the textbook definition of privilege escalation. The verifier is not a performance feature or a linter. It is the only thing standing between an unprivileged process and the kernel’s memory.

    What the verifier has to prove

    The verifier’s job is narrow to state and hard to do. For every instruction on every reachable path, it must show a short list of things hold:

    • Every memory load and store lands inside a region the program is allowed to touch, with the right size and alignment.
    • Every register that gets read was written first, so the program cannot leak uninitialized kernel stack.
    • The program always terminates, so it cannot hang the kernel in an unbounded loop.
    • Pointers are never leaked to user space as raw numbers, and pointer arithmetic never wanders a pointer out of its object.
    • Helper functions are called with arguments of the type and range they expect.

    The hard one is memory access. A store like *(u64 *)(r1 + r2) = r3 is safe only if the kernel can be certain, at verification time, that r1 + r2 points somewhere legal for all values r2 could take at run time. The verifier does not get to run the program to find out. It has to reason about every possible value of r2 using nothing but the bytecode. To do that it builds an abstract model of what each register could hold.

    How the proof works: registers, tnums, and bounds

    The verifier runs an abstract interpretation. Instead of tracking the concrete value in each register, which it cannot know, it tracks a set of possible values, and it updates that set as it simulates each instruction. The kernel keeps a struct bpf_reg_state for each of the eleven registers, r0 through r10, and for the stack slots. Two parts of that state matter most.

    tnum: which bits are known

    The first is the tnum, short for tracked number. A tnum is a pair of 64 bit fields, a mask and a value. The kernel docs put it plainly: ones in the mask are bits whose value is unknown, and ones in the value are bits known to be one. So a register the verifier knows nothing about has an all ones mask. A register known to be exactly 8 has a zero mask and a value of 8. After an instruction like r0 &= 0xff, the verifier can mark the top 56 bits as known zero, because anding with a constant clears them no matter what was there before. The tnum is how the verifier reasons about bitwise operations and alignment without ever knowing the concrete number.
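
    A small model of the tnum shows the r0 &= 0xff case. It is a sketch modeled on the mask and value description above. The Tnum class and helper names are hypothetical Python stand-ins, and the AND rule is my reading of how known bits combine rather than a copy of kernel source.

```python
from dataclasses import dataclass

U64 = (1 << 64) - 1


@dataclass(frozen=True)
class Tnum:
    value: int  # bits known to be 1
    mask: int   # bits whose value is unknown


TNUM_UNKNOWN = Tnum(0, U64)


def tnum_const(c: int) -> Tnum:
    return Tnum(c & U64, 0)


def tnum_and(a: Tnum, b: Tnum) -> Tnum:
    alpha = a.value | a.mask  # bits that might be 1 in a
    beta = b.value | b.mask   # bits that might be 1 in b
    v = a.value & b.value     # bits certainly 1 in both
    # A result bit is unknown if it might be 1 in both but is not certainly 1.
    return Tnum(v, alpha & beta & ~v & U64)


def known_zero_bits(t: Tnum) -> int:
    return ~(t.value | t.mask) & U64


r0 = tnum_and(TNUM_UNKNOWN, tnum_const(0xFF))
top_56 = U64 & ~0xFF
```

    Starting from a register about which nothing is known, the AND leaves only the low 8 bits unknown and marks the remaining 56 as known zero.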

    min and max bounds

    The second part is a set of range bounds. For each register the verifier tracks a minimum and maximum read as unsigned, umin_value and umax_value, and a minimum and maximum read as signed, smin_value and smax_value. A conditional branch refines these. If the program does if (r2 > 8) goto ..., then on the path where the branch is taken the verifier sets r2’s umin_value to 9, and on the fall through path it caps umax_value at 8. The branch teaches the verifier something true about the register on each side, and the verifier records it.
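
    The branch refinement for an unsigned greater than comparison can be written down directly. This is a sketch that tracks only the unsigned pair; Bounds and refine_jgt are hypothetical names, and the real verifier updates the signed bounds and the tnum at the same time.

```python
from dataclasses import dataclass, replace

U64_MAX = (1 << 64) - 1


@dataclass(frozen=True)
class Bounds:
    umin: int = 0
    umax: int = U64_MAX


def refine_jgt(reg: Bounds, k: int):
    """Split the state at `if (reg > k) goto ...` (unsigned compare).

    Returns (taken, fallthrough). A side is None when no value in the
    current range can reach it.
    """
    taken = replace(reg, umin=max(reg.umin, k + 1)) if reg.umax > k else None
    fall = replace(reg, umax=min(reg.umax, k)) if reg.umin <= k else None
    return taken, fall


taken, fall = refine_jgt(Bounds(), 8)
```

    For an unconstrained r2 this yields umin_value 9 on the taken side and umax_value 8 on the fall through side, matching the example in the text.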

    The tnum and the bounds describe the same register from two angles, and the verifier keeps them in sync. A known bit pattern can tighten a numeric range, and a numeric range can reveal that certain high bits must be zero. That cross talk between the two representations is where the proof gets its strength, and, as we will see, where it has repeatedly gone wrong.

    Put it together with an example. The program loads an attacker controlled value into r2, then masks it: r2 &= 0x7. Now the verifier knows, from the tnum, that r2 is between 0 and 7. The program uses r2 as the offset of a one byte access into a map value that is 8 bytes long. Because offsets 0 through 7 are all in bounds for a one byte access, the verifier proves the access is safe and lets it through. The attacker never controlled the verifier’s belief, only the run time value, and the belief was true for every value. That is the proof working.
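
    The check in that example amounts to deriving a range from the known bits and comparing it with the size of the region. A minimal sketch follows, with hypothetical helpers tnum_range and access_ok; the tnum is passed as a bare (value, mask) pair.

```python
def tnum_range(value: int, mask: int):
    """Smallest and largest number consistent with a (value, mask) tnum."""
    return value, value | mask


def access_ok(off_min: int, off_max: int, access_size: int, region_size: int) -> bool:
    """Safe only if the whole access fits for every offset in the range."""
    return off_min >= 0 and off_max + access_size <= region_size


# After r2 &= 0x7 on an unknown register: value = 0, mask = 0x7.
lo, hi = tnum_range(0, 0x7)
one_byte = access_ok(lo, hi, access_size=1, region_size=8)
two_byte = access_ok(lo, hi, access_size=2, region_size=8)

# With a weaker mask such as r2 &= 0xff the same access cannot be proven.
lo_wide, hi_wide = tnum_range(0, 0xFF)
unproven = access_ok(lo_wide, hi_wide, access_size=1, region_size=8)
```

    The one byte access is accepted, while a two byte access at the same offsets is rejected because offset 7 would touch byte 8.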

    Walking the program as a graph

    A proof about one instruction is easy. The verifier has to prove the whole program, and a program has branches, so the values reaching any instruction depend on the path taken to get there. The verifier handles this in two passes.

    First it checks the control flow graph. It treats the program as a directed graph and looks for back edges. Older kernels rejected any back edge at this stage, which is how they forbade loops outright. Bounded loops are allowed since Linux 5.3, but the verifier still has to prove they terminate. No loop it cannot bound gets to run, because a kernel program that never returns is a kernel that never returns.

    Second it walks the graph. Starting at the first instruction, it descends every reachable path, simulating each instruction and updating the register and stack state as it goes. At a branch it explores both sides, each with its own refined bounds. This is a path sensitive analysis, and it is exactly as expensive as it sounds. A program with many branches has a number of paths that grows toward exponential, and the verifier walks them.

    The complexity limit and state pruning

    Two mechanisms keep that walk from running forever. The first is a hard ceiling: the verifier will examine at most one million instructions across all paths before it gives up and rejects the program. This is a real number in the kernel and it is a security control, not just a resource guard. A program complex enough to exhaust the analysis is refused rather than trusted.

    The second is state pruning, and it is the clever part. When the verifier reaches an instruction it has visited before on another path, it compares the current register and stack state to the states it recorded earlier. If a previous state was at least as general as the current one, meaning everything safe then is still safe now, the verifier stops walking this path. It already proved the rest. The functions states_equal and regsafe decide whether one state is covered by another. Pruning is what makes the verifier fast enough to be usable. It is also a place where a wrong judgment about whether two states are equivalent can skip the analysis of a path that was not actually safe.
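
    The core of the pruning test is a containment check: a path may be skipped only if the earlier state allowed everything the current one allows. The sketch below compares unsigned ranges only. RegRange, reg_covered, and state_covered are hypothetical names, and the kernel's states_equal and regsafe compare far more than this (register types, tnums, stack contents, and so on).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RegRange:
    umin: int
    umax: int


def reg_covered(old: RegRange, cur: RegRange) -> bool:
    """True when every value cur allows was already allowed by old."""
    return old.umin <= cur.umin and cur.umax <= old.umax


def state_covered(old_state: dict, cur_state: dict) -> bool:
    """Prune only if every register in the old state covers the current one."""
    return all(reg_covered(old_state[r], cur_state[r]) for r in old_state)


verified = {"r1": RegRange(0, 0), "r2": RegRange(0, 7)}
narrower = {"r1": RegRange(0, 0), "r2": RegRange(2, 5)}
wider = {"r1": RegRange(0, 0), "r2": RegRange(0, 15)}

can_prune = state_covered(verified, narrower)
must_walk = not state_covered(verified, wider)
```

    If the comparison ever answers "covered" for a state that is actually wider, the verifier skips a path it never proved, which is the failure mode the paragraph describes.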

    The verifier does not check what a program does. It proves what a program could do, over every value and every path, using an abstract model. The dangerous bugs all live in the gap between that model and the silicon it stands in for.

    Where the proof has broken: bounds tracking CVEs

    The verifier is sound only if its abstract model never claims a register is more constrained than it really is. The instant the model believes a register is bounded when the true run time value is not, the proof certifies an out of bounds access as safe, and the attacker gets to read or write kernel memory. Several of the worst Linux local privilege escalations of recent years are exactly this failure. Finding them is the same discipline we describe in how hackers find vulnerabilities: understand what the system assumes, then look for the case where the assumption is false.

    CVE-2020-8835: 32 bit bounds and a false belief

    CVE-2020-8835, found by Manfred Paul, lived in how the verifier handled bounds for 32 bit operations. All bounds were tracked on the full 64 bit register, and the logic that tried to learn something about the lower 32 bits from a 32 bit jump made a wrong inference. The flaw, in plain terms: the verifier saw that a register’s unsigned minimum and unsigned maximum both ended in the same low bits and concluded that every value in between shared those low bits too. That does not follow. If a register ranges from 1 to 2^32 + 1, the endpoints share a low bit pattern, but a value like 2 sits between them with completely different low bits.
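
    The counterexample is easy to check by hand or in code. The snippet below restates the flawed inference as the paragraph describes it, not as the kernel implemented it; naive_low32_is_constant and counterexamples are hypothetical names.

```python
def low32(x: int) -> int:
    return x & 0xFFFFFFFF


def naive_low32_is_constant(umin: int, umax: int) -> bool:
    """Flawed inference: equal low bits at the endpoints imply equal low
    bits for every value in between."""
    return low32(umin) == low32(umax)


def counterexamples(umin: int, umax: int, limit: int = 4):
    """First few in-range values whose low 32 bits differ from umin's."""
    found = []
    for v in range(umin, umax + 1):
        if low32(v) != low32(umin):
            found.append(v)
            if len(found) == limit:
                break
    return found


umin, umax = 1, 2**32 + 1
claimed_constant = naive_low32_is_constant(umin, umax)
witnesses = counterexamples(umin, umax)
```

    The naive rule claims the low 32 bits are pinned to 1, yet the values 2, 3, 4, and 5 all lie inside the range with different low bits.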

    An attacker built a register the verifier believed was pinned to a single safe value, usually zero, while the real value was attacker controlled. The program loaded a mystery number from a map, so its true value was hidden from static analysis, then used crafted 32 bit comparisons to trigger the faulty deduction. The verifier now trusted a bound that was a lie. Every pointer arithmetic step looked individually within limits to the verifier’s sanitation logic, but the combined offset walked the pointer clean out of the map. The result was an out of bounds read and write in kernel memory, and from there a path to administrative privileges. The fix corrected the 32 bit bounds deduction. The mitigation, the same one that applies to this whole class, was setting kernel.unprivileged_bpf_disabled to stop unprivileged users from loading programs at all.

    CVE-2021-3490: ALU32 bitwise operations

    A year later, CVE-2021-3490, also credited to Manfred Paul, hit the same soft spot from a different angle. The kernel had added explicit 32 bit, or ALU32, bounds tracking in 5.7. The bug was that the routines updating those 32 bit bounds for the bitwise operations AND, OR, and XOR did not always update them correctly. After one of these operations the 32 bit bounds could be left stale rather than recomputed, so they no longer matched the truth the verifier should have derived from the operands.

    The shape of the exploit is the same as before because the underlying failure is the same. Produce a register whose tracked bounds are tighter than the real value, walk a pointer past the end of a map using offsets the verifier believes are safe, and you have an out of bounds primitive in the kernel. The advisory states the consequence directly: the mishandled 32 bit bounds could be turned into out of bounds reads and writes, and therefore arbitrary code execution. The fix corrected the bound updates for the bitwise ops. The pattern across both CVEs is hard to miss. The 32 bit side of bounds tracking, where a value has to be reasoned about as both a 64 bit and a 32 bit quantity, is where the abstract model keeps drifting away from reality.

    The speculative twist: when the model is right and the CPU still cheats

    There is a second family of verifier problem that is more unsettling, because here the verifier’s logic is correct and the hardware still betrays it. Spectre style attacks exploit speculative execution: a CPU runs past a branch before it knows the branch outcome, and a load done in that speculative window can pull data into the cache even though the result is later thrown away. A bounds check that the verifier proved sufficient does nothing during speculation, because the processor speculates straight past it.

    So an eBPF program the verifier honestly proved safe could still leak kernel memory through a cache side channel, by getting the CPU to speculatively read out of bounds and then measuring the cache. The verifier’s response was to grow new responsibilities. It now simulates speculative paths, the ones a mispredicted branch would take, and where it cannot rule out a speculative bounds bypass it inserts a speculation barrier, an internal nospec instruction not available to user space, to stop the CPU from running past the check. The proof had to expand from what the program does to what the silicon might speculatively do on its behalf. That is a much larger thing to prove, and it is still being hardened.

    Why this class of bug keeps coming back

    Look at the three failures together and a shape appears. In every case the verifier did not crash or obviously malfunction. It produced a confident, wrong answer. It proved a program safe that was not, because its model of a register disagreed, in one specific corner, with what the register would really hold. The attacker did not break the verifier. The attacker found the gap between the proof and the truth and lived in it.

    That is hard to stamp out for a structural reason. The verifier is doing abstract interpretation over a model that keeps several representations of each value (full register bounds, 32 bit bounds, signed bounds, unsigned bounds, and the tnum), and it has to keep all of them consistent with each other through every arithmetic, bitwise, and comparison instruction. Each of those update routines is a small piece of mathematics that has to be exactly right for every input. Let one routine miss a single corner case and the model says bounded where the truth says free. The 32 bit bounds CVEs were precisely that, twice, in the seam where 64 bit and 32 bit reasoning meet.
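
    To make that seam concrete, here is a rough sketch of deriving unsigned 32 bit bounds from an unsigned 64 bit range in two ways. It is a toy model, not kernel code and not the logic of either CVE: the function names are invented for this post, and the real verifier tracks far more state than one pair of bounds.

```python
U32_MAX = 0xFFFF_FFFF

def naive_u32_bounds(lo64: int, hi64: int) -> tuple[int, int]:
    """Wrong on purpose: truncate each 64 bit endpoint on its own."""
    return lo64 & U32_MAX, hi64 & U32_MAX

def sound_u32_bounds(lo64: int, hi64: int) -> tuple[int, int]:
    """Truncated endpoints are only valid when the range never crosses a
    2**32 boundary. If it does, the low 32 bits wrap, so report the full
    32 bit range instead of guessing."""
    if (lo64 >> 32) == (hi64 >> 32):
        return lo64 & U32_MAX, hi64 & U32_MAX
    return 0, U32_MAX

def contains_all(bounds: tuple[int, int], lo64: int, hi64: int) -> bool:
    """Brute force: is the low half of every concrete value inside bounds?"""
    lo32, hi32 = bounds
    return all(lo32 <= (v & U32_MAX) <= hi32 for v in range(lo64, hi64 + 1))

# A 64 bit range holding only two values, straddling the 2**32 boundary.
lo, hi = 0xFFFF_FFFF, 0x1_0000_0000
for rule in (naive_u32_bounds, sound_u32_bounds):
    bounds = rule(lo, hi)
    print(rule.__name__, bounds, contains_all(bounds, lo, hi))
```

    The naive rule reports a minimum of 0xFFFFFFFF and a maximum of 0, a range that contains neither real value. The sound rule gives up precision and stays correct, which is the trade the verifier has to make every time.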

    Researchers have started attacking the verifier the way you would attack any safety proof, by checking the proof itself. Work like the range analysis verification effort takes the kernel’s bounds tracking functions and checks them against a reference using an automated solver, looking for any input where the verifier’s claimed bounds do not contain the real result. That is a sound way to find this bug class, because it targets the exact property that has to hold and that the CVEs violated: the abstract bounds must always be a superset of the concrete value, never a subset.
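
    The superset property can be checked by brute force when the bit width is tiny. The sketch below is an illustration only, not the researchers' tooling (they use a solver rather than enumeration). It models a tnum as a (value, mask) pair at 4 bits, uses an AND rule modeled on the kernel's tnum AND as best I can describe it, and compares it with a deliberately broken rule. All names are made up for this post.

```python
from itertools import product

BITS = 4                      # tiny width so brute force is instant
FULL = (1 << BITS) - 1

def all_tnums():
    """Every well formed (value, mask): no bit is both known-1 and unknown."""
    return [(v, m) for v, m in product(range(FULL + 1), repeat=2) if v & m == 0]

def members(tnum):
    """Concrete values a tnum stands for: known bits fixed, unknown bits free."""
    value, mask = tnum
    return [x for x in range(FULL + 1) if (x & ~mask) == value]

def tnum_and(a, b):
    """Abstract AND: a result bit is known-1 only if it is known-1 in both."""
    alpha = a[0] | a[1]       # bits that might be 1 in a
    beta = b[0] | b[1]        # bits that might be 1 in b
    value = a[0] & b[0]       # bits certainly 1 in both
    return value, alpha & beta & ~value

def buggy_and(a, b):
    """Wrong on purpose: pretends every result bit is known."""
    return a[0] & b[0], 0

def find_counterexample(abstract_op):
    """Return (a, b, x, y) where x & y escapes the claimed result, else None."""
    for a, b in product(all_tnums(), repeat=2):
        allowed = set(members(abstract_op(a, b)))
        for x, y in product(members(a), members(b)):
            if (x & y) not in allowed:
                return a, b, x, y
    return None

print("tnum_and :", find_counterexample(tnum_and))
print("buggy_and:", find_counterexample(buggy_and))
```

    A counterexample from this search is exactly the kind of input an exploit needs: a pair of operands where the real result falls outside what the abstract rule claimed.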

    The boundary that runs through a proof

    Strip away the registers and the tnums and the graph walk and one assumption is left holding the whole thing up. The kernel assumes that if the verifier accepted a program, the program is safe, and it then runs that program with full kernel privilege. Everything rides on the verifier’s answer being not just usually right but right for every value on every path, including paths the CPU only takes speculatively. The interesting bugs are never in the part of the proof that works. They live in the narrow place where the model and the machine disagree: a 32 bit bound that does not follow from a 64 bit one, a bitwise update that forgot a case, a check the silicon speculates past.

    That is the whole lesson, and it generalizes well past the kernel. Any system that decides to trust input because it proved the input safe is only as strong as the gap between what it proved and what is true. Finding that gap means understanding what the system assumes and then hunting for the case where the assumption quietly fails, which is exactly the kind of work an autonomous researcher built to test assumptions, rather than match known payloads, is meant to do. The verifier is one of the most carefully built proof engines in Linux, and it has still been wrong in ways that handed out the kernel. That is not a knock on the verifier. It is the nature of proving an untrusted program safe to run in ring 0.

    Frequently asked questions

    What does the eBPF verifier actually do?

    It is a static analysis engine inside the Linux kernel that inspects eBPF bytecode before it runs and tries to prove it is safe. It walks every reachable path, models what each register could hold using bit level tracking and numeric bounds, and rejects any program where it cannot show that all memory accesses are in bounds, the program terminates, and no uninitialized or pointer data leaks. Only a program it can prove safe is allowed to run in kernel context. The kernel documents the design at kernel.org.

    Why is the verifier a security boundary?

    On many systems an unprivileged local user can load some eBPF programs, and those programs run in ring 0 with full kernel privilege and no address space isolation. There is no system call wall to hide behind, so the only thing keeping that code from touching kernel memory is the verifier’s proof. If the proof is correct the user is contained. If the proof is wrong, the same user runs arbitrary code in the kernel, which is privilege escalation.

    How did bounds tracking bugs like CVE-2021-3490 lead to privilege escalation?

    The verifier proves a memory access is safe by tracking the range a register can hold. In CVE-2021-3490 the 32 bit bounds for the bitwise operations AND, OR and XOR were not updated correctly, so the verifier believed a register was more constrained than its real run time value. The attacker used that false belief to walk a pointer past the end of a map, giving an out of bounds read and write in kernel memory and a path to code execution. Details are in the NVD entry for CVE-2021-3490.

    Can the verifier stop Spectre style speculative attacks?

    Not by bounds checking alone. A CPU can speculatively run past a bounds check the verifier proved sufficient, do an out of bounds load, and leak the data through a cache side channel even though the result is discarded. To handle this the verifier now simulates speculative paths and, where it cannot rule out a speculative bounds bypass, inserts an internal nospec speculation barrier so the processor cannot run past the check. The proof had to grow from what the program does to what the hardware might speculatively do.

  • Instance metadata service: the 169.254.169.254 credential leak

    Instance metadata service: the 169.254.169.254 credential leak

    The instance metadata service is a small web server that every cloud virtual machine can reach at one fixed address, 169.254.169.254, and it answers questions about the machine it runs on. Ask it nicely and it will hand back the instance ID, the network setup, the startup script, and, the part that matters most for security, a set of live cloud credentials for whatever role the instance was given. No password, no signature, just an HTTP GET from inside the box. That last detail is why a single server side request forgery bug in a web app can turn into a full cloud account takeover. This post takes the instance metadata service apart from the address up: why the magic IP exists, what lives behind it, how the credential handoff works, how attackers reach it, and the exact mechanics of the defense that AWS bolted on after it went badly wrong.

    Why there is a magic IP address at all

    Start with the address itself, because it is not arbitrary. 169.254.169.254 sits inside 169.254.0.0/16, the block reserved for link local addresses by RFC 3927. Link local means the address is only valid on the local network segment. A packet sent to it is never routed off the link and never leaves for the internet. Your laptop uses the same range when DHCP fails and it has to invent an address to talk to whatever is directly attached.
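
    You can confirm the address's classification without touching a network. This small sketch uses only Python's standard ipaddress module.

```python
import ipaddress

metadata_ip = ipaddress.ip_address("169.254.169.254")
link_local_block = ipaddress.ip_network("169.254.0.0/16")  # RFC 3927 range

print(metadata_ip in link_local_block)   # membership in the reserved block
print(metadata_ip.is_link_local)         # the library's own classification
print(metadata_ip.is_global)             # whether it is publicly routable
```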

    Cloud providers borrowed that property on purpose. Every instance, in every account, in every region, reaches its metadata at the exact same IP. The address is not routable on the public internet, so an instance can hardcode it and never worry about discovery. When the guest sends a packet to 169.254.169.254, the hypervisor or the host networking stack intercepts it before it goes anywhere and answers locally. There is no real server sitting at that address out in the network. The host is quietly impersonating one, on a link that only this instance can see.

    That design choice is elegant and it is also the root of the whole problem. The metadata endpoint is reachable by anything running on the instance that can open a socket. It does not check who is asking. It assumes that if a request arrived from inside the machine, the request is trusted. Hold on to that assumption, because every attack in this post is a way of making the metadata service answer a question on behalf of someone who is not trusted at all.

    What actually lives behind 169.254.169.254

    The metadata service exposes a tree of plain text, browsable like a tiny filesystem over HTTP. On AWS the root of the useful part is http://169.254.169.254/latest/meta-data/. Ask for it and you get a listing:

    ami-id
    block-device-mapping/
    hostname
    iam/
    instance-id
    instance-type
    local-ipv4
    mac
    placement/
    public-ipv4
    security-groups
    ...

    Most of this is housekeeping. instance-id and ami-id identify the machine and the image it booted from. local-ipv4 and mac describe its place on the network. placement/ tells you the availability zone. None of that is secret in any meaningful way. An automation tool reads these so it can configure itself without being told where it is running. This is the honest, boring purpose of the service, and it is genuinely useful.

    Then there is the iam/ branch, and this is where boring ends. Follow it to iam/security-credentials/ and the service lists the name of the role attached to the instance. Imagine a role called app-server-role. Ask for that name directly:

    GET http://169.254.169.254/latest/meta-data/iam/security-credentials/app-server-role

    and the response is a block of JSON that looks like this:

    {
      "Code": "Success",
      "Type": "AWS-HMAC",
      "AccessKeyId": "ASIAEXAMPLE7XYZ",
      "SecretAccessKey": "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY",
      "Token": "IQoJb3JpZ2luX2VjE...long base64 session token...",
      "Expiration": "2026-06-21T12:00:00Z"
    }

    Those three fields, AccessKeyId, SecretAccessKey, and Token, are a working set of AWS credentials. Anyone holding them can sign API calls as the instance role until the Expiration time. There is no extra factor and no challenge. The credentials are simply sitting there at a known URL, waiting for a GET.

    The credential flow: from role to STS to keys

    To see why credentials appear out of thin air, follow where they come from. When you launch an instance you can attach an instance profile, which wraps an IAM role. The role is a bundle of permissions, for example the ability to read objects in one S3 bucket. The role has no long lived password. Instead, the host runs an agent that asks AWS Security Token Service, STS, for temporary credentials that embody the role. STS mints a short lived key, secret, and session token, stamps them with an expiry usually a few hours out, and the agent parks them in the metadata service for the instance to read.

    This is a good design in isolation. The instance never stores a permanent secret on disk. The credentials rotate automatically before they expire, so a copy you steal stops working on its own. The application code does not even need to know the keys exist, because the AWS SDK reads them from the metadata service for you. The whole point is to keep secrets off the box and short lived. The flaw is not in STS or in rotation. The flaw is that the doorway to those credentials is an unauthenticated HTTP endpoint that trusts the caller by location alone.
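
    The rotation logic on the client side amounts to reading the Expiration field and fetching a fresh document shortly before that time. The sketch below shows that check against the example JSON from earlier. The function names and the five minute margin are assumptions for illustration; real SDKs choose their own refresh windows.

```python
import json
from datetime import datetime, timedelta, timezone

SAMPLE = """{
  "Code": "Success",
  "Type": "AWS-HMAC",
  "AccessKeyId": "ASIAEXAMPLE7XYZ",
  "SecretAccessKey": "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY",
  "Token": "placeholder-session-token",
  "Expiration": "2026-06-21T12:00:00Z"
}"""

def parse_expiration(document: str) -> datetime:
    """Read the Expiration timestamp as a timezone aware UTC datetime."""
    creds = json.loads(document)
    stamp = datetime.strptime(creds["Expiration"], "%Y-%m-%dT%H:%M:%SZ")
    return stamp.replace(tzinfo=timezone.utc)

def needs_refresh(document: str, now: datetime,
                  margin: timedelta = timedelta(minutes=5)) -> bool:
    """True once we are within `margin` of expiry (margin is hypothetical)."""
    return now >= parse_expiration(document) - margin

an_hour_before = datetime(2026, 6, 21, 11, 0, tzinfo=timezone.utc)
just_before = datetime(2026, 6, 21, 11, 57, tzinfo=timezone.utc)
print(needs_refresh(SAMPLE, an_hour_before))
print(needs_refresh(SAMPLE, just_before))
```

    The same timestamp bounds the attacker: a stolen copy is useful only until Expiration, after which they have to steal a fresh one.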

    How the instance metadata service becomes an attack

    An attacker who already has a shell on the instance does not need the metadata service. They can read those credentials, but they could read your disk and your environment variables too. The reason this endpoint is dangerous out of all proportion is that an attacker does not need a shell. They need only a way to make the instance issue one HTTP request to a URL of their choosing. That primitive is called server side request forgery, and it is one of the most common bugs in web applications. We cover the general class in our writeup on server side request forgery, but the metadata service is its highest value target by a wide margin.

    Picture a feature that fetches a URL for you. A SaaS app, call it Acme Notes, lets users add a profile picture by pasting an image URL. The server fetches that URL and stores the image. The developer pictured users pasting links to photos. Nothing stops a user from pasting this instead:

    http://169.254.169.254/latest/meta-data/iam/security-credentials/app-server-role

    The server, doing exactly what it was told, fetches that URL from inside its own network, where 169.254.169.254 leads straight to the metadata service. The JSON credential block comes back and gets stored or echoed where the attacker can read it. The attacker never logged in to the instance. They handed it a URL and the instance read its own credentials out loud. With those keys an attacker configures the AWS command line tool and now acts with the full permissions of the role, from their own laptop, anywhere in the world.
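
    Reduced to code, the bug is a fetch with no opinion about its destination. The sketch below is a made up stand-in for the Acme Notes handler. The network is simulated by a fake fetch function so nothing real is contacted, and every name in it is hypothetical.

```python
from typing import Callable

METADATA_URL = ("http://169.254.169.254/latest/meta-data/"
                "iam/security-credentials/app-server-role")

def fake_instance_fetch(url: str) -> str:
    """Stand-in for an HTTP client running on the instance (simulated)."""
    if url == METADATA_URL:
        return '{"AccessKeyId": "ASIAEXAMPLE7XYZ", "Token": "placeholder"}'
    return "<image bytes>"

def set_profile_picture(image_url: str, fetch: Callable[[str], str]) -> str:
    """Vulnerable on purpose: fetches whatever URL the user supplied."""
    body = fetch(image_url)   # no check on scheme, host, or resolved address
    return body               # stored, and later shown back to the user

print(set_profile_picture("https://example.com/cat.png", fake_instance_fetch))
print(set_profile_picture(METADATA_URL, fake_instance_fetch))
```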

    The metadata service does not leak credentials because it is broken. It leaks them because it answers honestly, and the application was tricked into asking the question on the attacker’s behalf.

    Capital One: the textbook case

    This is not theoretical. In July 2019 Capital One disclosed a breach that exposed personal data from roughly 106 million credit card applicants across the United States and Canada. The attack chain is now a standard teaching example because every link in it is one of the pieces above.

    The entry point was a misconfigured web application firewall running on an EC2 instance, built on ModSecurity. The firewall could be coerced into making a request on the attacker’s behalf, a server side request forgery. The attacker pointed that request at 169.254.169.254 and pulled the temporary credentials for the role attached to the firewall instance, a role reported as ISRM-WAF-Role. That role had permission to list and read S3 buckets, far more access than a firewall needed. Using the stolen credentials the attacker listed and then synced the contents of more than 700 buckets to a machine they controlled. One SSRF bug, one over permissioned role, and an unauthenticated metadata endpoint combined into one of the largest financial data breaches on record. The instance was using the original version of the metadata service, the one with no token required, which is the version we look at next.

    IMDSv1 versus IMDSv2: the token dance

    The version Capital One used, now called IMDSv1, is a plain request and response. You GET a URL, you get the answer. That is the entire protocol. It is also exactly what makes SSRF so effective against it, because the one thing a typical SSRF bug can do is cause a GET to an attacker chosen URL. The bug and the protocol were a perfect match for each other, in the attacker’s favor.

    AWS responded with IMDSv2, a session oriented scheme that is worth understanding precisely, because the defense is clever and it leans on what SSRF usually cannot do. Under IMDSv2 you cannot just GET the data. First you have to open a session by making a PUT request for a token:

    PUT http://169.254.169.254/latest/api/token
    X-aws-ec2-metadata-token-ttl-seconds: 21600

    The service returns a token string. The TTL header sets how long the token stays valid, with a maximum of six hours, which is 21600 seconds. Every later request for actual metadata must carry that token in a header:

    GET http://169.254.169.254/latest/meta-data/iam/security-credentials/app-server-role
    X-aws-ec2-metadata-token: <token from the PUT>

    When the instance is configured to require IMDSv2, a request with no token or an expired token is refused with 401 Unauthorized. Now look at why this stops the profile picture attack. A normal SSRF bug lets you control a URL. It does not usually let you change the HTTP method from GET to PUT, and it does not usually let you add an arbitrary request header like X-aws-ec2-metadata-token-ttl-seconds. The attacker can still make the server GET the metadata URL, but without a token that GET now returns 401 instead of credentials. The defense does not try to detect malicious URLs. It raises the bar from a single GET to a two step exchange that uses verbs and headers a forged request almost never controls.
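
    From the legitimate client's side the two step exchange is short. Here is a minimal sketch using only Python's standard library. The helper names are mine, and read_metadata only works when run on an EC2 instance, so the demo lines just build the requests and inspect them.

```python
import urllib.request

IMDS = "http://169.254.169.254"

def token_request(ttl_seconds: int = 21600) -> urllib.request.Request:
    """Step 1: a PUT that asks for a session token with the given TTL."""
    return urllib.request.Request(
        f"{IMDS}/latest/api/token",
        method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": str(ttl_seconds)},
    )

def metadata_request(path: str, token: str) -> urllib.request.Request:
    """Step 2: an ordinary GET that carries the token in a header."""
    return urllib.request.Request(
        f"{IMDS}/latest/meta-data/{path}",
        headers={"X-aws-ec2-metadata-token": token},
    )

def read_metadata(path: str, timeout: float = 2.0) -> str:
    """Run both steps. Only meaningful from inside an EC2 instance."""
    with urllib.request.urlopen(token_request(), timeout=timeout) as resp:
        token = resp.read().decode()
    with urllib.request.urlopen(metadata_request(path, token), timeout=timeout) as resp:
        return resp.read().decode()

# Inspect the requests without sending them.
put = token_request()
get = metadata_request("instance-id", "example-token")
print(put.get_method(), put.full_url, put.header_items())
print(get.get_method(), get.full_url, get.header_items())
```

    Notice what the client needed: control of the method and of two headers. A URL-only SSRF primitive has neither.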

    There is a second, quieter guard built into the same scheme. The PUT that mints a token is rejected if it carries an X-Forwarded-For header. That header is the fingerprint of a request that passed through a proxy, which is precisely the shape of many SSRF and open proxy attacks. If your forged request arrived by way of a proxy that stamped X-Forwarded-For, the token request fails before it starts.
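
    Put together, the rules so far fit in a few lines. This is a toy model of the decisions described in this post for an instance that requires IMDSv2, not AWS's implementation. The function name and return labels are invented, and the requirement that the TTL header be present is my assumption. Only the 401 for a missing or unknown token is a status taken from the text above.

```python
TOKEN_PATH = "/latest/api/token"

def imdsv2_decision(method: str, path: str, headers: dict, live_tokens: set) -> str:
    """Toy model of how a token-required metadata endpoint treats a request."""
    names = {k.lower(): v for k, v in headers.items()}  # header names ignore case
    if path == TOKEN_PATH:
        if method != "PUT":
            return "rejected"            # tokens are only minted by PUT
        if "x-forwarded-for" in names:
            return "rejected"            # request looks proxied
        if "x-aws-ec2-metadata-token-ttl-seconds" not in names:
            return "rejected"            # assumption: TTL header is mandatory
        return "token issued"
    if names.get("x-aws-ec2-metadata-token") not in live_tokens:
        return "401 Unauthorized"        # missing, unknown, or expired token
    return "metadata returned"

creds = "/latest/meta-data/iam/security-credentials/app-server-role"
ttl = {"X-aws-ec2-metadata-token-ttl-seconds": "21600"}

print(imdsv2_decision("GET", creds, {}, {"tok"}))                     # plain SSRF GET
print(imdsv2_decision("PUT", TOKEN_PATH, {**ttl, "X-Forwarded-For": "10.0.0.5"}, {"tok"}))
print(imdsv2_decision("PUT", TOKEN_PATH, ttl, {"tok"}))
print(imdsv2_decision("GET", creds, {"X-aws-ec2-metadata-token": "tok"}, {"tok"}))
```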

    The hop limit, a defense at the IP layer

    IMDSv2 adds one more control that lives below HTTP entirely. The response to the token PUT is sent with an IP time to live, the hop limit, of 1 by default. Time to live is the field in every IP packet that counts down by one at each router; when it hits zero, the packet is dropped. A hop limit of one means the token response can reach a process on the instance itself, but it cannot survive being forwarded even a single hop further.
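
    The countdown is simple enough to simulate. This sketch is a toy, with a made up function name, that models only the decrement and drop rule described above.

```python
def delivered(ttl: int, forwarding_hops: int) -> bool:
    """Simulate a packet crossing `forwarding_hops` routers.

    Each router decrements the TTL and drops the packet if it reaches zero.
    Zero hops means the receiver is on the same host or link as the sender.
    """
    for _ in range(forwarding_hops):
        ttl -= 1
        if ttl <= 0:
            return False
    return True

print(delivered(ttl=1, forwarding_hops=0))  # process on the instance itself
print(delivered(ttl=1, forwarding_hops=1))  # one forwarded hop away
print(delivered(ttl=2, forwarding_hops=1))  # same hop, limit raised to 2
```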

    Why does that matter? A common modern setup runs containers on the instance, and a misconfigured container network can let a pod reach the metadata service through the host, adding a hop. With the default hop limit of one, the token packet dies before it reaches the container, so a compromised container cannot complete the IMDSv2 handshake through that extra hop. You can raise the limit with modify-instance-metadata-options when a legitimate setup needs it, but the safe default assumes the only thing that should be talking to the metadata service is the instance itself, not anything one network hop away.
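
    The same operation is available from code. The sketch below builds the arguments for boto3's modify_instance_metadata_options call and keeps the actual API call in a separate function, since that part needs the third party boto3 package and AWS credentials. The helper names and the instance ID are placeholders.

```python
def metadata_options(instance_id: str, hop_limit: int = 1) -> dict:
    """Arguments that require IMDSv2 tokens and set the PUT response hop limit."""
    if not 1 <= hop_limit <= 64:
        raise ValueError("hop limit must be between 1 and 64")
    return {
        "InstanceId": instance_id,
        "HttpTokens": "required",        # turn IMDSv1 off
        "HttpEndpoint": "enabled",       # keep the service itself reachable
        "HttpPutResponseHopLimit": hop_limit,
    }

def apply_metadata_options(instance_id: str, hop_limit: int = 1) -> dict:
    """Send the change to AWS. Needs boto3 installed and valid credentials."""
    import boto3  # third party; imported here so the sketch loads without it
    ec2 = boto3.client("ec2")
    return ec2.modify_instance_metadata_options(
        **metadata_options(instance_id, hop_limit)
    )

# Placeholder instance ID, shown without calling AWS.
print(metadata_options("i-0123456789abcdef0"))
print(metadata_options("i-0123456789abcdef0", hop_limit=2))
```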

    The same idea on the other clouds

    This is not an AWS quirk. The pattern is industry wide, and the same magic address shows up on the other major providers, which is worth knowing because a single SSRF payload is often tried against all three.

    Google Cloud serves metadata at 169.254.169.254 and at the friendlier name metadata.google.internal. Its defense is a required header: every request must include Metadata-Flavor: Google. A plain GET with no header is refused. The reasoning is the same as the IMDSv2 token, that a typical SSRF bug controls the URL but not the headers, so demanding a custom header filters out the forged requests that only know how to set a path.

    Azure uses the same IP and requires the header Metadata: true plus an api-version parameter on the query string. Again the shape is identical. The metadata is valuable, the endpoint is unauthenticated by network position, and the guard is a request element that a forged URL fetch is unlikely to carry. Three clouds, one address, and the same lesson about trusting a caller because of where it sits.
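
    Side by side, the two required request shapes look like this. It is a sketch with the standard library that builds the requests without sending them. The specific paths and the api-version value are examples I picked, not the only valid ones.

```python
import urllib.request

def gcp_metadata_request(path: str) -> urllib.request.Request:
    """Google Cloud: every request must carry Metadata-Flavor: Google."""
    return urllib.request.Request(
        f"http://metadata.google.internal/computeMetadata/v1/{path}",
        headers={"Metadata-Flavor": "Google"},
    )

def azure_metadata_request(api_version: str = "2021-02-01") -> urllib.request.Request:
    """Azure: Metadata: true header plus an api-version query parameter."""
    return urllib.request.Request(
        f"http://169.254.169.254/metadata/instance?api-version={api_version}",
        headers={"Metadata": "true"},
    )

for req in (gcp_metadata_request("instance/service-accounts/default/token"),
            azure_metadata_request()):
    print(req.get_method(), req.full_url, req.header_items())
```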

    When blocking the address is not enough

    A defender who learns about this attack reaches for the obvious fix: if a user supplied URL points at 169.254.169.254, reject it. That helps, but a naive string match is a speed bump, because the address can be written in many shapes and an attacker needs only one of them to slip through. The evasions are the difference between a filter that holds and one that only looks like it holds.

    The same address has many spellings. 169.254.169.254 is four bytes, and those bytes can be written as one decimal number, 2852039166, or in octal, or in hex, and many HTTP clients parse all of them back to the same destination. A blocklist that only knows the dotted form never sees the decimal one. AWS also serves the metadata service over IPv6 at [fd00:ec2::254] on newer instances, so a filter that only thinks in IPv4 misses an entire second door.
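
    The arithmetic is easy to reproduce. The sketch below generates the single-number spellings and shows a string match missing all of them. The parser is a simplified, hypothetical imitation of the C style rules many clients follow (0x for hex, a leading 0 for octal) and handles only whole-address integers, not mixed dotted forms.

```python
import ipaddress

DOTTED = "169.254.169.254"
as_int = int(ipaddress.ip_address(DOTTED))   # the four bytes as one number

spellings = {
    "decimal": str(as_int),
    "hex": hex(as_int),
    "octal": "0" + oct(as_int)[2:],          # C style octal: leading zero
}

def naive_blocklist(host: str) -> bool:
    """String match only. True means the host would be blocked."""
    return host == DOTTED

def parse_single_number(host: str) -> ipaddress.IPv4Address:
    """Simplified C style parse of a whole-address integer."""
    if host.lower().startswith("0x"):
        value = int(host, 16)
    elif host.startswith("0") and len(host) > 1:
        value = int(host, 8)
    else:
        value = int(host, 10)
    return ipaddress.IPv4Address(value)

for name, text in spellings.items():
    print(name, text, naive_blocklist(text), parse_single_number(text))

# The IPv6 endpoint is not in a link local range at all.
v6 = ipaddress.ip_address("fd00:ec2::254")
print(v6.is_link_local, v6.is_private)
```

    The last line matters for filter design: a check that only asks whether an address is link local lets the IPv6 endpoint through, so it has to be covered by a broader private address check or named explicitly.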

    Then there are the tricks that defeat checking the host at all. With DNS rebinding, the attacker controls a domain that resolves to a harmless address the first time the app checks it, then flips to 169.254.169.254 a moment later when the app actually connects. The validation and the connection see different answers. With a redirect, the attacker hands the app a URL on a domain that passes validation, and that server replies with an HTTP redirect to the metadata IP, which many fetch libraries follow on their own. The app checked the first hop and walked into the second. We pull that thread further in our writeup on open redirects, because the same trust in a validated host powers both bugs.

    There is also the case where the app fetches the URL but never shows you the result. That is blind server side request forgery. The metadata response comes back, but it lands in a log or a thumbnail the attacker cannot read directly. The attack is not dead, only quieter. The attacker arranges for the fetched credentials to surface somewhere reachable, a field that is displayed later or an out of band channel they control. Blind does not mean safe, it means slower.

    Once the credentials are out, the metadata service has done its damage and the attacker moves on. The first thing a careful attacker does with stolen keys is ask who they belong to and what they are allowed to touch, then map the blast radius before doing anything noisy. That is why least privilege on the role matters as much as blocking the address. The endpoint decides whether credentials leak. The role decides how much the leak is worth.

    How to actually lock it down

    The good news is that the controls stack, and none of them depend on finding every SSRF bug first. Defense in depth here is real, not a slogan.

    • Require IMDSv2 and turn IMDSv1 off. Set the instance metadata options so that a token is mandatory. This single change neutralizes the plain GET attack that took down Capital One. New instances can enforce it from launch, and you can flip existing ones with modify-instance-metadata-options.
    • Keep the hop limit at 1 unless a specific workload proves it needs more. If you run containers, prefer a setup that gives pods their own scoped credentials rather than reaching through the host.
    • Give the role the least privilege it can do its job with. The Capital One role could read hundreds of buckets it never needed. If that role had been allowed to touch only the one bucket the firewall required, the same SSRF would have had a far smaller blast radius. The metadata service handing out credentials is only as dangerous as the credentials it hands out.
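    • Filter egress and block the metadata IP at the application layer. If a feature fetches user supplied URLs, refuse any request whose host resolves into the link local range or to the IPv6 metadata address, and do the check after resolving the name, not before, so a hostname that points at 169.254.169.254 cannot sneak past. Then connect to the address you checked rather than resolving the name a second time, so DNS rebinding cannot swap the answer between the check and the connection.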
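
    The application layer check in the last item can be sketched as follows. This is one possible shape, not a complete SSRF defense: the function names are mine, the resolver is injectable so the demo runs without DNS, and the policy of refusing every private, loopback, and link local address is an assumption that may be too strict for apps that legitimately call internal hosts.

```python
import ipaddress
import socket
from urllib.parse import urlsplit

def is_forbidden(ip_text: str) -> bool:
    """True for addresses a user supplied URL should never reach."""
    addr = ipaddress.ip_address(ip_text)
    if addr.version == 6 and addr.ipv4_mapped is not None:
        addr = addr.ipv4_mapped          # unwrap ::ffff:169.254.169.254
    return (addr.is_link_local or addr.is_private or addr.is_loopback
            or addr.is_unspecified or addr.is_multicast or addr.is_reserved)

def resolve_checked(url: str, resolver=socket.getaddrinfo) -> list[str]:
    """Resolve the URL's host and return its addresses, or raise ValueError.

    Callers should connect to one of the returned addresses directly, and
    run the same check again on every redirect target.
    """
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https") or not parts.hostname:
        raise ValueError("unsupported URL")
    port = parts.port or (443 if parts.scheme == "https" else 80)
    infos = resolver(parts.hostname, port, type=socket.SOCK_STREAM)
    addresses = sorted({info[4][0] for info in infos})
    if not addresses or any(is_forbidden(a) for a in addresses):
        raise ValueError("destination not allowed")
    return addresses

# Demo with a fake resolver (hypothetical hostnames, no real DNS involved).
FAKE_DNS = {"images.example.com": "93.184.216.34",
            "rebind.attacker.example": "169.254.169.254"}

def fake_resolver(host, port, type=0):
    return [(socket.AF_INET, socket.SOCK_STREAM, 6, "", (FAKE_DNS[host], port))]

print(resolve_checked("https://images.example.com/cat.png", fake_resolver))
try:
    resolve_checked("http://rebind.attacker.example/x", fake_resolver)
except ValueError as err:
    print("refused:", err)
```

    Checking the resolved address rather than the hostname string is what makes the alternate spellings and attacker controlled hostnames collapse into one case.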

    The assumption that breaks

    Step back from the headers and the JSON and the one thing left is an assumption. The metadata service was built to trust any caller that reaches it from inside the instance, because when it was designed the inside of an instance was a place only you could be. The web application running on top of that instance quietly broke the assumption. The moment an app fetches a URL on a user’s behalf, the user can reach anything the app can reach, and the app can reach 169.254.169.254. The boundary everyone pictured, the wall around the instance, was not the boundary that mattered. The boundary that mattered ran through a profile picture field.

    That gap between what a system assumes about its callers and what an attacker can actually arrange is the kind of thing you find by asking what each component trusts and why, rather than by scanning for a known bad string. The metadata service is honest, the SDK is convenient, the role rotates its keys, and the sum of those reasonable parts is a path from one web request to a cloud account. Require the token, cut the permissions, block the address at the edge, and the most dangerous IP in your cloud goes back to being a boring configuration helper.

    Frequently asked questions

    What is the instance metadata service used for?

    It is a local endpoint at 169.254.169.254 that lets a cloud virtual machine read facts about itself, like its instance ID, network setup, and startup script, without being configured by hand. The dangerous part is that it also serves the temporary credentials for the IAM role attached to the instance, which is why it is a prime target once an attacker can make the machine send a request.

    How does SSRF lead to stealing cloud credentials?

    If an application can be tricked into fetching an attacker chosen URL, the attacker points it at http://169.254.169.254/latest/meta-data/iam/security-credentials/ to learn the role name, then requests the same path with that role name appended, and the server reads its own role credentials back. The endpoint trusts any caller on the instance, so a single server side request forgery bug becomes a full set of working AWS keys. This is the exact chain behind the 2019 Capital One breach.

    Does IMDSv2 fully prevent metadata attacks?

    IMDSv2 raises the bar a lot but is not a complete fix on its own. It forces a PUT request for a session token and a custom header on every read, which a typical SSRF cannot supply, so plain GET attacks fail. You still need least privilege on the role and egress filtering, because attackers chain redirects, DNS rebinding, and alternate IP encodings to reach the endpoint. AWS documents the scheme in its IMDS guide.

    Do Google Cloud and Azure have the same metadata risk?

    Yes, both serve metadata at the same 169.254.169.254 address and carry the same risk. Google Cloud requires a Metadata-Flavor: Google header and Azure requires Metadata: true, and like IMDSv2 those required headers exist to filter out forged URL fetches that only control the path. A single SSRF payload is often tested against all three clouds.