
While attempting to test *Is it allowed to access memory that spans the zero boundary in x86?* in user-space on Linux, I wrote a 32-bit test program that tries to map the low and high pages of the 32-bit virtual address space.

After `echo 0 | sudo tee /proc/sys/vm/mmap_min_addr`, I can map the zero page, but I don't know why I can't map -4096, i.e. `(void*)0xfffff000`, the highest page. Why does `mmap2((void*)-4096)` return `-ENOMEM`?

strace ./a.out 
execve("./a.out", ["./a.out"], 0x7ffe08827c10 /* 65 vars */) = 0
strace: [ Process PID=1407 runs in 32 bit mode. ]
....
mmap2(0xfffff000, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = -1 ENOMEM (Cannot allocate memory)
mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0

Also, what check is rejecting it in linux/mm/mmap.c, and why is it designed that way? Is this part of making sure that creating a pointer to one-past-an-object can't wrap around and break pointer comparisons? (ISO C and C++ allow creating a pointer to one past the end of an object, but otherwise not outside of objects.)


I'm running under a 64-bit kernel (4.12.8-2-ARCH on Arch Linux), so 32-bit user space has the entire 4GiB available. (Unlike 64-bit code on a 64-bit kernel, or with a 32-bit kernel where the 2:2 or 3:1 user/kernel split would make the high page a kernel address.)

I haven't tried from a minimal static executable (no CRT startup or libc, just asm) because I don't think that would make a difference. None of the CRT startup system calls look suspicious.


While stopped at a breakpoint, I checked /proc/PID/maps. The top page isn't already in use. The stack includes the 2nd highest page, but the top page is unmapped.

00000000-00001000 rw-p 00000000 00:00 0             ### the mmap(0) result
08048000-08049000 r-xp 00000000 00:15 3120510                 /home/peter/src/SO/a.out
08049000-0804a000 r--p 00000000 00:15 3120510                 /home/peter/src/SO/a.out
0804a000-0804b000 rw-p 00001000 00:15 3120510                 /home/peter/src/SO/a.out
f7d81000-f7f3a000 r-xp 00000000 00:15 1511498                 /usr/lib32/libc-2.25.so
f7f3a000-f7f3c000 r--p 001b8000 00:15 1511498                 /usr/lib32/libc-2.25.so
f7f3c000-f7f3d000 rw-p 001ba000 00:15 1511498                 /usr/lib32/libc-2.25.so
f7f3d000-f7f40000 rw-p 00000000 00:00 0 
f7f7c000-f7f7e000 rw-p 00000000 00:00 0 
f7f7e000-f7f81000 r--p 00000000 00:00 0                       [vvar]
f7f81000-f7f83000 r-xp 00000000 00:00 0                       [vdso]
f7f83000-f7fa6000 r-xp 00000000 00:15 1511499                 /usr/lib32/ld-2.25.so
f7fa6000-f7fa7000 r--p 00022000 00:15 1511499                 /usr/lib32/ld-2.25.so
f7fa7000-f7fa8000 rw-p 00023000 00:15 1511499                 /usr/lib32/ld-2.25.so
fffdd000-ffffe000 rw-p 00000000 00:00 0                       [stack]

Are there VMA regions that don't show up in maps but still convince the kernel to reject the address? I looked at the occurrences of ENOMEM in linux/mm/mmap.c, but it's a lot of code to read, so maybe I missed something. Is there something that reserves a range of high addresses, or is it rejected because it's next to the stack?

Making the system calls in the other order doesn't help. (But PAGE_ALIGN and similar macros are written carefully to avoid wrapping around before masking, so that wasn't likely anyway.)


Full source, compiled with `gcc -O3 -fno-pie -no-pie -m32 address-wrap.c`:

#include <sys/mman.h>

//void *mmap(void *addr, size_t len, int prot, int flags,
//           int fildes, off_t off);

int main(void) {
    volatile unsigned *high =
        mmap((void*)-4096L, 4096, PROT_READ | PROT_WRITE,
             MAP_FIXED|MAP_PRIVATE|MAP_ANONYMOUS,
             -1, 0);
    volatile unsigned *zeropage =
        mmap((void*)0, 4096, PROT_READ | PROT_WRITE,
             MAP_FIXED|MAP_PRIVATE|MAP_ANONYMOUS,
             -1, 0);


    return (high == MAP_FAILED) ? 2 : *high;
}

(I left out the part that tried to deref `(int*)-2` because it just segfaults when mmap fails.)

Peter Cordes
  • what if you try a larger chunk even if you don't need all of it, 0x10000000 bytes for example – old_timer Dec 10 '17 at 18:20
  • @old_timer: hmm, worth a try. Would have to do it from asm though, because the stack starts in the page below the one I want to map (and the current `ESP` is still in that page or the one below when `main` runs). – Peter Cordes Dec 10 '17 at 18:22
  • https://stackoverflow.com/questions/8547071/32-bit-process-s-address-space-on-64-bit-linux – Ross Ridge Dec 10 '17 at 18:23
  • @RossRidge: The answer there is wrong. The lowest error-return value is `-4095`, which is `0xfffff001`, *not* `0xfffff000`. Linux system calls are able to return every possible page address, including the highest one. I was wondering if it was being reserved for the vDSO page, but the answer there says it's the 2nd-highest page that's reserved for the vDSO. (But in my process, that page is part of `[stack]`, so clearly Linux has changed since then, or it's wrong about that, too.) – Peter Cordes Dec 10 '17 at 18:26
  • I don't see how that makes a difference. The addresses -4095 to -1 are within the last page so Linux doesn't let you allocate it. Also the second to last page isn't part of the stack according to `/proc/PID/maps`. – Ross Ridge Dec 10 '17 at 18:31
  • @RossRidge: It makes a difference because `mmap` can return any page address, so it can return `0xfffff000` as a non-error return value. I think the only system call that returns a pointer that might not be page-aligned is `brk`. That part of the answer looks incorrectly made-up and not directly supported by the comments in the kernel source. (The way I understood that comment was that choosing `-4095` to `-1` allows distinguishing error from pointer for all system calls including `mmap` without losing any address-space.) – Peter Cordes Dec 10 '17 at 18:41
  • Oops, you're right, the stack mapping ends at `0xffffe000` (non-inclusive). – Peter Cordes Dec 10 '17 at 18:44
  • The way I read it is that "error-valued pointers" are used extensively throughout the kernel and so the addresses in the range -4095 to -1 need to be reserved, if only because no one can be sure how these pointer values are used or will be used. – Ross Ridge Dec 10 '17 at 19:11
  • @RossRidge: I think that's probably the right interpretation. I haven't found an explicit check to reject attempts to map the highest page, but maybe there's a hidden VMA that [`mmap_region` finds but can't `munmap`](https://elixir.free-electrons.com/linux/v4.12/source/mm/mmap.c#L1627). Probably this VMA includes the top two pages to reserve space for the vDSO as well. But anyway, letting user-space map the top page would mean that stuff like `read(0, 0xfffff123, 100)` would have to work, and the kernel probably wants to return that pointer from a check_valid function. – Peter Cordes Dec 10 '17 at 19:20
  • Basically I hadn't grokked that `IS_ERR_VALUE()` is used internally, not just for values that are being directly returned as syscall return values. – Peter Cordes Dec 10 '17 at 19:22

1 Answer


The `mmap` function eventually calls either `do_mmap` or `do_brk_flags`, which do the actual work of satisfying the memory-allocation request. These functions in turn call `get_unmapped_area`. It is in that function that the checks are made to ensure that memory cannot be allocated beyond the user address-space limit, which is defined by `TASK_SIZE`. I quote from the code:

 * There are a few constraints that determine this:
 *
 * On Intel CPUs, if a SYSCALL instruction is at the highest canonical
 * address, then that syscall will enter the kernel with a
 * non-canonical return address, and SYSRET will explode dangerously.
 * We avoid this particular problem by preventing anything executable
 * from being mapped at the maximum canonical address.
 *
 * On AMD CPUs in the Ryzen family, there's a nasty bug in which the
 * CPUs malfunction if they execute code from the highest canonical page.
 * They'll speculate right off the end of the canonical space, and
 * bad things happen.  This is worked around in the same way as the
 * Intel problem.

#define TASK_SIZE_MAX   ((1UL << __VIRTUAL_MASK_SHIFT) - PAGE_SIZE)

#define IA32_PAGE_OFFSET    ((current->personality & ADDR_LIMIT_3GB) ? \
                             0xc0000000 : 0xFFFFe000)

#define TASK_SIZE           (test_thread_flag(TIF_ADDR32) ? \
                             IA32_PAGE_OFFSET : TASK_SIZE_MAX)

On processors with 48-bit virtual address spaces, __VIRTUAL_MASK_SHIFT is 47.

Note that TASK_SIZE depends on whether the current process is 32-bit on a 32-bit kernel, 32-bit on a 64-bit kernel, or 64-bit on a 64-bit kernel. For 32-bit processes, two pages are reserved: one for the vsyscall page and the other used as a guard page. Essentially, the vsyscall page cannot be unmapped, so the highest address of the user address space is effectively 0xFFFFe000. For 64-bit processes, one guard page is reserved. These pages are only reserved on 64-bit Intel and AMD processors, because only on those processors is the SYSCALL mechanism used.

Here is the check that is performed in get_unmapped_area:

if (addr > TASK_SIZE - len)
        return -ENOMEM;
Hadi Brais
  • I haven't *tried* 32-bit on a 32-bit kernel, but the kernel itself maps the upper 1 or 2GiB of virtual address space in that case. So there's no expectation of user-space being able to map it. – Peter Cordes Mar 15 '18 at 03:36
  • `syscall` setting up `sysret` to fail is a problem because if `sysret` explodes, it does so *in kernel mode* (kernel #GP with user-space RSP value = exploit). This is why processes that have had their RIP modified by `ptrace` return to user-space with `iret` (https://github.com/torvalds/linux/blob/master/arch/x86/entry/entry_64.S#L253). A `ret` to a `call` that was at the end of a mapping is already something that will crash user-space, so the kernel doesn't have to stop user-space crashing itself that way, only from exploiting `sysret` "bugs" / design flaws. – Peter Cordes Mar 15 '18 at 03:56
  • @PeterCordes I still don't understand why SYSRET with non-canonical return address is dangerous, though. I thought using non-canonical addresses would generate a fault. The kernel can catch that and terminate the program. But apparently this is not the case. – Hadi Brais Mar 15 '18 at 04:13
  • Apparently `sysret` is dangerous because of CPU bugs on both AMD and Intel. Read the comments in the kernel code I linked, and look for other `sysret` comments in that file. On Intel, the #GP because of a bad user-space RIP happens *while still in ring 0*, and `rsp` has already been set to a value controlled by user-space. So another user-space thread could overwrite the exception-return frame and take control of the kernel, if I understand this correctly. That's why the kernel can't safely catch the exception. Like I said, CPU bug / design flaw. – Peter Cordes Mar 15 '18 at 04:19
  • When linking to code on GitHub, please press `y` to get a permalink, because when `master` moves, your line reference can (and did) become invalid. – Jonathon Reinhart Feb 27 '19 at 11:47
  • @JonathonReinhart Thank you for the tip. I've fixed all the links *permanently*. – Hadi Brais Feb 27 '19 at 15:46