
By running a simple `less /proc/self/maps` I see that most mappings start with 55 and 7f. I have also noticed the same ranges being used whenever I debug any binary.
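
For reference, a minimal C program that prints its own mappings (the same information `less /proc/self/maps` shows):

#include <stdio.h>

int main(void)
{
    /* Print this process's own memory map. */
    FILE *f = fopen("/proc/self/maps", "r");
    if (!f) {
        perror("fopen");
        return 1;
    }

    char line[512];
    while (fgets(line, sizeof(line), f))
        fputs(line, stdout); /* lines start with 55... or 7f... on x86-64 */

    fclose(f);
    return 0;
}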

In addition, this comment suggests that the kernel does indeed have some range preference.

Why is that? Is there some deeper technical reason for the above ranges? Will there be a problem if I manually mmap pages outside of these prefixes?

Anastasios Andronidis

1 Answer


First and foremost, assuming that you are talking about x86-64, we can see that its virtual memory map is:

========================================================================================================================
    Start addr    |   Offset   |     End addr     |  Size   | VM area description
========================================================================================================================
                  |            |                  |         |
 0000000000000000 |    0       | 00007fffffffffff |  128 TB | user-space virtual memory, different per mm
__________________|____________|__________________|_________|___________________________________________________________
 ...              |    ...     | ...              |  ...

Userspace addresses are always in canonical form on x86-64, using only the lower 48 bits with 4-level page tables or 57 bits with 5-level page tables. Note that the upper bits are a sign extension of the highest usable bit, which is only set to 1 for the kernel; in practice you therefore see at most 47 or 56 bits set in userspace, with the most significant bits always 0.

See: https://www.kernel.org/doc/Documentation/x86/x86_64/mm.txt

This puts the end of user-space virtual memory at 0x7fffffffffff. This is where the stack of new programs starts: at 0x7ffffffff000 (minus some random offset due to ASLR), growing towards lower addresses.
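
As a quick illustration of the canonical-form rule under 4-level paging, here is a sketch: an address is canonical if its low 48 bits, sign-extended, reproduce the full 64-bit value (the arithmetic right shift on a negative value is the usual gcc/clang behavior):

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* An address is canonical for 48-bit (4-level) paging if bits 63..47
 * are all copies of bit 47, i.e. the low 48 bits sign-extended. */
static bool is_canonical48(uint64_t addr)
{
    return (uint64_t)((int64_t)(addr << 16) >> 16) == addr;
}

int main(void)
{
    printf("%d\n", is_canonical48(0x00007ffffffff000ULL)); /* 1: stack top    */
    printf("%d\n", is_canonical48(0x0000800000000000ULL)); /* 0: in the hole  */
    printf("%d\n", is_canonical48(0xffff800000000000ULL)); /* 1: kernel half  */
    return 0;
}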


Let me address the simple question first:

Will there be a problem if I manually mmap pages outside of these prefixes?

Not at all: the mmap syscall always checks the address being requested, and it will refuse to map pages that overlap an already-mapped memory area or pages at completely invalid addresses (e.g. addr < mmap_min_addr or addr > 0x7ffffffff000).
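
For example, this sketch asks for a page at 0x100000000000, well outside both the 0x55... and 0x7f... regions. The address is just an arbitrary valid pick, and MAP_FIXED_NOREPLACE (Linux 4.17+, glibc 2.28+) makes mmap fail instead of silently relocating the mapping:

#define _GNU_SOURCE /* for MAP_FIXED_NOREPLACE */
#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
    /* An arbitrary valid userspace address, far from the usual
     * 0x55.../0x7f... areas picked by the kernel and the loader. */
    void *want = (void *)0x100000000000UL;

    void *got = mmap(want, 0x1000, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED_NOREPLACE, -1, 0);
    if (got == MAP_FAILED) {
        perror("mmap"); /* EEXIST if it would overlap an existing mapping */
        return 1;
    }

    printf("mapped at %p\n", got); /* 0x100000000000, no problem at all */
    munmap(got, 0x1000);
    return 0;
}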


Now... diving straight into the Linux kernel code, precisely into the kernel ELF loader (fs/binfmt_elf.c:960), we can see a pretty long and explanatory comment:

/*
 * This logic is run once for the first LOAD Program
 * Header for ET_DYN binaries to calculate the
 * randomization (load_bias) for all the LOAD
 * Program Headers, and to calculate the entire
 * size of the ELF mapping (total_size). (Note that
 * load_addr_set is set to true later once the
 * initial mapping is performed.)
 *
 * There are effectively two types of ET_DYN
 * binaries: programs (i.e. PIE: ET_DYN with INTERP)
 * and loaders (ET_DYN without INTERP, since they
 * _are_ the ELF interpreter). The loaders must
 * be loaded away from programs since the program
 * may otherwise collide with the loader (especially
 * for ET_EXEC which does not have a randomized
 * position). For example to handle invocations of
 * "./ld.so someprog" to test out a new version of
 * the loader, the subsequent program that the
 * loader loads must avoid the loader itself, so
 * they cannot share the same load range. Sufficient
 * room for the brk must be allocated with the
 * loader as well, since brk must be available with
 * the loader.
 *
 * Therefore, programs are loaded offset from
 * ELF_ET_DYN_BASE and loaders are loaded into the
 * independently randomized mmap region (0 load_bias
 * without MAP_FIXED).
 */
if (interpreter) {
    load_bias = ELF_ET_DYN_BASE;
    if (current->flags & PF_RANDOMIZE)
        load_bias += arch_mmap_rnd();
    elf_flags |= MAP_FIXED;
} else
    load_bias = 0;

In short, there are two types of ELF Position Independent Executables:

  1. Normal programs: they require a loader in order to run. This represents basically 99.9% of the ELF programs on a normal Linux system. The path of the loader is specified in the ELF program headers, with a program header of type PT_INTERP.

  2. Loaders: a loader is an ELF that does not specify a PT_INTERP program header, and that is responsible for loading and starting normal programs. It also does a bunch of fancy stuff behind the scenes (resolve relocations, load needed libraries, etc.) before actually starting the program that is being loaded.
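
You can tell the two apart with `readelf -l <file>` (look for an INTERP header), or programmatically. Here is a minimal sketch using the elf.h definitions, assuming a 64-bit ELF in the host byte order and keeping error handling short:

#include <elf.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s <elf-file>\n", argv[0]);
        return 1;
    }

    FILE *f = fopen(argv[1], "rb");
    if (!f) { perror("fopen"); return 1; }

    Elf64_Ehdr ehdr;
    if (fread(&ehdr, sizeof(ehdr), 1, f) != 1) { perror("fread"); return 1; }

    /* Walk the program headers looking for PT_INTERP. */
    int has_interp = 0;
    fseek(f, (long)ehdr.e_phoff, SEEK_SET);
    for (int i = 0; i < ehdr.e_phnum; i++) {
        Elf64_Phdr phdr;
        if (fread(&phdr, sizeof(phdr), 1, f) != 1) break;
        if (phdr.p_type == PT_INTERP) { has_interp = 1; break; }
    }
    fclose(f);

    puts(has_interp ? "PT_INTERP present: normal program (needs a loader)"
                    : "no PT_INTERP: loader or static executable");
    return 0;
}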

When the kernel executes a new ELF through an execve syscall, it needs to map both the program itself and its loader into memory. Control will then be passed to the loader, which will resolve and map all the needed shared libraries and finally pass control to the program. Since both the program and its loader need to be mapped, the kernel needs to make sure that those mappings don't overlap (and also that future mapping requests made by the loader will not overlap them).

In order to do this, the loader is mapped near the stack (at a lower address than the stack, with some margin, since the stack is allowed to grow by adding more pages if needed), leaving the duty of applying ASLR to mmap itself. The program is then mapped using a load_bias (as seen in the snippet above) that puts it far enough from the loader, at a much lower address.
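
This split is easy to observe from a small test program. A minimal sketch follows; the exact addresses change every run because of ASLR, but on a typical x86-64 PIE build the prefixes should match:

#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
    int local = 0;
    void *anon = mmap(NULL, 0x1000, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    printf("program code: %p\n", (void *)&main);  /* typically 0x55...   */
    printf("mmap area:    %p\n", anon);           /* typically 0x7f...   */
    printf("stack:        %p\n", (void *)&local); /* typically 0x7ffc... */
    return 0;
}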

If we take a look at ELF_ET_DYN_BASE, we see that it is architecture dependent and on x86-64 it evaluates to:

((1ULL << 47) - (1 << 12)) / 3 * 2 == 0x555555554aaa

Basically, around 2/3 of TASK_SIZE. That load_bias is then adjusted by adding arch_mmap_rnd() bytes if ASLR is enabled, and is finally page-aligned. At the end of the day, this is the reason why we usually see addresses starting with 0x55 for programs.
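
If you want to double-check the arithmetic, a short sketch evaluating the same expression with the usual C integer semantics:

#include <stdio.h>

int main(void)
{
    /* ELF_ET_DYN_BASE on x86-64: two thirds of the highest userspace
     * page with 4-level paging, truncated by integer division. */
    unsigned long long base = ((1ULL << 47) - (1 << 12)) / 3 * 2;
    printf("%#llx\n", base); /* prints 0x555555554aaa */
    return 0;
}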

When control is passed to the loader, the virtual memory area for the process has already been defined, and successive mmap syscalls that do not specify an address will return decreasing addresses starting near the loader. As we just saw the loader is mapped near the stack, and the stack is at the very end of the user address space: this is the reason why we usually see addresses starting with 0x7f for libraries.
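
You can watch this top-down behavior directly. In this sketch, each hint-less mmap is typically placed right below the previous mapping, in the same 0x7f... region as the loader and the libraries (assuming the default layout and no hint addresses):

#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
    for (int i = 0; i < 4; i++) {
        void *p = mmap(NULL, 0x1000, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        printf("%p\n", p); /* 0x7f... addresses, decreasing by 0x1000 */
    }
    return 0;
}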

There is a common exception to the above: the case where the loader is invoked directly, for example:

/lib/x86_64-linux-gnu/ld-2.24.so ./myprog

The kernel will not map ./myprog in this case and will leave that to the loader. As a consequence, ./myprog will be mapped at some 0x7f... address by the loader.

You may be wondering: why doesn't the kernel always let the loader map the program, then? And why isn't the program just mapped right before/after the loader? I don't have a 100% definitive answer for this, but a few reasons come to mind:

  1. Consistency: making the kernel itself load the ELF into memory, without depending on the loader, avoids trouble. If this weren't the case, the kernel would fully depend on the userspace loader, which is not advisable at all (this may also partially be a security concern).

  2. Efficiency: we know for sure that at least both the executable and its loader need to be mapped (regardless of any linked libraries), so we might as well save precious time and do it right away rather than wait for another syscall and its associated context switch.

  3. Security: in the default scenario, mapping the program at a different randomized address than the loader and other libraries provides a sort of "isolation" between the program itself and the loaded libraries. In other words, "leaking" any library address won't reveal the program position in memory, and vice-versa. Mapping the program at a predefined offset from the loader and other libraries would instead partially defeat the purpose of ASLR.

    In an ideal security-driven scenario, every single mmap (i.e. any needed library) would also be placed at a randomized address independent of previous mappings, but this would hurt performance significantly. Keeping allocations grouped results in faster page table lookups: see Understanding The Linux Kernel (3rd edition), page 606, Table 15-3 ("Highest index and maximum file size for each radix tree height"). It would also cause much greater virtual memory fragmentation, becoming a real problem for programs that need to map large files to memory. The substantial part of the isolation between program code and library code is already achieved; going further would have more cons than pros.

  4. Ease of debugging: seeing RIP=0x55... vs RIP=0x7f... instantly helps you figure out where to look (program itself or library code).

Marco Bonelli
  • It should be mentioned that `00007fffffffffff` is the very top of the low half of the canonical range of usable 48-bit virtual addresses. x86-64 48-bit virtual addresses are why Linux picked `(1ULL<<47) - 1`. [x86-64 canonical address?](https://stackoverflow.com/q/25852367), and [Address canonical form and pointer arithmetic](https://stackoverflow.com/q/38977755) has a diagram. – Peter Cordes May 03 '20 at 03:07
  • Why have the kernel map the executable? Could also be efficiency; the execve system call has already found the ELF executable in the filesystem and parsed its program headers. Might as well create mappings for it instead of making ld.so find it via a pathname and then parse it. Also means there's no race condition where the executable is unlinked as soon as execve returns and then ld.so can't find it. Or any issue with permissions. – Peter Cordes May 03 '20 at 03:13
  • Also note that before PIEs were a thing, normal executables were mapped to near the bottom of virtual address space (all static code and data in the low 2GiB, `ld` defaulting to `401000` as the start of the text section, i.e. only 4MiB above zero). This left almost the whole virtual address space open for a potentially huge contiguous allocation between static code/data at the bottom and stack + mmap defaults near the top, in case anyone cared to do that. – Peter Cordes May 03 '20 at 03:16
  • Also note that keeping your allocations grouped near each other (not sparse) is good for page-table efficiency. It's a radix tree so keeping your allocations grouped in 1GiB chunks means they're in the same subtree, leaving more parts of the full tree that don't need to exist (not present in the top-level page directory). – Peter Cordes May 03 '20 at 03:19
  • @PeterCordes Thanks for the links! Also, yes, making the kernel depend on the loader for each `execve` doesn't make much sense anyway. I did not think about the performance, that's also a good point, but I bet the primary reason is not depending on the userspace loader. – Marco Bonelli May 03 '20 at 09:52
  • I'm confused by the description of the process address space: you say the program is above the loader, which is above the stack. Then that `mmap` will map from `0x7f` down because it starts mapping *above* the loader. But `0x55` (program) is < `0x7f` (loader/mmap). – Margaret Bloom May 03 '20 at 12:44
  • @MargaretBloom when I say *"the program is mapped [...] somewhere above the loader"* I mean "above" as in "at a lower address". Same when I say *"successive mmap syscalls that do not specify an address will return decreasing addresses starting right above the loader"*, I mean that `mmap` will map from `0x7f...` up (where up means lower addresses than `0x7f...`). I am assuming a VM layout like the one depicted in the kernel documentation that I linked, top = `0x0`, bottom = `0xffffffffffffffff`. Is that not clear? I could re-word it if you have a suggestion for a better way to phrase it. – Marco Bonelli May 03 '20 at 12:51
  • @MarcoBonelli I'm not a native English speaker, so it may just be me but I think you are using "up" to mean "lower addresses" and "down" to mean "higher addresses". Kind of like you are using a map of the address space where numerically greater addresses are drawn below numerically lower ones (e.g. like the one from gdb) and using "up" and "down" to denote position in the *map* rather than in the address space. E.g. it seems like, for you, `0x55` is *above* `0x7f` (because it would be, for example, in what gdb outputs) but, for me, ordering is between addresses (so `0x55` is below `0x7f`) – Margaret Bloom May 03 '20 at 13:06
  • @MargaretBloom I'm not a native English speaker either. I always use "above" and "below" thinking about the position in memory as if it was ordered with higher addresses at the bottom. I get your point, those sentences could mean two different, opposite things if interpreted differently. I'll reword the answer to avoid confusion. – Marco Bonelli May 03 '20 at 13:12
  • PML5 is 57-bit virtual addresses. 48 + 9 = 57 from having an extra level of the same page-table format as usual. IDK why https://www.kernel.org/doc/Documentation/x86/x86_64/mm.txt says 56-bit; maybe someone had a brain fart and was thinking about how many bits were usable for kernel-space (with bit #57 fixed at 1). – Peter Cordes Nov 27 '21 at 23:18
  • @PeterCordes thanks, clarified in my last edit. It probably says 56 because for userspace you can only have 56 *bits set*... but then it should say 47 for 4 level paging... and that'd be misleading anyway, whatever. – Marco Bonelli Nov 27 '21 at 23:41
  • Yeah, that's what I figured while commenting on the same 56 vs 48 thing in my answer on [ASLR and memory layout on 64 bits: Is it limited to the canonical part (128 TiB)?](https://stackoverflow.com/a/70139356) Hopefully an unintentional use of 56 as a brain fart by the author. – Peter Cordes Nov 28 '21 at 02:51
  • This is a great answer, thanks. Any idea how it might vary for AArch64? I see that sometimes thread stack mappings wind up at `0x55...`, but maybe not always. – jacobsa Sep 15 '22 at 06:33
  • @jacobsa thread stacks are different. They end up in the same area as library mappings because libc simply does a mmap before the clone syscall that creates the thread. A thread stack could theoretically be anywhere in the address space, it's up to the library implementation, or even up to you if you use clone manually. – Marco Bonelli Sep 15 '22 at 11:34
  • Thread stacks on x86-64 in my environment seem to always start with `0x7f` or `0x7e` (or progressively lower if tons of threads). This seems to be because `pthread_create` does an `mmap`, and the same logic for libraries winding up there from the answer applies, right? But this is not true on aarch64, and I'm not sure why. – jacobsa Sep 16 '22 at 22:05
  • @jacobsa yes on x86 that's because the same logic I explained above for library mappings applies. I'm not sure about arm64, you should take a look at the glibc implementation to know why. After all, AFAIK the thread stack address is up to userspace to choose. – Marco Bonelli Sep 16 '22 at 22:53