
When loading a PIE executable with ASLR enabled, will Linux restrict the mapping of the program segments to the canonical low half (up to 0000_7fff_ffff_ffff), or will it use the full lower half of the 64-bit space (everything with bit 63 clear, i.e. up to 7fff_ffff_ffff_ffff)?

Marco Bonelli
Aaa Bbb
  • It could be processor specific. Not the same on an AMD Ryzen Thunderbird desktop and on some cheap laptop. Try `cat /proc/self/maps` and `cat /proc/$$/maps` several times – Basile Starynkevitch Nov 27 '21 at 18:42
  • What do you mean by "full lower section"? Isn't 0 – 7fffffffffff the full lower section? – prl Nov 27 '21 at 19:10
  • @prl The question is whether it's only to 7fffffffffff, or all the way to 7fffffffffffffff. – Joseph Sible-Reinstate Monica Nov 27 '21 at 19:37
  • 2
    You seem to know what a canonical address is... so why do you think that "randomizing" non-canonical addresses and giving them to userspace programs makes sense? They are unusable, you simply can't address memory with those, the CPU would just raise a general protection fault. See also: [Why does Linux favor 0x7f mappings?](https://stackoverflow.com/q/61561331/3889449) – Marco Bonelli Nov 27 '21 at 22:58
  • @MarcoBonelli Well, I thought it was OS specific, but I just realized this doesn't make sense since translation is fully done in hardware. So if the hardware basically ignores the first 2 bytes and does a 9/9/9/9 lookup on the following bits, then we can't do much to use the full space... Sorry – Aaa Bbb Nov 28 '21 at 01:08
  • @AaaBbb well yes you're partially right. It is both OS and CPU specific in reality. The OS could restrict the user address space in different ways depending on how it sets up the CPU at boot. However for x86 you can't go outside "canonical" addresses as it's a hardware limitation. – Marco Bonelli Nov 28 '21 at 01:13
  • 2
    Hardware doesn't *ignore* the upper 16 bits. It checks that they match bit 47 and raises #GP if they don't match. – prl Nov 28 '21 at 02:01

1 Answer


Obviously Linux won't give your process unusable addresses; that would make it raise a #GP(0) exception (and thus segfault) when it tries to execute code from _start (or, if close to the cutoff, when it tries to load or store .data or .bss).

That would actually happen on the instruction that tried to set RIP to a non-canonical value in the first place, likely an iret or sysret¹.


On systems with 48-bit virtual addresses, zero to 0000_7fff_ffff_ffff is the full lower half of virtual address space when represented as a sign-extended 64-bit value.
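To make the sign-extension rule concrete, here is a small sketch (my own illustration, not part of the answer): an address is canonical when bits [63:47] (or [63:56] with PML5) are all copies of the top bit of the usable range.

```python
def is_canonical(addr, va_bits=48):
    """True if a 64-bit address is canonical for the given virtual address
    width: bits [63 : va_bits-1] must all equal bit va_bits-1."""
    top = addr >> (va_bits - 1)  # bit 47 (or 56) plus everything above it
    return top == 0 or top == (1 << (64 - va_bits + 1)) - 1

# With 48-bit virtual addresses, the low half ends at 0x0000_7fff_ffff_ffff:
print(is_canonical(0x00007FFFFFFFFFFF))  # True: top of the low half
print(is_canonical(0x0000800000000000))  # False: start of the non-canonical hole
print(is_canonical(0xFFFF800000000000))  # True: start of the high (kernel) half
```

The same function covers PML5 by passing `va_bits=57`.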

On systems where PML5 is supported (and used by the kernel), virtual addresses are 57 bits wide, so zero to 00ff_ffff_ffff_ffff is the low-half canonical range.

See https://www.kernel.org/doc/Documentation/x86/x86_64/mm.txt - the first row is the user-space range. (It talks about "56 bit" virtual addresses; that's incorrect or misleading. PML5 is 57-bit: one extra full level of page tables, at 9 bits per level. So the low half is 56 bits of usable range with a 0 in bit 57, and the high half is 56 bits with a 1 in bit 57.)
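The widths follow directly from the paging structure: a 12-bit page offset plus 9 translation bits per page-table level. A tiny sanity check (my own illustration, not from the kernel doc):

```python
def va_bits(levels):
    # 4 KiB pages give a 12-bit page offset; each page-table level is a
    # 512-entry table, i.e. 9 bits of index per level.
    return 12 + 9 * levels

print(va_bits(4))  # 48: 4-level paging (PML4)
print(va_bits(5))  # 57: 5-level paging (PML5)
```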

========================================================================================================================
    Start addr    |   Offset   |     End addr     |  Size   | VM area description
========================================================================================================================
                  |            |                  |         |
 0000000000000000 |    0       | 00007fffffffffff |  128 TB | user-space virtual memory, different per mm
__________________|____________|__________________|_________|___________________________________________________________
                  |            |                  |         |
 0000800000000000 | +128    TB | ffff7fffffffffff | ~16M TB | ... huge, almost 64 bits wide hole of non-canonical
                  |            |                  |         |     virtual memory addresses up to the -128 TB
                  |            |                  |         |     starting offset of kernel mappings.
__________________|____________|__________________|_________|___________________________________________________________
                                                            |
                                                            | Kernel-space virtual memory, shared between all processes:
...

Or for PML5:

 0000000000000000 |    0       | 00ffffffffffffff |   64 PB | user-space virtual memory, different per mm
__________________|____________|__________________|_________|___________________________________________________________
                  |            |                  |         |
 0100000000000000 |  +64    PB | feffffffffffffff | ~16K PB | ... huge, still almost 64 bits wide hole of non-canonical
                  |            |                  |         |     virtual memory addresses up to the -64 PB
                  |            |                  |         |     starting offset of kernel mappings.
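You can confirm what your own kernel hands out by parsing /proc/self/maps, as the comments suggest. A hedged sketch of that check (my own helper, not from the answer; it skips the legacy [vsyscall] mapping, which is a fixed kernel-provided page in the high half):

```python
def highest_user_mapping(maps_text):
    """Return the highest end address among /proc/<pid>/maps entries,
    skipping the fixed [vsyscall] page that lives in the high half."""
    hi = 0
    for line in maps_text.splitlines():
        fields = line.split()
        if fields[-1] == "[vsyscall]":
            continue
        start, end = (int(x, 16) for x in fields[0].split("-"))
        hi = max(hi, end)
    return hi

# Sample maps output (addresses are made up but realistic):
sample = (
    "55d07c9b1000-55d07c9b2000 r-xp 00000000 103:02 1048578 /usr/bin/true\n"
    "7ffc5e4c8000-7ffc5e4e9000 rw-p 00000000 00:00 0 [stack]\n"
    "ffffffffff600000-ffffffffff601000 --xp 00000000 00:00 0 [vsyscall]\n"
)
top = highest_user_mapping(sample)
print(hex(top))                      # 0x7ffc5e4e9000
print(top <= 0x0000_8000_0000_0000)  # True: inside the canonical low half
```

On a live system you would pass it `open("/proc/self/maps").read()` instead of the sample string.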

Footnote 1:
As prl points out, this design allows an implementation to literally only have 48 actual bits to store RIP values anywhere in the pipeline, apart from checking jump targets and detecting signed overflow in case execution runs off the end into non-canonical territory. (Maybe saving transistors in every place that has to store a uop, which needs to know its own address.) Otherwise, if you could jump / iret to an arbitrary RIP, the #GP(0) exception would have to push the correct 64-bit non-canonical address, which would mean the CPU would have to remember it temporarily.

It's also more useful for debugging to see where you jumped from, so it makes sense to design the rule this way because there's no use-case for jumping to a non-canonical address on purpose. (Unlike jumping to an unmapped page, where the #PF exception handler can repair the situation, e.g. by demand paging, so for that you want the fault address to be new RIP.)

Fun fact: using sysret with a non-canonical RIP on Intel CPUs will #GP(0) in ring 0 (CPL=0), so RSP isn't switched and still points at the user stack. If any other threads existed, they could mess with memory the kernel was using as a stack. This is a design flaw in IA-32e, Intel's implementation of x86-64, and it's why Linux uses iret to return to user space from the syscall entry point if ptrace has been used on the process during that system call. The kernel knows a fresh process will have a safe RIP, so it may actually use sysret to jump to user-space faster.

Peter Cordes
  • I see, I was confused since I forgot translation is done in hardware (see my comment). In fact I am asking this question from a security point of view: if I'm not mistaken, the lower 12 bits (the page offset) of an instruction's address will always stay the same no matter which pages ASLR chooses, so that leaves "only" 36 bits of randomness, which is high but not *that* high, so it may be brute-forced – Aaa Bbb Nov 28 '21 at 01:14
  • 1
    @AaaBbb: Yes, that's the theoretical max amount of entropy for ASLR. But Linux isn't that extreme: [ASLR bits of Entropy of mmap()](https://stackoverflow.com/q/13826479) is an old Q&A. Think about what brute-forcing involves, though: usually a crash of the program under attack. The idea is to have enough entropy that there's a very low chance of guessing right on the first attempt, or a few attempts, presumably with monitoring / intrusion detection noticing repeated crashes. – Peter Cordes Nov 28 '21 at 01:33
  • 1
    Re "That would be the first thing that happened after the kernel entered user-space via IRET with a non-canonical RIP." The #GP occurs in the kernel on the IRET instruction, rather than in user space when it tries to read memory using the non-canonical address. I think it was defined that way so that the processor doesn't have to store all 64 bits of RIP. – prl Nov 28 '21 at 01:58
  • @prl: thanks, interesting point about never needing to store more than the virt address width. Probably lots of places need to store uops, and each one probably needs its own address, so that may well save transistors. – Peter Cordes Nov 28 '21 at 02:43