The x86-64 System V ABI specifies that the space beyond $rsp - 128 is the so-called red zone, which is not touched by any signal handlers. On my machine:

$ ulimit -s
8192

I expected the stack to contain only 2 pages, so I wrote the following program to test how far the red zone can expand:

PAGE_SIZE equ 0x1000
SYS_exit equ 0x3C

section .text
global _start 

_start:
    lea rcx, [rsp - 0x1f * PAGE_SIZE]
    mov rax, rsp
loop:
    sub rax, PAGE_SIZE
    mov qword [rax], -1
    cmp rax, rcx
    jne loop

    mov rax, SYS_exit
    mov rdi, 0x20
    syscall                 ; exit(0x20)

So I expected the program to always fail. Instead, it sometimes fails with SIGSEGV and sometimes finishes fine.

The behavior is exactly what the documentation of MAP_GROWSDOWN describes:

This flag is used for stacks. It indicates to the kernel virtual memory system that the mapping should extend downward in memory. The return address is one page lower than the memory area that is actually created in the process's virtual address space. Touching an address in the "guard" page below the mapping will cause the mapping to grow by a page. This growth can be repeated until the mapping grows to within a page of the high end of the next lower mapping, at which point touching the "guard" page will result in a SIGSEGV signal.

As discussed in this question, mappings created with MAP_GROWSDOWN and PROT_GROWSDOWN do not grow that way:

volatile char *mapped_ptr = mmap(NULL, 4096,
                        PROT_READ | PROT_WRITE | PROT_GROWSDOWN,
                        MAP_GROWSDOWN | MAP_ANONYMOUS | MAP_PRIVATE,
                        -1, 0); 

mapped_ptr[4095] = 'a';  //OK!
mapped_ptr[0]    = 'b';  //OK!
mapped_ptr[-1]   = 'c';  //SEGV

QUESTION: Combining the reasoning above, is it true that the only mapping that uses MAP_GROWSDOWN is the main thread's [stack] mapping?

Peter Cordes
St.Antario
  • The red zone is always the 128 bytes beneath the current value of RSP. It is not based on the size of the stack at all. – Michael Petch Jul 06 '19 at 15:33
  • @MichaelPetch You mean the interval `[rsp - 128, rsp]`? – St.Antario Jul 06 '19 at 15:34
  • That is correct. – Michael Petch Jul 06 '19 at 15:34
  • @MichaelPetch So what kind of zones go below the `rsp - 128`? In the assembly example I sometimes could expand the stack up to 20 pages below the `rsp`... – St.Antario Jul 06 '19 at 15:35
  • @MichaelPetch `The red zone is always the 128 bytes` - is that the same red zone Raymond Chen [discussed](https://devblogs.microsoft.com/oldnewthing/20190111-00/?p=100685)? – GSerg Jul 06 '19 at 15:42
  • @GSerg Very interesting article! So there is no red zone on Windows, and each time we push something on the stack we have to decrement `rsp` appropriately, don't we? – St.Antario Jul 06 '19 at 15:59
  • Note that the `ulimit` builtin prints sizes in 1024-byte increments. So when `ulimit -s` prints 8192, it means that your stack limit is 8 MiB, i.e. 2048 pages. – Chris Dodd Jul 06 '19 at 17:12
  • Since the OS is not tagged, it should be noted that Windows does not have a red zone. – rcgldr Jul 06 '19 at 20:27
  • @rcgldr: It does unambiguously link to the x86-64 System V ABI. Windows Subsystem for Linux *does* have a red zone; otherwise Windows doesn't use that ABI. – Peter Cordes Jul 06 '19 at 20:54
  • Why is the loop so over-complicated, vs. `sub rax, 4096`? I think you're doing the equivalent by redoing the multiply every time, but it's harder to follow. – Peter Cordes Jul 06 '19 at 23:47
  • Ok that's better, but still some wasted instructions. Like `mov [rax], eax` would be simpler, or `mov qword [rax], -1` if you really insist on a qword store with that value. IDK why you have a separate `mov rbx, -1` inside the loop, that makes no sense. Plus it would be more idiomatic to use `dec ecx/jnz` or something at the bottom of the loop, instead of dec/cmp/jnz. Especially starting with a non-zero ECX takes extra time for human readers to sort out the loop trip count. The clearest might be `lea rcx, [rsp - 0x20 * PAGE_SIZE]` as a lower bound for the pointer, just cmp/jne – Peter Cordes Jul 07 '19 at 08:25
  • Oh also, I'd have just put that loop in `_start`, not `call`ed it to give readers more code to sort through to figure out where RSP might be pointing when this runs. – Peter Cordes Jul 07 '19 at 08:30
  • You can also simplify your C: `volatile char *p = mmap(...);` then you can just do `p[4095] = 1;` `p[-1] = 1;` or whatever. C has compact / clear syntax for accessing memory *near* a pointer. – Peter Cordes Jul 07 '19 at 08:31
  • @PeterCordes - my comment was meant mostly for others that might be reading this. The question is tagged with system v abi, which isn't Windows, but there's no OS tag on the question, and it may not be clear to others reading this that a red zone is OS dependent. – rcgldr Jul 07 '19 at 09:58

2 Answers

5

You are confusing two different concepts. Apart from the fact that they both involve the stack, the red zone and the growth of the stack memory area are unrelated. Memory locations below the red zone but within the stack will be overwritten if a signal handler is called and no alternate signal stack is specified.

I suspect the mmap-allocated MAP_GROWSDOWN area fails to grow because another mapping lies shortly below it; mmap typically allocates virtual addresses consecutively downwards.

Timothy Baldwin
  • Thanks for the explanation of the red zone area. As for the `MAP_GROWSDOWN` failure, I checked the memory mapping of the process and found that the pointer returned by the `mmap` call was equal to `0x7fb361abc000`, but the closest mapped region below it was `7fb361aa2000`, so this is unlikely to be the case. I also tried to specify a specific address as an argument to `mmap`, but the result was the same error. – St.Antario Jul 06 '19 at 19:58
  • @St.Antario: the main thread's stack is *not* `MAP_GROWSDOWN`. It can grow as far as the max stack size in one step without "stack probes" that touch guard pages (but [only if RSP is decremented first](//stackoverflow.com/a/56718182/224132)), *and* the potential stack-growth area is reserved so other allocations don't accidentally steal it. (Neither of these is true for `MAP_GROWSDOWN`, which is why it's not safe for thread stacks. pthreads allocates full-size thread stacks because Linux does lazy alloc of physical pages anyway. See [this](//stackoverflow.com/q/46790666).) – Peter Cordes Jul 06 '19 at 20:50
3

None of this has anything to do with the red zone, because you aren't moving RSP. Memory protection works with page granularity, but the red zone is always just the 128 bytes below RSP, which are safe to read and write as well as safe from asynchronous clobbering.


No, nothing uses MAP_GROWSDOWN unless you use it manually. The main thread's stack uses a non-broken mechanism that doesn't let other mmap calls randomly steal its growth space. See my answer on "Analyzing memory mapping of a process with pmap. [stack]".

The sometimes-success of your asm code is an exact duplicate of "Why does this code crash with address randomization on?": you're touching memory up to 124 KiB below RSP (0x1f pages of 4 KiB each), so the initial allocation of 132 KiB happens to be enough sometimes, depending on ASLR and how much space args + env take on the stack.

"Why is MAP_GROWSDOWN mapping does not grow?" is the interesting part: MAP_GROWSDOWN may not work with a 1-page mapping. But again, this has nothing to do with stacks. The man page saying "This flag is used for stacks." is 100% wrong. That was the intent when the feature was added, but the design isn't actually usable, so the implementation may be buggy even relative to the documentation.

ecm
Peter Cordes
  • To summarize: the kernel is OK with growing the stack as long as `ulimit -s` is respected, but a failure occurs if we touch a page outside of the initially mapped `132KiB`, and the normal `#PF` handling mechanism sends `SIGSEGV` to the process since we tried to access memory not mapped by the process. I checked `sub rsp, 0x10000 * PAGE_SIZE\n mov qword [rsp], -1` with `ulimit -s unlimited` and it worked just fine even without the intermediate pages being touched, while touching the page without adjusting `rsp` results in `SIGSEGV`. – St.Antario Jul 07 '19 at 11:43
  • @St.Antario: yes, the kernel will treat #PF as valid and grow the stack mapping up to `ulimit -s`, as long as the bottom of the red-zone (or RSP) is below or inside the page that faulted. Otherwise it's just an invalid #PF -> SIGSEGV. (The first sentence of your comment left out the part about RSP having to move, so it's not a great summary.) – Peter Cordes Jul 07 '19 at 11:52