
I have been browsing for a while and I am trying to understand how memory is allocated on the stack when doing, for example:

push rax

Or moving the stack pointer to allocate space for local variables of a subroutine:

sub rsp, X    ;Move stack pointer down by X bytes 

What I understand is that the stack segment is anonymous in the virtual memory space, i.e., not file-backed.

What I also understand is that the kernel will not actually map an anonymous virtual memory segment to physical memory until the program actually does something with that memory segment, i.e., writes data. So, trying to read that segment before writing to it may cause an error.

In the first example the kernel will assign a page frame in physical memory if needed. In the second example I assume that the kernel will not assign any physical memory to the stack segment until the program actually writes data to an address in the stack segment.

Am I on the right track here?

deftextra

2 Answers


Yes, you're on the right track here, pretty much. sub rsp, X is kind of like "lazy" allocation: the kernel only does anything after a #PF page-fault exception from touching memory above the new RSP, not from merely modifying registers. But you can still consider the memory "allocated", i.e. safe for use.

So, trying to read that segment before writing to it may cause an error.

No, a read won't cause an error. Anonymous pages that have never been written are copy-on-write mapped to a/the physical zero page, whether they're in the BSS, the stack, or mmap(MAP_ANONYMOUS).

Fun fact: in micro-benchmarks, make sure you write each page of memory for input arrays, otherwise you're actually looping over the same physical 4k or 2M page of zeros repeatedly and will get L1D cache hits even though you still get TLB misses (and soft page faults)! gcc will optimize malloc+memset(0) to calloc, but std::vector will actually write all the memory whether you want it to or not. memset on global arrays is not optimized out, so that works. (Or non-zero initialized arrays will be file-backed in the data segment.)


Note, I'm leaving out the difference between mapped vs. wired. i.e. whether an access will trigger a soft/minor page fault to update the page tables, or whether it's just a TLB miss and the hardware page-table walk will find a mapping (to the zero page).

But stack memory below RSP may not be mapped at all, so touching it without moving RSP first can be an invalid page fault instead of a "minor" page fault to sort out copy-on-write.


Stack memory has an interesting twist: The stack size limit is something like 8MB (ulimit -s), but in Linux the initial stack for the first thread of a process is special. For example, I set a breakpoint in _start in a hello-world (dynamically linked) executable, and looked at /proc/<PID>/smaps for it:

7ffffffde000-7ffffffff000 rw-p 00000000 00:00 0                          [stack]
Size:                132 kB
Rss:                   8 kB
Pss:                   8 kB
Shared_Clean:          0 kB
Shared_Dirty:          0 kB
Private_Clean:         0 kB
Private_Dirty:         8 kB
Referenced:            8 kB
Anonymous:             8 kB
...

Only 8kiB of stack has been referenced and is backed by physical pages. That's expected, since the dynamic linker doesn't use a lot of stack.

Only 132kiB of stack is even mapped into the process's virtual address space. But special magic stops mmap(NULL, ...) from randomly choosing pages within the 8MiB of virtual address space that the stack could grow into.

Touching memory below the current stack mapping but within the stack limit causes the kernel to grow the stack mapping (in the page-fault handler).

(But only if rsp is adjusted first; the red zone is only 128 bytes below rsp, so ulimit -s unlimited doesn't make touching memory 1GB below rsp grow the stack to there, but it will if you decrement rsp to there and then touch memory.)

This only applies to the initial/main thread's stack. pthreads just uses mmap(MAP_ANONYMOUS|MAP_STACK) to map an 8MiB chunk that can't grow. (MAP_STACK is currently a no-op.) So thread stacks can't grow after allocation (except manually with MAP_FIXED if there's space below them), and aren't affected by ulimit -s unlimited.


This magic preventing other things from choosing addresses in the stack-growth region doesn't exist for mmap(MAP_GROWSDOWN), so do not use it to allocate new thread stacks. (Otherwise you could end up with something using up the virtual address space below the new stack, leaving it unable to grow). Just allocate the full 8MiB. See also Where are the stacks for the other threads located in a process virtual address space?.

MAP_GROWSDOWN does have a grow-on-demand feature, described in the mmap(2) man page, but there's no growth limit (other than coming close to an existing mapping), so (according to the man page) it's based on a guard-page like Windows uses, not like the primary thread's stack.

Touching memory multiple pages below the bottom of a MAP_GROWSDOWN region might segfault (unlike with Linux's primary-thread stack). Compilers targeting Linux don't generate stack "probes" to make sure each 4k page is touched in order after a big allocation (e.g. local array or alloca), so that's another reason MAP_GROWSDOWN isn't safe for stacks.

Compilers do emit stack probes on Windows.

(MAP_GROWSDOWN might not even work at all, see @BeeOnRope's comment. It was never very safe to use for anything, because stack clash security vulnerabilities were possible if the mapping grows close to something else. So just don't use MAP_GROWSDOWN for anything ever. I'm leaving in the mention to describe the guard-page mechanism Windows uses, because it's interesting to know that Linux's primary-thread stack design isn't the only one possible.)

Peter Cordes
  • Linux doesn't use guard pages to grow the stack (and indeed didn't even have anything called "guard pages" related to the stack until relatively recently). There is no need for compilers to "probe" the stack, so you can jump over many pages and touch a page near the "end" of the stack without a problem (all the intervening pages are mapped as well). Interestingly, Windows _does_ work as you describe: it has a single guard page, and touching that page will trigger an expansion of the stack and set up a new guard page. – BeeOnRope Oct 17 '17 at 19:27
  • On Windows, the compiler is then responsible for ensuring that code which uses the stack never jumps over the guard page, since such programs will crash rather than growing the stack. With local variables larger than 4K, this means a call to `__chkstk` is [inserted](https://godbolt.org/g/seb176) to do stack probing (or something smarter) to ensure the stack is correctly allocated. Now, Linux does have something called "guard pages" between the max stack size and any adjacent VMAs, but that's just to reduce the chance that a large change in the stack pointer runs into an unrelated memory area (aka stack clash). – BeeOnRope Oct 17 '17 at 19:33
  • @BeeOnRope: Thanks. MAP_GROWSDOWN documents a guard page, so I guess that makes it doubly unsafe for stacks: If that documentation is literally true, it will fault instead of growing if a thread stack ever skips a page. – Peter Cordes Oct 18 '17 at 00:02
  • Where are you seeing that doc? As far as I can tell from reading the kernel source `MAP_GROWSDOWN` simply considers the entire space between the current VMA range for the `GROWSDOWN` mapping and the closest lower-addressed mapping as "implicitly" part of the area that can be automatically allocated by that feature. So as long as you don't jump across (or into) an earlier mapping, it seems to me it would work (this also partly explains why they don't have a MAP_GROWSUP flag: that would leave the region between a lower UP and high DOWN mapping with ambiguous ownership). – BeeOnRope Oct 18 '17 at 00:06
  • Interesting, I see the [doc](http://man7.org/linux/man-pages/man2/mmap.2.html) here, but my local `mmap` man page doesn't have it, simply saying: **MAP_GROWSDOWN** _Used for stacks. Indicates to the kernel virtual memory system that the mapping should extend downward in memory._ – BeeOnRope Oct 18 '17 at 00:09
  • @BeeOnRope: Your copy might be slightly old. Until recently the documentation seemed to suggest it was safe. So do several SO answers (probably influenced by that). Updated my answer, thanks for the info that Linux doesn't use guard pages at all for stacks. – Peter Cordes Oct 18 '17 at 00:18
  • It looks like perhaps it was added in man pages for [4.09](http://man7.org/linux/man-pages/changelog.html#release_4.09). Even though I'm on 4.10, my man page version is stuck in 4.04. – BeeOnRope Oct 18 '17 at 00:18
  • @PeterCordes: I've looked into it more, and the answer seems to be "it's complex, but the documentation is probably wrong". On my box, allocating large amounts on the stack and jumping deep into it (i.e., to a much lower address), skipping many pages, works fine. That's consistent with my checks in the kernel source. On my box `MAP_GROWSDOWN` doesn't work at all: it always faults when accessing below the mapped region using [code like this](https://unix.stackexchange.com/a/79256/87246). This seems like maybe a [new bug](https://patchwork.kernel.org/patch/9802797/). – BeeOnRope Oct 18 '17 at 03:22
  • As far as I can tell, there were basically two flows through the kernel: the one that hits the guard page, which ends up in `__do_anonymous_page`, and the flow when you skip over the guard page, which ends up [here in `__do_page_fault` for x86](https://patchwork.kernel.org/patch/9796395/). There you can see that the code handles the `MAP_GROWSDOWN` case with a check of `rsp`: so you can't at all use this as a general "grows down" area, since the kernel is actually checking that `rsp` is "close to" this area, otherwise it will fault. – BeeOnRope Oct 18 '17 at 03:28
  • What seems to have changed lately is that the old "single" guard page was just kind of added onto the stack's VMA, but now that they moved to a 1MB guard to mitigate stack clash, they account for it totally differently (so it won't get charged to the process ulimits, among other things), which made all the flows now go through `__do_page_fault`, causing the regression linked above, or something like that. All that to say I don't see evidence for the behavior described by the man page: under the covers the guard page may be hit, but the behavior is supposed to be the same as the skip over case. – BeeOnRope Oct 18 '17 at 03:31
  • [Here's the change](https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=1be7107fbe18eed3e319a6c3e83c78254b693acb) to the stack guard page handling, which led to a bunch of follow-on issues, like [this one](http://seclists.org/oss-sec/2017/q2/562). – BeeOnRope Oct 18 '17 at 03:36
  • Finally, this also answers one question you had above: the region which is considered the "stack growth region" seems to be arbitrarily large, as long as `rsp` is adjusted first (which compilers do, of course). I was able to write 1 GB beyond the current allocated stack (with `ulimit -s unlimited`) and Linux was happy to grow the stack to 1 GB. This only works because the primary process stack lives at the top of the VM space with about 10 TB before it hits anything else: this won't work with `pthreads` threads, which have a fixed stack size that doesn't use the `GROWSDOWN` stuff at all. – BeeOnRope Oct 18 '17 at 03:50
  • @BeeOnRope: Thanks for all the research, linked to several of these comments from my answer. – Peter Cordes Oct 18 '17 at 03:55
  • "Compilers targeting Linux don't generate stack "probes" to make sure each 4k page is touched in order after a big allocation (e.g. local array or alloca), so that's another reason MAP_GROWSDOWN isn't safe for stacks" - I think there is something off here. Provided your kernel has the pre-stackclash fix for [CVE-2010-2240](https://bugzilla.redhat.com/show_bug.cgi?id=CVE-2010-2240), there will be a guard page or guard gap. Unless your kernel got a screwed up version of the 2017 stackclash fixes (which increased the gap to 256 pages). Note pthread_attr_setguardsize() default is PAGE_SIZE :-). – sourcejedi Jul 07 '19 at 11:45
  • @sourcejedi: that guard page isn't to trigger growth, it's to prevent anything *else* from allocating that page. And to SIGBUS if anything tries to access it. For thread stacks, the full threadstack mapping is done up-front, so no growth is needed (just lazy allocation of logical mappings vs. backed by physical page and wired into the HW page tables). See [What is the actual size of stack red zone?](//stackoverflow.com/posts/comments/100386001) for a recent test that you can still grow the main stack arbitrarily by moving RSP very far and touching mem. – Peter Cordes Jul 07 '19 at 11:59
  • @PeterCordes my comment is not concerned with triggering growth specifically, only whether access beyond the stack will be detected, i.e. "safety". – sourcejedi Jul 07 '19 at 12:01
  • @sourcejedi: Accessing unmapped pages below the red-zone will always fault. You only need guard pages to make sure nothing *else* maps that memory below the bottom of the actual stack proper, creating a non-fault (stack clash). The reason Linux upped it to 256 guard pages is precisely *because* a huge overflow of a VLA could skip over a single guard page. If Linux did require stack probes while growing the stack by more than 1 page, you wouldn't need more than 1 guard page for normal compiler-generated code. (I haven't read the details recently on these guard pages; I think that's right) – Peter Cordes Jul 07 '19 at 12:10

Stack allocation uses the same virtual-memory mechanism that controls access to addresses (the page fault). I.e., if your current stack has 7ffd41ad2000-7ffd41af3000 as its bounds:

myaut@panther:~> grep stack /proc/self/maps                                                     
7ffd41ad2000-7ffd41af3000 rw-p 00000000 00:00 0      [stack]

Then if the CPU tries to read or write data at address 7ffd41ad1fff (1 byte below the stack mapping's lower boundary), it will generate a page fault because the OS didn't provide a corresponding chunk of allocated memory (a page). So push or any other memory-accessing instruction with %rsp as the address will trigger a page fault.

In the page-fault handler, the kernel checks whether the stack can be grown; if so, it allocates a page backing the faulting address (7ffd41ad1000-7ffd41ad2000), or triggers SIGSEGV if, say, the stack ulimit is exceeded.

myaut