How to mmap the stack for the clone() system call on linux?

Question

The clone() system call on Linux takes a parameter pointing to the stack for the newly created thread to use. The obvious way to do this is to simply malloc some space and pass that, but then you have to be sure you've malloc'd as much stack space as that thread will ever use (hard to predict).

I remembered that when using pthreads I didn't have to do this, so I was curious what it did instead. I came across this site which explains, "The best solution, used by the Linux pthreads implementation, is to use mmap to allocate memory, with flags specifying a region of memory which is allocated as it is used. This way, memory is allocated for the stack as it is needed, and a segmentation violation will occur if the system is unable to allocate additional memory."

The only context I've ever heard mmap used in is for mapping files into memory, and indeed reading the mmap man page it takes a file descriptor. How can this be used for allocating a stack of dynamic length to give to clone()? Is that site just crazy? ;)

In either case, doesn't the kernel need to know how to find a free bunch of memory for a new stack anyway, since that's something it has to do all the time as the user launches new processes? Why does a stack pointer even need to be specified in the first place if the kernel can already figure this out?

Related: [How is Stack memory allocated when using 'push' or 'sub' x86 instructions?](https://stackoverflow.com/q/46790666) describes the growth mechanism for the main-thread stack, and why it can't be used for thread stacks, and what pthreads does instead. — Peter Cordes, Oct 20 '21 at 03:06

R.. GitHub STOP HELPING ICE · Answer 1 · 2017-02-03T16:24:44.393

Stacks are not, and never can be, unlimited in their space for growth. Like everything else, they live in the process's virtual address space, and the amount by which they can grow is always limited by the distance to the adjacent mapped memory region.

When people talk about the stack growing dynamically, what they might mean is one of two things:

1. Pages of the stack might be copy-on-write zero pages, which do not get private copies made until the first write is performed.
2. Lower parts of the stack region may not yet be reserved (and thus not count towards the process's commit charge, i.e. the amount of physical memory/swap the kernel has accounted for as reserved for the process) until a guard page is hit, in which case the kernel commits more and moves the guard page, or kills the process if there is no memory left to commit.
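
The first point can be observed from userspace. The sketch below is only an illustration, not how any threads implementation is written: it reserves an anonymous mapping, writes to the byte at the very top (where a downward-growing stack would begin), and asks `mincore` how many pages are resident before and after. The names `resident_pages` and `lazy_stack_demo` are made up for this example, and the exact counts depend on the kernel configuration (transparent huge pages, for instance, can fault in more than one page at once).

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper: number of resident pages in [addr, addr + len),
   or -1 on error. addr must be page-aligned. */
static long resident_pages(void *addr, size_t len)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    size_t npages = (len + page - 1) / page;
    unsigned char *vec = malloc(npages);
    long count = 0;

    if (vec == NULL)
        return -1;
    if (mincore(addr, len, vec) == -1) {
        free(vec);
        return -1;
    }
    for (size_t i = 0; i < npages; i++)
        count += vec[i] & 1;          /* bit 0 set means the page is resident */
    free(vec);
    return count;
}

/* Reserve len bytes of address space, touch only the top byte, and report
   how many pages were resident before and after. Returns 0 on success. */
int lazy_stack_demo(size_t len, long *before, long *after)
{
    assert(len > 0 && before != NULL && after != NULL);

    char *base = mmap(NULL, len, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (base == MAP_FAILED)
        return -1;

    *before = resident_pages(base, len);
    *(volatile char *)(base + len - 1) = 1;   /* first write, near the top */
    *after = resident_pages(base, len);

    munmap(base, len);
    return (*before < 0 || *after < 0) ? -1 : 0;
}
```

Calling `lazy_stack_demo(64u << 20, &before, &after)` on a typical Linux system should report far fewer resident pages than the 64 MiB reserved, although some sandboxes report every page as resident.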

Trying to rely on the MAP_GROWSDOWN flag is unreliable and dangerous because it cannot protect you against mmap creating a new mapping just adjacent to your stack, which will then get clobbered. (See http://lwn.net/Articles/294001/) For the main thread, the kernel automatically reserves the stack-size ulimit worth of address space (not memory) below the stack and prevents mmap from allocating it. (But beware! Some broken vendor-patched kernels disable this behavior leading to random memory corruption!) For other threads, you simply must mmap the entire range of address space the thread might need for stack when creating it. There is no other way. You could make most of it initially non-writable/non-readable, and change that on faults, but then you'd need signal handlers and this solution is not acceptable in a POSIX threads implementation because it would interfere with the application's signal handlers. (Note that, as an extension, the kernel could offer special MAP_ flags to deliver a different signal instead of SIGSEGV on illegal access to the mapping, and then the threads implementation could catch and act on this signal. But Linux at present has no such feature.)
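
Mapping the whole range up front, with an inaccessible guard region at the low end so that an overflow faults instead of silently running into a neighbouring mapping, might look like the sketch below. `struct thread_stack`, `alloc_thread_stack` and `free_thread_stack` are hypothetical names, and the sizes are left to the caller; this is not the code any particular pthreads implementation uses.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

struct thread_stack {
    char  *base;   /* lowest address of the mapping (start of the guard) */
    char  *top;    /* one past the highest usable byte: initial stack pointer */
    size_t total;  /* guard_size + stack_size, needed for munmap */
};

/* Map guard_size + stack_size bytes with no access rights, then open up
   everything above the guard. Both sizes must be multiples of the page size.
   Returns 0 on success, -1 on failure. */
int alloc_thread_stack(struct thread_stack *s, size_t stack_size, size_t guard_size)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    assert(stack_size > 0 && stack_size % page == 0 && guard_size % page == 0);

    size_t total = stack_size + guard_size;
    char *base = mmap(NULL, total, PROT_NONE,
                      MAP_PRIVATE | MAP_ANONYMOUS | MAP_STACK, -1, 0);
    if (base == MAP_FAILED)
        return -1;

    /* The guard stays PROT_NONE, so running off the bottom of the stack
       raises SIGSEGV instead of touching whatever is mapped below. */
    if (mprotect(base + guard_size, stack_size, PROT_READ | PROT_WRITE) == -1) {
        munmap(base, total);
        return -1;
    }

    s->base = base;
    s->top = base + total;
    s->total = total;
    return 0;
}

void free_thread_stack(struct thread_stack *s)
{
    munmap(s->base, s->total);
}
```

The `top` pointer is what would be handed to clone() on architectures where the stack grows down.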

Finally, note that the clone syscall does take a stack pointer argument. (An earlier version of this answer said it did not.) It would be unsafe to wait and change the stack pointer in the "child" after returning to userspace: unless signals are all blocked, a signal handler could run immediately on the wrong stack, and on some architectures the stack pointer must be valid and point to an area safe to write at all times.

The syscall must nevertheless be performed from assembly code, because the userspace wrapper has to continue execution in the "child" thread on the new stack and avoid writing anything to the parent's stack. This cannot be done from C: the compiler is free to access the caller's stack frame right after the syscall returns, and in the child that frame does not exist on the new stack.
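
With glibc, that assembly lives inside the `clone()` wrapper, so a C caller only has to supply the top of the new stack. As a rough check (assuming a downward-growing stack and glibc's wrapper; `struct probe`, `record_local_address` and `child_runs_on_supplied_stack` are names invented here), the child below records the address of one of its locals so the parent can confirm that it falls inside the supplied mapping:

```c
#define _GNU_SOURCE          /* for clone() */
#include <assert.h>
#include <sched.h>
#include <signal.h>
#include <stdint.h>
#include <sys/mman.h>
#include <sys/wait.h>

struct probe {
    uintptr_t local_addr;    /* filled in by the child */
};

/* Runs in the child: store the address of a local variable. */
static int record_local_address(void *arg)
{
    struct probe *p = arg;
    int local = 0;

    p->local_addr = (uintptr_t)&local;
    return 0;
}

/* Returns 1 if the child's local lived inside the stack we supplied,
   0 if it did not, and -1 if mmap, clone or waitpid failed. */
int child_runs_on_supplied_stack(void)
{
    const size_t len = 256 * 1024;
    assert(len % 16 == 0);   /* keeps the initial stack pointer 16-byte aligned */

    char *stack = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_STACK, -1, 0);
    if (stack == MAP_FAILED)
        return -1;

    struct probe p = { 0 };
    /* CLONE_VM shares the address space, so the child can write into p;
       SIGCHLD lets the parent reap the child with waitpid(). */
    pid_t pid = clone(record_local_address, stack + len,
                      CLONE_VM | SIGCHLD, &p);
    int ok = pid != -1 && waitpid(pid, NULL, 0) == pid;

    uintptr_t lo = (uintptr_t)stack;
    uintptr_t hi = lo + len;
    int inside = ok && p.local_addr >= lo && p.local_addr < hi;

    munmap(stack, len);
    return ok ? inside : -1;
}
```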

My understanding is `MAP_GROWSDOWN` was belatedly fixed: [CVE-2010-2240](https://bugzilla.redhat.com/show_bug.cgi?id=CVE-2010-2240). In the later [2017 fix](https://bugzilla.redhat.com/show_bug.cgi?id=1461333) for [Stack Clash](https://blog.qualys.com/securitylabs/2017/06/19/the-stack-clash), `MAP_GROWSDOWN` reserves a larger guard gap of 256 pages (1MiB on x86). It is still widely used for the main thread stack anyway. But for threads, I think it is better practice to use fixed-size stacks with manual guard mappings: more reliable (deterministic) and more portable (vs. 32-bit VM exhaustion). – sourcejedi, Jul 07 '19 at 11:20
If we talk about danger, we should note that 1) the default [guard mapping in pthreads](http://man7.org/linux/man-pages/man3/pthread_attr_setguardsize.3.html) is still only one page, and 2) although gcc has an option that might avoid accidentally "jumping over" the guard page, it is not enabled by default, and the documentation is not very confident: "[`-fstack-clash-protection` may also provide limited protection for static stack allocations if the target supports `-fstack-check=specific`](https://gcc.gnu.org/onlinedocs/gcc-9.1.0/gcc/Instrumentation-Options.html#index-fstack-check)". – sourcejedi, Jul 07 '19 at 11:31

nos · Answer 2 · 2009-07-04T23:55:01.207


You'd want the MAP_ANONYMOUS flag for mmap, and MAP_GROWSDOWN since you want to use it as a stack.

Something like:

void *stack = mmap(NULL,initial_stacksize,PROT_WRITE|PROT_READ,MAP_PRIVATE|MAP_GROWSDOWN|MAP_ANONYMOUS,-1,0);

See the mmap man page for more info. And remember, clone is a low-level concept that you're not meant to use unless you really need what it offers. And it offers a lot of control, like letting you set the child's stack, just in case you want to do some trickery (like having the stack accessible in all the related processes). Unless you have a very good reason to use clone, stick with fork or pthreads.

edited Jul 04 '09 at 23:55

answered Jul 04 '09 at 23:46

nos


How does this get you a dynamically growing stack though? Don't you still have to specify a length? Or do implementations like pthreads pass a gigantic length and rely on copy on write? – Joseph Garvin Jul 05 '09 at 00:37
Yes, they rely on copy on write. I'm not sure how big the pthread stack size is now, it used to be 2Mb by default - you can alter it with the ulimit -s command. – nos Jul 05 '09 at 10:58
Ok, testing with pthread_attr_getstacksize suggests the default stack size is 10485760 bytes nowadays, and – nos Jul 05 '09 at 11:17
I think your comment was cut off after "and". – Joseph Garvin May 03 '10 at 14:12
`MAP_GROWSDOWN` is dangerous and should never be used. See http://lwn.net/Articles/294001/ – R.. GitHub STOP HELPING ICE Mar 20 '11 at 14:08
On the [man page of clone](https://man7.org/linux/man-pages/man2/clone.2.html), the sample code employs `MAP_STACK` instead of `MAP_GROWSDOWN` – TJM Jul 13 '21 at 08:52

score 2 · Accepted Answer · answered Jul 09 '09 at 15:02

Joseph, in answer to your last question:

When a user creates a "normal" new process, that's done by fork(). In this case, the kernel doesn't have to worry about creating a new stack at all, because the new process is a complete duplicate of the old one, right down to the stack.

If the user replaces the currently running process using exec(), then the kernel does need to create a new stack - but in this case that's easy, because it gets to start from a blank slate. exec() wipes out the memory space of the process and reinitialises it, so the kernel gets to say "after exec(), the stack always lives HERE".

If, however, we use clone(), then we can say that the new process will share a memory space with the old process (CLONE_VM). In this situation, the kernel can't leave the stack as it was in the calling process (like fork() does), because then our two processes would be stomping on each other's stack. The kernel also can't just put it in a default location (like exec() does), because that location is already taken in this memory space. The only solution is to let the calling process find a place for it, which is what clone() does.
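
The difference is easy to see with a small experiment. In this sketch (the function and variable names are made up, and error handling is minimal), a child created with fork() writes to its own copy of a global, while a child created with clone(CLONE_VM) writes to the very same memory as its parent, which is why it has to be handed a separate stack:

```c
#define _GNU_SOURCE          /* for clone() */
#include <assert.h>
#include <sched.h>
#include <signal.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

static volatile int marker;  /* written by the child, read by the parent */

static int set_marker(void *arg)
{
    (void)arg;
    marker = 1;
    return 0;
}

/* fork(): the child gets a copy of the whole address space, stack included,
   so its write is invisible to the parent. Returns the parent's marker. */
int marker_after_fork(void)
{
    marker = 0;
    pid_t pid = fork();
    if (pid == -1)
        return -1;
    if (pid == 0) {
        marker = 1;
        _exit(0);
    }
    if (waitpid(pid, NULL, 0) == -1)
        return -1;
    return marker;
}

/* clone(CLONE_VM): parent and child share one address space, so the caller
   must supply a stack and the child's write is visible to the parent. */
int marker_after_clone_vm(void)
{
    const size_t len = 256 * 1024;
    assert(len % (size_t)sysconf(_SC_PAGESIZE) == 0);

    char *stack = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_STACK, -1, 0);
    if (stack == MAP_FAILED)
        return -1;

    marker = 0;
    /* Pass the top of the mapping, since the stack grows down. */
    pid_t pid = clone(set_marker, stack + len, CLONE_VM | SIGCHLD, NULL);
    int ok = pid != -1 && waitpid(pid, NULL, 0) == pid;

    munmap(stack, len);
    return ok ? marker : -1;
}
```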

score 2 · Answer 4 · edited Sep 03 '19 at 03:33

Here is the code, which mmaps a stack region and instructs the clone system call to use this region as the stack.

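#define _GNU_SOURCE     /* needed for clone() */
#include <sched.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/wait.h>

static int execute_clone(void *arg)
{
    (void)arg;
    printf("\nclone function executed\n");
    fflush(stdout);
    return 0;
}

int main(void)
{
    size_t len = 0x200000;  /* 2 MiB */
    char *ptr;
    pid_t pid;

    /* Anonymous mappings take fd -1. Let the kernel pick the address
       instead of forcing one with MAP_FIXED, which can replace
       existing mappings. The stack needs to be readable and writable. */
    ptr = mmap(NULL, len, PROT_READ | PROT_WRITE,
               MAP_ANONYMOUS | MAP_PRIVATE | MAP_GROWSDOWN, -1, 0);
    if (ptr == MAP_FAILED)
    {
        perror("mmap failed");
        return EXIT_FAILURE;
    }

    /* The stack grows down, so pass the top of the mapping.
       SIGCHLD makes the child waitable with waitpid(). */
    pid = clone(execute_clone, ptr + len, CLONE_VM | SIGCHLD, NULL);
    if (pid == -1)
    {
        perror("clone failed");
        munmap(ptr, len);
        return EXIT_FAILURE;
    }

    /* Wait for the child before exiting and unmapping its stack. */
    if (waitpid(pid, NULL, 0) == -1)
    {
        perror("waitpid failed");
    }

    munmap(ptr, len);
    return EXIT_SUCCESS;
}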

score 0 · Answer 5 · answered Jul 04 '09 at 23:43

mmap is more than just mapping a file into memory. In fact, some malloc implementations will use mmap for large allocations. If you read the fine man page you'll notice the MAP_ANONYMOUS flag, and you'll see that you need not supply a file descriptor at all.

As for why the kernel can't just "find a bunch of free memory", well if you want someone to do that work for you, either use fork instead, or use pthreads.

answered Jul 04 '09 at 23:43

Logan Capaldo


My point is that it should be able to "find a bunch of free memory" because it apparently *already can* "find a bunch of free memory." Fork creates a new process, which is different, and I know I could abstract any detail away by using a library. But I'm giving the kernel developers credit and assuming there's good reason for things to work this way, and I want to know why. – Joseph Garvin Jul 05 '09 at 01:43
fork (exec really, since fork just copies everything) are the "find me a bunch of free memory" functions. `clone` is the "I want to control the details of my process creation" function. pthread_create is the "create me a thread, use the defaults" function. These are your choices. New threads need their own stack, and you can't use the traditional method of allocating stack (start at the top/bottom of the (user) address space and grow down/up towards the heap, which is growing the other way), because there's only one top/bottom of the address space. – Logan Capaldo Jul 05 '09 at 01:56
My point is that when the user forks a process, even though that process gets its own memory space, at some level the kernel has to be doing memory management of the *physical* address space that the process-specific address spaces map into. If it can do that, it should have the logic to be able to handle letting me clone but handle the stack for me. I might want to use clone for reasons that don't have anything to do with stacks for example (see the clone flags). – Joseph Garvin Jul 05 '09 at 02:35
Mapping physical address space to virtual address space has nothing to do with allocating memory for the stack. I don't see why you think they're related? – Logan Capaldo Jul 05 '09 at 03:06
@Logan: Sure it does ;) When I launch a new process, it needs a stack. Even though the process has its own virtual address space, its stack (and therefore the beginning of its real address space) has to correspond to somewhere in the physical address space. Multiple processes with variable size stacks and heaps means the kernel has to be doing this memory management already. Unless I'm forgetting something... – Joseph Garvin Jul 05 '09 at 05:24
The kernel does memory management on a lower layer. You can tell it to use 100Mb as a stack. It won't use a single byte of that 100Mb (it's just virtual space after all) until you actually start using it; it'll fault in the physical memory pages that are accessed. You'll use only as much of the stack's memory as is needed, and it'll "grow" within the size of the mmap. The bad thing, of course, is that you need to set a fixed-size stack that cannot physically grow. Some OSes let you specify flags to mmap that allow it to grow automatically, but last I looked, which is quite some years ago, Linux did not. – nos Jul 05 '09 at 11:08
Joseph, noselasd is correct here. Mapping virtual to physical memory (and swap) happens independently of whether or not the memory is intended to be used as a stack or heap or something else. That part of the kernel doesn't need to be aware of that distinction. – Logan Capaldo Jul 05 '09 at 13:25
If the kernel is smart enough to not use it unless it's used, why not always specify the maximum? (all of memory) – Joseph Garvin Jul 05 '09 at 16:14
@noslead: Also, I assume the situation you're describing is if you pass a pointer produced by mmap as the stack. – Joseph Garvin Jul 05 '09 at 16:15
@joseph, because the virtual memory space is finite. There are e.g. shared libraries, which are mmapped into the virtual memory space. There's the executable code itself, and there's the data space (global variables, malloced memory), a somewhat special map that can be extended with the sbrk system call. And there are mmapped files that the application may want to map into memory too. These mmaps cannot overlap, and they need to have different protections (read/write/exec). Sure, you could specify all available memory, but that would clash with the space needed for shared libs and dynamic memory. – nos Jul 05 '09 at 16:56
http://duartes.org/gustavo/blog/post/anatomy-of-a-program-in-memory probably a better overview – nos Jul 05 '09 at 17:01
@nos: In addition to that, **commit charge** is finite. Unless you leave overcommit enabled (which means your server is going to crash when it gets overloaded), the kernel keeps detailed accounting of the maximum private writable (or read-only-but-already-modified) memory held by each process, and ensures that it can never exceed the size of swap plus a percentage (default 50%) of ram. This ensures that programs can never crash with OOM when accessing already-allocated memory. If you allocated 100mb for each thread stack, you'd run out in a hurry... – R.. GitHub STOP HELPING ICE Mar 20 '11 at 14:43

score 0 · Answer 6 · answered Jul 06 '09 at 21:18

Note that the clone system call doesn't take an argument for the stack location. It actually works just like fork. It's just the glibc wrapper which takes that argument.

answered Jul 06 '09 at 21:18

agl


Are you sure? Every signature I can find online for it includes a child stack. If the system call doesn't need it why does glibc? – Joseph Garvin Jul 07 '09 at 13:47
Otherwise, how would `glibc` return to you? – David Schwartz May 07 '12 at 22:10

score 0 · Answer 7 · answered Feb 13 '12 at 07:34

I think the stack grows downwards until it cannot grow any further, for example when it reaches memory that has already been allocated, at which point a fault may be raised. So the default size can be seen as the minimum available stack size: if there is still free space below the stack when it fills up, it can keep growing downwards; otherwise, the system may raise a fault.
