
I've been trying to piece together how stack memory is handed out to threads, but I haven't managed to put the whole picture together. I tried going to the code, but that only confused me more, so I'm asking for your help.

I asked this question a little while ago, so assume that particular program (and therefore that all threads belong to the same process). If I print, for each thread, where its stack begins and how much was allocated for it, I get something like the table at the end of this message, sorted by beginning of stack. The first column is a time_t in microseconds; the second doesn't matter here; the third is the tid of the thread; the fourth is the guard size; then come the beginning and end of the stack; the next-to-last column is the allocated stack size (8 MB by default); and the last column is the difference between the end of one allocated stack and the beginning of the next.
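
For concreteness, the kind of instrumentation I mean looks roughly like this (a simplified sketch using the GNU `pthread_getattr_np` extension; `print_stack_info` is just an illustrative name, and the last column of the table is computed afterwards when the rows are sorted):

#define _GNU_SOURCE
#include <pthread.h>
#include <stdio.h>
#include <stdint.h>
#include <sys/time.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Called from inside each thread: prints usec, tid, guard size,
   stack begin, stack end and stack size, one row per thread. */
static void print_stack_info(void)
{
    pthread_attr_t attr;
    void *stack_addr;
    size_t stack_size, guard_size;
    struct timeval tv;

    gettimeofday(&tv, NULL);
    pthread_getattr_np(pthread_self(), &attr);      /* GNU extension */
    pthread_attr_getstack(&attr, &stack_addr, &stack_size);
    pthread_attr_getguardsize(&attr, &guard_size);
    pthread_attr_destroy(&attr);

    printf("%ld %ld %zu %lu %lu %zu\n",
           (long)tv.tv_usec,
           (long)syscall(SYS_gettid),               /* kernel thread id */
           guard_size,
           (unsigned long)(uintptr_t)stack_addr,    /* lowest stack address */
           (unsigned long)((uintptr_t)stack_addr + stack_size),
           stack_size);
}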

This means (I think) that if the difference is 0, the stacks are contiguous; if it is positive, then, since the stack grows down in memory, there are that many MB of "free space" between one thread's stack and the next one in memory; and if it is negative, memory is being reused, which presumably means that stack space was freed before this thread was created.
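
To put numbers on that (assuming the last column is simply the gap in bytes divided by 2^20): a gap of exactly one default stack plus its guard page, 8388608 + 4096 = 8392704 bytes, shows up as 8392704 / 2^20 = 8.00390625, and a value of 0.0625 corresponds to 65536 bytes (64 KiB).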

My problem is: what exactly is the algorithm that assigns stack space to threads (at a higher level than the code itself)? And why do I sometimes get contiguous stacks, sometimes not, and sometimes values like 7.94140625 and 0.0625 in the last column?

This is all Linux 2.6, C and pthreads.

This may be a question we will have to iterate on to get right, and I apologize for that, but this is what I know right now. Feel free to ask for clarifications.

Thanks for this. The table follows.

usec    --  tid     guard   begin       end         alloc   diff(MB)
52815   14  14786   4096    92549120    100941824   8392704 0
52481   14  14784   4096    100941824   109334528   8392704 0
51700   14  14777   4096    109334528   117727232   8392704 0
70747   14  14806   4096    117727232   126119936   8392704 8.00390625
75813   14  14824   4096    117727232   126119936   8392704 0
51464   14  14776   4096    126119936   134512640   8392704 8.00390625
76679   14  14833   4096    126119936   134512640   8392704 -4.51953125
53799   14  14791   4096    139251712   147644416   8392704 -4.90234375
52708   14  14785   4096    152784896   161177600   8392704 0
50912   14  14773   4096    161177600   169570304   8392704 0
51617   14  14775   4096    169570304   177963008   8392704 0
70028   14  14793   4096    177963008   186355712   8392704 0
51048   14  14774   4096    186355712   194748416   8392704 0
50596   14  14771   4096    194748416   203141120   8392704 8.00390625
Dervin Thunk
  • Out of curiosity, why do you care? – Nemo Jul 21 '11 at 00:20
  • Note that some of your threads seem to be running on the same stack. Does that make sense? – BjoernD Jul 21 '11 at 00:35
  • @BjoernD, sure, as long as one terminates before the other one starts running – bdonlan Jul 21 '11 at 00:36
  • @Nemo. Long story short, I'm doing some research on parallel computing. I was looking at the Fibonacci code I linked to in this question. It gave me a SIGSEGV and I thought (and still think) it is because of the stack. Suddenly I realized I didn't know how the stack in Linux works, and **needed** to know... y'know... scientists, but the more I went into it, the less I understood. That's it. – Dervin Thunk Jul 21 '11 at 01:51
  • @BjoernD: bdonlan is right. I *assume* that address has already been freed when the thread returned. – Dervin Thunk Jul 21 '11 at 01:52

2 Answers


First, by stracing a simple test program that launches a single thread, we can see the syscalls used to create it. Here's the test program:

#include <pthread.h>
#include <stdio.h>

void *test(void *x) { return NULL; }

int main() {
        pthread_t thr;
        printf("start\n");
        pthread_create(&thr, NULL, test, NULL);
        pthread_join(thr, NULL);
        printf("end\n");
        return 0;
}
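
(To reproduce this, something like `gcc -pthread test.c -o test` followed by `strace ./test` is enough, assuming the program above is saved as test.c.)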

And the relevant portion of its strace output:

write(1, "start\n", 6start
)                  = 6
mmap2(NULL, 8392704, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_STACK, -1, 0) = 0xf6e32000
brk(0)                                  = 0x8915000
brk(0x8936000)                          = 0x8936000
mprotect(0xf6e32000, 4096, PROT_NONE)   = 0
clone(child_stack=0xf7632494, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0xf7632bd8, {entry_number:12, base_addr:0xf7632b70, limit:1048575, seg_32bit:1, contents:0, read_exec_only:0, limit_in_pages:1, seg_not_present:0, useable:1}, child_tidptr=0xf7632bd8) = 9181
futex(0xf7632bd8, FUTEX_WAIT, 9181, NULL) = -1 EAGAIN (Resource temporarily unavailable)
write(1, "end\n", 4end
)                    = 4
exit_group(0)                           = ?

We can see that it obtains a stack from mmap with PROT_READ|PROT_WRITE protection and MAP_PRIVATE|MAP_ANONYMOUS|MAP_STACK flags. It then uses mprotect to make the first (i.e., lowest) page of the stack inaccessible, so that stack overflows are detected. The rest of the calls aren't really relevant to the discussion at hand.
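
If you want to see that pattern in isolation, here's roughly the same sequence done by hand (just a sketch; glibc's real allocation code derives the size and protections from the thread attributes):

#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>

#define STACK_SIZE (8 * 1024 * 1024 + 4096)     /* 8 MiB of stack plus one guard page
                                                   = 8392704, the size seen in the trace */

int main(void) {
        /* Anonymous, private mapping; MAP_STACK hints that it will be used as a stack. */
        void *stack = mmap(NULL, STACK_SIZE, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS | MAP_STACK, -1, 0);
        if (stack == MAP_FAILED) {
                perror("mmap");
                return 1;
        }

        /* Make the lowest page inaccessible so running off the end of the stack faults. */
        if (mprotect(stack, 4096, PROT_NONE) != 0) {
                perror("mprotect");
                return 1;
        }

        printf("stack: %p .. %p\n", stack, (void *)((char *)stack + STACK_SIZE));
        munmap(stack, STACK_SIZE);
        return 0;
}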

So how does mmap allocate the stack, then? Well, let's start at mmap_pgoff in the Linux kernel, the entry point for the modern mmap2 syscall. It delegates to do_mmap_pgoff after taking some locks. This then calls get_unmapped_area to find an appropriate range of unmapped pages.

Unfortunately, this then calls a function pointer set on the process's mm - this is probably so that 32-bit and 64-bit processes can have different ideas of which addresses can be mapped. On x86, that pointer is set in arch_pick_mmap_layout, which switches based on whether the process uses a 32-bit or 64-bit layout.

So let's look at the implementation of arch_get_unmapped_area then. It first gets some reasonable defaults for its search from find_start_end, then tests to see if the address hint passed in is valid (for thread stacks, no hint is passed). It then starts scanning through the virtual memory map, starting from a cached address, until it finds a hole. It saves the end of the hole for use in the next search, then returns the location of this hole. If it reaches the end of the address space, it starts again from the start, giving it one more chance to find an open area.

So, as you can see, normally it will assign stacks at increasing addresses (for x86; x86-64 uses arch_get_unmapped_area_topdown and will likely assign them at decreasing addresses). However, it also keeps a cache of where to start searching, so it might leave gaps depending on when areas are freed. In particular, when an mmaped area is freed, it might update the free-address-search cache, so you may see out-of-order allocations there as well.
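
To make the shape of that search concrete, here is a toy, self-contained model of the bottom-up first-fit scan with a cached start address (the type, constant and function names are invented for illustration; the real arch_get_unmapped_area walks the mm's VMA list rather than an array, but the control flow is roughly this):

#include <stddef.h>
#include <stdio.h>

struct mapping { unsigned long start, end; };    /* an existing mapping, [start, end) */

#define TASK_UNMAPPED_BASE 0x40000000UL          /* where the scan (re)starts */
#define TASK_SIZE          0xC0000000UL          /* top of the user address space */

static unsigned long free_area_cache = TASK_UNMAPPED_BASE;  /* persists between calls */

/* First existing mapping whose end lies above addr, or NULL if there is none. */
static struct mapping *first_above(struct mapping *m, size_t n, unsigned long addr)
{
        struct mapping *best = NULL;
        for (size_t i = 0; i < n; i++)
                if (m[i].end > addr && (!best || m[i].end < best->end))
                        best = &m[i];
        return best;
}

static unsigned long pick_unmapped_area(struct mapping *m, size_t n, unsigned long len)
{
        unsigned long addr = free_area_cache;    /* resume where the last search stopped */
        int wrapped = 0;

        for (;;) {
                if (addr + len > TASK_SIZE) {    /* fell off the end of the address space */
                        if (wrapped)
                                return 0;        /* second pass failed too: no room */
                        addr = TASK_UNMAPPED_BASE;  /* one more chance, from the bottom */
                        wrapped = 1;
                        continue;
                }
                struct mapping *vma = first_above(m, n, addr);
                if (!vma || addr + len <= vma->start) {  /* found a hole big enough */
                        free_area_cache = addr + len;    /* next search starts past it */
                        return addr;
                }
                addr = vma->end;                 /* skip over this mapping and keep scanning */
        }
}

int main(void) {
        struct mapping maps[] = { { 0x40000000UL, 0x40801000UL } };  /* one 8 MiB + guard stack */
        unsigned long a = pick_unmapped_area(maps, 1, 0x801000UL);
        printf("next stack would go at %#lx\n", a);  /* lands directly above the existing one */
        return 0;
}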

That said, this is all an implementation detail. Do not rely on any of this in your program. Just take what addresses mmap hands out and be happy :)

bdonlan
  • I think between us we identified all of the relevant source code. Granted, it's 90% you :-) – Nemo Jul 21 '11 at 00:36
  • @Nemo, I was going for the pthreads source too, but somehow ended up in linuxthreads and decided just to strace the thing :) – bdonlan Jul 21 '11 at 00:37
  • By the way, for kernel source references, you can link directly to [Linus's git tree](http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=tree). Which may or may not be what you want. – Nemo Jul 21 '11 at 01:00
  • @Nemo, I find LXR is nicer, as it crossreferences things for you :) – bdonlan Jul 21 '11 at 01:10
  • This is a great explanation... it's helped me a lot. Thanks. I'm not planning to use this as part of any code. I just need to understand how exactly the stack segment works. – Dervin Thunk Jul 21 '11 at 01:54
  • +1 great answer for 3 reasons: (1) recommending `strace` as the way to find the answer yourself, (2) detailed info about kernel internals, (3) advice not to rely on any of this :-) – R.. GitHub STOP HELPING ICE Jul 21 '11 at 06:53

glibc handles this in nptl/allocatestack.c.

Key line is:

mem = mmap (NULL, size, prot,
            MAP_PRIVATE | MAP_ANONYMOUS | MAP_STACK, -1, 0);

So it just asks the kernel for some anonymous memory, much as malloc does for large blocks. Which addresses it actually gets is up to the kernel...

Nemo