Who creates and owns the call stack, and how does the call stack work with multiple threads?

Question

I know that each thread usually has one call stack, which is just a chunk of memory managed through the esp and ebp registers.

1. How are these call stacks created, and who is responsible for creating them? My guess is the runtime, for example the Swift runtime for an iOS application. And does a thread talk to its own call stack directly through esp and ebp, or does it go through the runtime?

2. Each call stack has to work with the esp and ebp CPU registers. If I have a CPU with 2 cores and 4 threads, let's say it effectively has 4 logical cores (4 register sets). Does that mean each call stack works only with the registers of one specific core?

Each thread has its own stack, allocated in the process's address space by `mmap` or something. And yes, each software thread has its own architectural state, including RSP on x86-64, or [`sp` on AArch64](http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0801a/BABFAJBA.html) (or ESP if you build obsolete 32-bit x86 code). I assume frame pointers are optional for Swift. And yes, each logical core has its own architectural state (registers); software threads are context-switched onto HW logical cores. — Peter Cordes, Oct 16 '19 at 04:12
To follow up on @PeterCordes' comment, each thread has its own call stack, independently of the number of cores and/or hyper-threading. In the case of Windows, threads may be switched between cores on a time-slice boundary or on any event-triggered wakeup, to balance the load. — rcgldr, Oct 16 '19 at 04:14
Don't forget that the stack pointer (what you called ESP) and the frame pointer (EBP) can be loaded and saved just like any other registers on pretty much all architectures. The complete state of a processor, including its ESP and EBP, is called the "execution context." Switching a CPU from one thread to another is simply a matter of saving the execution context of the current thread and restoring the execution context of a different thread. Also, when a thread's context is restored, it does not necessarily have to be restored to the same processor that it was saved from. — Solomon Slow, Oct 16 '19 at 14:06

Peter Cordes · Answer 1 · 2019-10-16T04:32:25.650


(I'm assuming Swift threading is just like threads in other languages. There really aren't many good options: either normal OS-level threads, user-space "green threads", or a mix of both. The difference is only where context switches happen; the main concepts are the same.)

Each thread has its own stack, allocated in the process's address space with `mmap` or something similar, either by the parent thread or perhaps by the same system call that creates the thread. I don't know the iOS system calls. On Linux you have to pass a `void *child_stack` to the Linux-specific `clone(2)` system call that actually creates a new thread. It's very rare to use low-level OS-specific system calls directly; a language runtime would probably build its threading on top of pthreads functions like `pthread_create`, and the pthreads library would handle the OS-specific details.
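To make the "user space allocates the stack, the thread library hands it over" idea concrete, here is a minimal C sketch using standard pthreads calls (`pthread_attr_setstack`, `pthread_create`) on a block obtained from `mmap`. It assumes a Linux-like POSIX system; the names `run_on_custom_stack`, `report_local_address` and `STACK_SIZE` are made up for this illustration, and the 1 MiB size is arbitrary.

```c
#define _GNU_SOURCE
#include <pthread.h>
#include <stdint.h>
#include <sys/mman.h>

#define STACK_SIZE (1u << 20) /* 1 MiB, an arbitrary size for this demo */

/* Thread body: report the address of one of its own local variables. */
static void *report_local_address(void *out)
{
    int local = 0;
    *(uintptr_t *)out = (uintptr_t)&local;
    return NULL;
}

/* Runs a thread on a stack that we allocate ourselves.
 * Returns 1 if the thread's local variable lived inside our mapping,
 * 0 if it did not, and -1 on any setup error. */
int run_on_custom_stack(void)
{
    void *stack = mmap(NULL, STACK_SIZE, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (stack == MAP_FAILED)
        return -1;

    pthread_attr_t attr;
    pthread_t tid;
    uintptr_t seen = 0;
    int result = -1;

    if (pthread_attr_init(&attr) == 0) {
        /* Hand the lowest address and the size of our block to pthreads. */
        if (pthread_attr_setstack(&attr, stack, STACK_SIZE) == 0 &&
            pthread_create(&tid, &attr, report_local_address, &seen) == 0 &&
            pthread_join(tid, NULL) == 0) {
            uintptr_t lo = (uintptr_t)stack;
            result = (seen >= lo && seen < lo + STACK_SIZE);
        }
        pthread_attr_destroy(&attr);
    }
    munmap(stack, STACK_SIZE); /* the thread has been joined, so this is safe */
    return result;
}
```

Calling `run_on_custom_stack()` from your own `main` should return 1 on a typical system: the new thread's locals end up inside the block that the creating thread mapped. When no stack is supplied, the pthreads library does an equivalent allocation on your behalf.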

And yes, each software thread has its own architectural state, including RSP on x86-64, or `sp` on AArch64 (or ESP if you build obsolete 32-bit x86 code). I assume frame pointers are optional for Swift.

And yes, each logical core has its own architectural state (registers, including a stack pointer); a software thread runs on a logical core, and context switches between software threads save/restore registers. Related, maybe a duplicate of "What resources are shared between threads?".
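The save/restore idea can be tried out in user space, in the "green thread" style mentioned above. The following is only a sketch, assuming Linux with glibc, where the `ucontext` API (`getcontext`, `makecontext`, `swapcontext`) is available; that API is obsolescent in POSIX and deprecated on macOS. The names `switch_demo`, `coroutine`, `co_stack` and `steps` are invented for the example. A kernel context switch between OS threads does conceptually the same thing, but inside the kernel.

```c
#include <stdint.h>
#include <ucontext.h>

#define CO_STACK_SIZE (64 * 1024) /* arbitrary stack size for the second context */

static ucontext_t main_ctx, co_ctx;   /* saved register sets ("execution contexts") */
static char co_stack[CO_STACK_SIZE];  /* memory used as the second context's stack */
static uintptr_t co_local_addr;
static int steps;

static void coroutine(void)
{
    int local = 0;
    co_local_addr = (uintptr_t)&local;
    steps = 1;
    swapcontext(&co_ctx, &main_ctx); /* save our registers, resume the caller */
    steps = 2;                       /* execution continues here when resumed */
}

/* Returns 1 if the coroutine ran in two steps on its own stack, -1 on error. */
int switch_demo(void)
{
    if (getcontext(&co_ctx) != 0)
        return -1;
    co_ctx.uc_stack.ss_sp = co_stack;
    co_ctx.uc_stack.ss_size = sizeof co_stack;
    co_ctx.uc_link = &main_ctx;      /* context to resume when coroutine returns */
    makecontext(&co_ctx, coroutine, 0);

    steps = 0;
    if (swapcontext(&main_ctx, &co_ctx) != 0 || steps != 1)
        return -1;                   /* first switch: runs until the yield */
    if (swapcontext(&main_ctx, &co_ctx) != 0 || steps != 2)
        return -1;                   /* second switch: resumes after the yield */

    uintptr_t lo = (uintptr_t)co_stack;
    return co_local_addr >= lo && co_local_addr < lo + sizeof co_stack;
}
```

Each `swapcontext` call stores the current registers, stack pointer included, in one structure and loads them from another. That is why the coroutine's local variable lives in `co_stack` while the caller keeps using its original stack.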

Software threads share the same page tables (virtual address space), but not registers.
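A small C sketch of that last point, assuming any POSIX system with pthreads (the names `compare_threads`, `observe` and `shared_global` are illustrative only): a global has the same address in every thread because the address space is shared, while a local variable of the same function lands at a different address depending on which thread's stack the call runs on.

```c
#include <pthread.h>
#include <stdint.h>

static int shared_global = 5;

struct observation {
    uintptr_t global_addr; /* where this thread saw the global */
    uintptr_t local_addr;  /* where this thread's local variable lived */
};

/* Records the addresses of a global and of a local variable. */
static void *observe(void *arg)
{
    struct observation *obs = arg;
    int local = 0;
    obs->global_addr = (uintptr_t)&shared_global;
    obs->local_addr = (uintptr_t)&local;
    return NULL;
}

/* Returns 1 when the calling thread and a new thread agree on the global's
 * address but used different addresses for the local; -1 on error. */
int compare_threads(void)
{
    struct observation here, there;
    pthread_t tid;

    observe(&here); /* runs on the calling thread's stack */
    if (pthread_create(&tid, NULL, observe, &there) != 0)
        return -1;
    if (pthread_join(tid, NULL) != 0)
        return -1;  /* otherwise `there` was filled in on the new thread's stack */

    return here.global_addr == there.global_addr &&
           here.local_addr != there.local_addr;
}
```

The calling thread is still alive while the new thread runs, so the two stacks must occupy different memory and `compare_threads()` returns 1.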

edited Oct 16 '19 at 04:32

answered Oct 16 '19 at 04:16


Is it the OS that creates the stack, or the runtime? – Tony Lin Oct 16 '19 at 04:20
"Each thread has its own stack, allocated in the process's address space by mmap or something."... by the operating system, and the OS associates the stack with the thread. – Erik Eidt Oct 16 '19 at 04:21
@TonyLin: I don't know iOS system calls. On Linux you'd normally `mmap` an 8MB block of memory and pass it to the thread-creation system call for it to create the thread with the stack pointer pointing at the top of it. But any wrapper library for threads would handle that for you. It's possible that on iOS there's a system call that creates a thread and allocates a stack for it, that would be a reasonable design if you pass a NULL pointer for the initial thread-stack or something. – Peter Cordes Oct 16 '19 at 04:22
@ErikEidt: "the OS associates the stack with the thread". I don't think that's the case in Linux; is that true in MacOS or iOS? In Linux, I don't think the kernel has anything that keeps track of which thread is using which allocated region of memory. Obviously the saved SP / RSP value will point somewhere in there... unless user-space is temporarily using RSP as scratch space instead of a stack pointer. But other than that I don't think there's anything. You can use `sigaltstack` (together with the `SA_ONSTACK` flag of `sigaction`) to tell the kernel about a different stack for signal handling; otherwise it defaults to the current RSP. – Peter Cordes Oct 16 '19 at 04:26
@ErikEidt: or does [Linux `clone(2)`](http://man7.org/linux/man-pages/man2/clone.2.html) save the `void *child_stack` somewhere? I'm not sure it would need to, but it might. Anyway, thread stacks could all be separate chunks of a larger allocation, but then there's no guard page to stop stack clashes. – Peter Cordes Oct 16 '19 at 04:29
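The stack size and guard page mentioned in these comments are properties that the pthreads library tracks in its thread attributes. As a hedged illustration, the C sketch below reads the values a default-attribute thread would get, using the POSIX calls `pthread_attr_getstacksize` and `pthread_attr_getguardsize`. The function name `default_stack_sizes` is made up for the example, and the actual numbers depend on the platform.

```c
#include <pthread.h>
#include <stddef.h>

/* Reads the stack size and guard size that a thread created with default
 * attributes would get. Returns 0 on success, -1 on error. */
int default_stack_sizes(size_t *stack_size, size_t *guard_size)
{
    pthread_attr_t attr;
    if (pthread_attr_init(&attr) != 0)
        return -1;

    int ok = pthread_attr_getstacksize(&attr, stack_size) == 0 &&
             pthread_attr_getguardsize(&attr, guard_size) == 0;

    pthread_attr_destroy(&attr);
    return ok ? 0 : -1;
}
```

On a typical Linux desktop with glibc, the reported stack size is often 8 MiB (it follows the stack resource limit) and the guard size is one page, but treat both as system-dependent rather than guaranteed.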

Kamil.S · Accepted Answer · 2019-10-21T11:24:10.620

The XNU kernel does it. Swift threads are POSIX pthreads, each backed by a Mach thread. During program startup, the XNU kernel parses the Mach-O executable and handles either the modern `LC_MAIN` or the legacy `LC_UNIXTHREAD` load command (among others). This is handled in the following kernel functions:

```c
static
load_return_t
load_main(
        struct entry_point_command  *epc,
        thread_t        thread,
        int64_t             slide,
        load_result_t       *result
    )
```

and

```c
static
load_return_t
load_unixthread(
    struct thread_command   *tcp,
    thread_t        thread,
    int64_t             slide,
    load_result_t       *result
)
```

which happen to be open source.

`LC_MAIN` initialises the stack through `thread_userstackdefault`.

`LC_UNIXTHREAD` does so through `load_threadstack`.

As @PeterCordes mentions in the comments, only once the kernel has created the main thread can the started process spawn child threads from its own main thread, either through some API like GCD or directly through a syscall (`bsdthread_create`, not sure if there are any others). The syscall happens to take `user_addr_t stack` as its 3rd argument (i.e. `rdx` in the x86-64 System V kernel ABI used by macOS). Reference for macOS syscalls
I haven't thoroughly investigated the details of this particular stack argument, but I would imagine it's similar to the `thread_userstackdefault` / `load_threadstack` approach.

I believe your doubt about the Swift runtime's responsibility may arise from frequent mentions of data structures (like a Swift struct, no pun intended) being stored on the stack (which, by the way, is an implementation detail and not a guaranteed feature of the runtime).

Update:
Here's an example main.swift command-line program illustrating the idea.

```swift
import Foundation

struct testStruct {
    var a: Int
}

class testClass {
}

func testLocalVariables() {
    print("main thread function with local varablies")
    var struct1 = testStruct(a: 5)
    withUnsafeBytes(of: &struct1) { print($0) }
    var classInstance = testClass()
    print(NSString(format: "%p", unsafeBitCast(classInstance, to: Int.self)))
}
testLocalVariables()

print("Main thread", Thread.isMainThread)
var struct1 = testStruct(a: 5)
var struct1Copy = struct1

withUnsafeBytes(of: &struct1) { print($0) }
withUnsafeBytes(of: &struct1Copy) { print($0) }

var string = "testString"
var stringCopy = string

withUnsafeBytes(of: &string) { print($0) }
withUnsafeBytes(of: &stringCopy) { print($0) }

var classInstance = testClass()
var classInstanceAssignment = classInstance
var classInstance2 = testClass()

print(NSString(format: "%p", unsafeBitCast(classInstance, to: Int.self)))
print(NSString(format: "%p", unsafeBitCast(classInstanceAssignment, to: Int.self)))
print(NSString(format: "%p", unsafeBitCast(classInstance2, to: Int.self)))

DispatchQueue.global(qos: .background).async {
    print("Child thread", Thread.isMainThread)
    withUnsafeBytes(of: &struct1) { print($0) }
    withUnsafeBytes(of: &struct1Copy) { print($0) }
    withUnsafeBytes(of: &string) { print($0) }
    withUnsafeBytes(of: &stringCopy) { print($0) }
    print(NSString(format: "%p", unsafeBitCast(classInstance, to: Int.self)))
    print(NSString(format: "%p", unsafeBitCast(classInstanceAssignment, to: Int.self)))
    print(NSString(format: "%p", unsafeBitCast(classInstance2, to: Int.self)))
}

// Keep the main thread alive indefinitely so that the process doesn't exit
CFRunLoopRun()
```

My output looks like this:

```
main thread function with local varablies
UnsafeRawBufferPointer(start: 0x00007ffeefbfeff8, count: 8)
0x7fcd0940cd30
Main thread true
UnsafeRawBufferPointer(start: 0x000000010058a6f0, count: 8)
UnsafeRawBufferPointer(start: 0x000000010058a6f8, count: 8)
UnsafeRawBufferPointer(start: 0x000000010058a700, count: 16)
UnsafeRawBufferPointer(start: 0x000000010058a710, count: 16)
0x7fcd0940cd40
0x7fcd0940cd40
0x7fcd0940c900
Child thread false
UnsafeRawBufferPointer(start: 0x000000010058a6f0, count: 8)
UnsafeRawBufferPointer(start: 0x000000010058a6f8, count: 8)
UnsafeRawBufferPointer(start: 0x000000010058a700, count: 16)
UnsafeRawBufferPointer(start: 0x000000010058a710, count: 16)
0x7fcd0940cd40
0x7fcd0940cd40
0x7fcd0940c900
```

Now we can observe a couple of interesting things:

1. Class instances clearly occupy a different part of memory than structs.
2. Assigning a struct to a new variable makes a copy at a new memory address.
3. Assigning a class instance just copies the pointer.
4. Both the main thread and the child thread, when referring to global structs, point to exactly the same memory.
5. Strings do have a struct container.

Update 2, proof of point 4 above. We can actually inspect the memory underneath:

```
x 0x10058a6f0 -c 8
0x10058a6f0: 05 00 00 00 00 00 00 00                          ........
x 0x10058a6f8 -c 8
0x10058a6f8: 05 00 00 00 00 00 00 00                          ........
```

So this definitely is the actual struct raw data i.e. the struct itself.

Update 3

I added a `testLocalVariables()` function to distinguish between Swift structs defined as global variables and as local variables. In this case

```
x 0x00007ffeefbfeff8 -c 8
0x7ffeefbfeff8: 05 00 00 00 00 00 00 00                          ........
```

it clearly lives on the thread stack.

Last but not least, when I do this in lldb:

```
re read rsp
rsp = 0x00007ffeefbfefc0  from main thread
re read rsp
rsp = 0x000070000291ea40  from child thread
```

it yields a different value for each thread, so the thread stacks are clearly distinct.

Digging further
There's a handy `memory region` lldb command which sheds some light on what's going on.

```
memory region 0x000000010058a6f0
[0x000000010053d000-0x000000010058b000) rw- __DATA
```

So global structs sit in a preallocated, writable `__DATA` memory page of the executable (the same one where your global variables live). The same command for the class address 0x7fcd0940cd40 isn't as spectacular (I reckon because that's dynamically allocated heap memory). The same goes for the thread stack address 0x7ffeefbfefc0, which clearly isn't part of the executable's own segments.

Fortunately there is one last tool to go further down the rabbit hole:
`vmmap -v -purge pid`, which does confirm that class instances sit in the MALLOC-ed heap, and likewise a thread stack (at least for the main thread) can be cross-referenced to Stack.

Somewhat related question is also here.

HTH

Are you saying Swift threads aren't just started by the main thread after process creation? Like that it's visible in the Mach-O metadata, without disassembling the code and finding calls to `pthread_create`, and the kernel starts them for you as part of process creation? In the normal POSIX model, `main` or its children create additional threads *after* process startup. — Peter Cordes, Oct 16 '19 at 19:22
Or is `LC_MAIN` vs. `LC_UNIXTHREAD` just a different style for the main / initial thread? (with additional threads created manually via a system call.) — Peter Cordes, Oct 16 '19 at 19:24
@PeterCordes they are started by the main thread exactly as you say. It doesn't really matter if they are Swift or not, it's exactly the same BSD mechanism underneath. I'll update the subsequent threads part in my answer to clear the confusion for future readers — Kamil.S, Oct 16 '19 at 19:46
Ah I see now. You were talking about creating the stack for the *main* thread in most of your answer. I ignored that completely in my answer, because it needs to be created before entering user-space. (The SysV ABI passes argc, argv, and envp to user-space *on* the stack; the stack-pointer at the process entry point points to `argc`, not to a return address. Otherwise it would be theoretically possible (but horrible) to require the main thread to create a stack for itself with `mmap` and `lea rsp, [rax + 8MB]`, except in 32-bit BSD where the system-calling convention passes args on stack) — Peter Cordes, Oct 16 '19 at 19:51
@Kamil.S Everything made a lot of sense until I saw that the `struct` in Swift is stored on a stack that is different from the thread stacks and is the process's Swift runtime stack. Could you please elaborate on this, or share some links about it? I am more confused now, because I used to think there is only one type of stack, the `thread stack`. — Tony Lin, Oct 16 '19 at 23:29
@Kamil.S Thanks a lot for your great answer and experiment. I have a couple of questions about the test you did: 1. Why do you call `Dispatch.main.async` the `child thread`? It seems to run on the main thread, which should be the same as outside the block. 2. Why does assigning a `struct` to a new variable make a copy at a new address? I remember Swift has something called `copy on write`, meaning it only makes a copy when something changes. 3. My biggest confusion is still why a struct has the same address in different threads unless it's the same stack. What about people saying that different threads have different stacks? — Tony Lin, Oct 21 '19 at 09:00
@TonyLin Ad1 you're very right about `Dispatch.main.async` , I corrected to `DispatchQueue.global(qos: .background).async {` , the results are the same though. Ad2 I'll provide more info about `copy on write` in action, just need some time Ad3 check update2 — Kamil.S, Oct 21 '19 at 10:45
@TonyLin I may have drawn some misleading conclusions, which I tried to iron out with update 3. However, the main conclusion is that we cannot treat "a `Struct` lives on the (thread) stack" as a mantra, as clearly that might not be the case. — Kamil.S, Oct 21 '19 at 11:22
