
I want to understand what multi-threaded code looks like after compilation and how the CPU executes it (assume the machine has a single-core CPU). Consider the following toy example:

#include <pthread.h>
#include <stdio.h>

// Computes a large Fibonacci number and prints it.
static void* Fibonacci(void* arguments) {
    ...
}

// Reads user input from the terminal and prints it.
static void* UserInput(void* arguments) {
    ...
}

int main() {
    pthread_t prime_thread;
    pthread_t input_thread;
    pthread_create(&prime_thread, NULL, Fibonacci, NULL);
    pthread_create(&input_thread, NULL, UserInput, NULL);
    pthread_join(prime_thread, NULL);
    pthread_join(input_thread, NULL);
    return 0;
}
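
The elided bodies could look something like this (a hypothetical sketch of my own; the exact implementations do not matter for the question):

// Hypothetical body: naive recursion keeps the CPU busy for a long time.
static unsigned long Fib(unsigned long n) {
    return n < 2 ? n : Fib(n - 1) + Fib(n - 2);
}

static void* Fibonacci(void* arguments) {
    printf("fib(45) = %lu\n", Fib(45));
    return NULL;
}

// Hypothetical body: blocks inside a read system call until input arrives.
static void* UserInput(void* arguments) {
    char buffer[256];
    if (fgets(buffer, sizeof(buffer), stdin) != NULL)
        printf("you typed: %s", buffer);
    return NULL;
}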

I created 2 threads: one to do the CPU-intensive computation of Fibonacci numbers, and another to wait for user input. When I compile the code with gcc main.c -pthread, everything gets compiled into a single executable binary file. Thus, I assume that after starting this program, the CPU will execute the instructions written in that binary one by one, with possible jumps to subroutines.

I've checked the assembly code for this program (a listing in this style can be reproduced with something like gcc -S -O2 -masm=intel -fverbose-asm main.c -pthread), and in a nutshell it looks like this:

_ZL9FibonacciPv:
    # Fibonacci function implementation
    ...
_ZL9UserInputPv:
    # UserInput function implementation
    ...
main:
    # If I understand correctly, here we prepare the arguments
    # (mainly the pointer to the Fibonacci function) to create a pthread_t
    sub rsp, 40 #,
    lea rdx, _ZL9FibonacciPv[rip]   #,
    xor ecx, ecx    #
    lea rdi, 8[rsp] # tmp90,
    xor esi, esi    #
    mov rax, QWORD PTR fs:40    # tmp95,
    mov QWORD PTR 24[rsp], rax  # D.4751, tmp95
    xor eax, eax    # tmp95

    # Here we create a pthread_t to execute the Fibonacci function
    call    pthread_create@PLT  #

    # Here we prepare the arguments for another pthread_t
    lea rdx, _ZL9UserInputPv[rip]   #,
    lea rdi, 16[rsp]    # tmp91,
    xor ecx, ecx    #
    xor esi, esi    #

    # And create a second pthread_t to execute the UserInput function
    call    pthread_create@PLT  #

    # Here we do something to join the threads
    mov rdi, QWORD PTR 8[rsp]   #, prime_thread
    xor esi, esi    #
    call    pthread_join@PLT    #
    mov rdi, QWORD PTR 16[rsp]  #, input_thread
    xor esi, esi    #
    call    pthread_join@PLT    #

    # Main function terminates
    xor eax, eax    #
    add rsp, 40 #,
    ret

What confuses me is the following:

These assembly instructions will be executed one by one. I assume the call pthread_create@PLT and call pthread_join@PLT instructions will eventually return to their call sites. Thus, in the end, the program counter will be set to execute these final 3 instructions:

xor eax, eax
add rsp, 40
ret

which indicate the exit of the main function. I see no parallelism in this code execution, so how do the 2 threads get executed simultaneously here? Does it mean that after the final ret instruction the program counter is set to some memory address invisible in this assembly code, and the program does not actually terminate but starts executing those threads?
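
One way I tried to experiment with the single-core case on Linux is to pin the whole process to one CPU, so that the kernel must time-slice the two threads on that core. This is only a sketch (the Worker function and the use of sched_setaffinity are my illustration, not part of the program above):

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

static void* Worker(void* arguments) {
    // Each thread repeatedly prints its label.
    for (int i = 0; i < 5; ++i)
        printf("thread %s: iteration %d\n", (const char*)arguments, i);
    return NULL;
}

int main() {
    // Restrict the entire process to CPU 0: at most one thread can be
    // physically executing at any instant, yet both still make progress.
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(0, &set);
    sched_setaffinity(0, sizeof(set), &set);

    pthread_t a, b;
    pthread_create(&a, NULL, Worker, "A");
    pthread_create(&b, NULL, Worker, "B");
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    return 0;
}

Compiled with gcc demo.c -pthread, the output lines from A and B may interleave depending on when the scheduler preempts each thread, even though only one instruction stream is running at any moment.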

mercury0114
  • The 2 threads achieve parallelism by running on hardware, not through software. You won't be able to read the parallelism out of the code; all of this code is loaded into the same address space, where the OS provides you with the construct of a thread, which is just an abstraction for running something with a different program stack (either mimicking parallelism via time slicing on one core or via true multi-core execution) – PragmaticProgrammer Oct 21 '20 at 13:04
  • Have you read [What does multicore assembly language look like?](https://stackoverflow.com/q/980999) Is your question a duplicate of it? `call pthread_create` results in a system call that starts code running on another core, that's why its args include a function pointer. – Peter Cordes Oct 21 '20 at 13:37
  • Unrelated to your question about thread-level parallelism, but *These assembly instructions will be executed one by one.* - Logically yes, but physically modern CPUs are superscalar, and a single core will look for [cases where it can run multiple instructions in parallel](https://softwareengineering.stackexchange.com/questions/349972/how-does-a-single-thread-run-on-multiple-cores/350024#350024) without breaking the illusion of running serially. This is called "instruction-level parallelism", and finding + exploiting it is how CPUs achieve a high instructions-per-cycle (IPC). – Peter Cordes Oct 21 '20 at 13:42
  • Have a look at the concept of [interrupts](https://en.wikipedia.org/wiki/Interrupt), [preemption](https://en.wikipedia.org/wiki/Preemption_(computing)), and [time slices](https://en.wikipedia.org/wiki/Preemption_(computing)#Time_slice), and also note that multicore processors have multiple program counters. – Erik Eidt Oct 21 '20 at 13:58
  • @PeterCordes no, it's not a duplicate; that question is considering a multi-core machine. I updated my question; here I am interested in a single core. – mercury0114 Oct 21 '20 at 15:29
  • Do you know how an OS's task scheduler works? https://en.wikipedia.org/wiki/Preemption_(computing)#Preemptive_multitasking `pthread_create` really just creates another software thread, then `pthread_join` waits for it, if it hasn't exited on its own before the main thread runs that. – Peter Cordes Oct 21 '20 at 23:42
  • Hmmm... I looked at the three dupe Q&A links. I cannot say that I was impressed by them :( The middle one has a +8 answer that is fixated on the timer interrupt, even though it is perfectly possible (though not particularly practical) to build a preemptive tasker without a timer, (i.e. relying only on I/O completion interrupts). – Martin James Oct 22 '20 at 07:29
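
As the comments point out, pthread_create ultimately performs a system call (clone on Linux) that registers a second execution context, with its own stack and its own saved program counter, for the kernel scheduler to switch between. A minimal sketch of calling clone(2) directly, assuming Linux and glibc (child_fn and STACK_SIZE are my illustration):

#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>

#define STACK_SIZE (1024 * 1024)

static int child_fn(void* arg) {
    printf("child running in the same address space\n");
    return 0;
}

int main() {
    char* stack = malloc(STACK_SIZE);
    if (stack == NULL)
        return 1;
    // CLONE_VM shares the parent's address space -- the essence of a
    // thread; the child gets its own stack and program counter.
    // (pthread_create passes more flags and manages the stack for you.)
    pid_t pid = clone(child_fn, stack + STACK_SIZE, CLONE_VM | SIGCHLD, NULL);
    if (pid == -1)
        return 1;
    waitpid(pid, NULL, 0);  // a crude analogue of pthread_join
    free(stack);
    return 0;
}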

0 Answers