
I have task-scheduling code that I want to compare against a baseline that simply creates a new pthread for each task (I know that's not a great idea, but that's why it's just the baseline for comparison). However, for some reason the pthreads version keeps giving me segfaults on OS X¹, while the same code runs fine on Linux².

On OS X, it occasionally completes successfully, but it usually segfaults in pthread_create, and sometimes segfaults in pthread_join instead. I also found that if I call pthread_create with the PTHREAD_CREATE_DETACHED attribute and skip the pthread_joins, then the segfault problem goes away.

Included at the bottom of this question is a stripped-down version of the code that I tried to minimize as much as possible while still causing the problematic segfaults.


My question is the following:

Why is this crashing on OS X, but not on Linux?


Maybe there's a bug that I'm overlooking that happens to be benign on Linux. I'm pretty certain that the mutex and CAS operations are providing sufficient synchronization, so I don't think this is a data race issue.

Like I said, I can work around this by using PTHREAD_CREATE_DETACHED, but I'm really curious about the root cause of the segfaults. My feeling is that I'm overwhelming some system resource limit that isn't being freed fast enough when I require threads to be joined, and that the problem goes away for detached pthreads because they can be destroyed immediately when the thread exits; however, I'm not familiar enough with the pthread internals to confirm or refute that hypothesis.

Here's a general outline of how the code works:

  • We have a stack of pthreads (accessed through wait_list_head) that are currently blocked awaiting a signal on a thread-specific condition variable.

  • The main thread creates a single child-thread, and then waits for all the transitive children to complete (by checking for the active thread counter to reach zero).

  • The child thread computes Fibonacci(N=10) by creating two sub-threads to compute Fibonacci(N-1) and Fibonacci(N-2), and then joins two threads, summing their results and returning that sum as its own result. That's how all the child threads work as well, with a base case of N<2 just returning N.

  • Note that the blocked-thread-stack semi-randomizes which threads are joined by the parent threads. I.e., a parent thread might join the children of one of its siblings rather than joining its own children; however, the final sum will still be the same thanks to the commutativity of integer addition. Removing this "randomizing" behavior by having each parent join its own children also eliminates the segfaults.

  • There's also a simple purely recursive Fibonacci implementation (pure_fib) that is used to compute the expected answer for validation.

Here's some pseudo-code of the core behavior:

Fibonacci(N):
    If N < 2:
        signal_parent(N)
    Else:
        sum = 0
        pthread_create(A, Fibonacci, N-1)
        pthread_create(B, Fibonacci, N-2)
        sum += suspend_and_join_child(); // not necessarily thread A
        sum += suspend_and_join_child(); // not necessarily thread B
        signal_parent(sum)

The minimal working example of the C code is included below.

¹ Apple LLVM version 7.0.0 (clang-700.1.76), Target: x86_64-apple-darwin14.5.0
² gcc (Ubuntu 5.4.0-6ubuntu1~16.04.2) 5.4.0 20160609


#include <assert.h>
#include <pthread.h>
#include <inttypes.h>
#include <stdio.h>
#include <stdlib.h>
#include <stdbool.h>
#include <unistd.h>

#define N 10

#define RCHECK(expr)                                     \
    do {                                                 \
        int _rcheck_expr_return_value = expr;            \
        if (_rcheck_expr_return_value != 0) {            \
            fprintf(stderr, "FAILED CALL: " #expr "\n"); \
            abort();                                     \
        }                                                \
    } while (0)

typedef struct wait_state_st {
    volatile intptr_t val;
    pthread_t other;
    pthread_mutex_t lock;
    pthread_cond_t cond;
    struct wait_state_st *next;
} wait_state;

wait_state *volatile wait_list_head = NULL;
volatile int active = 0;

static inline void push_thread(wait_state *ws) {
    do {
        ws->next = wait_list_head;
    } while (!__sync_bool_compare_and_swap(&wait_list_head, ws->next, ws));
}

static inline wait_state *pop_thread(void) {
    wait_state *ws, *next;
    do {
        ws = wait_list_head;
        while (!ws) {
            usleep(1000);
            ws = wait_list_head;
        }
        next = ws->next;
    } while (!__sync_bool_compare_and_swap(&wait_list_head, ws, next));
    assert(ws->next == next); // check for ABA problem
    ws->next = NULL;
    return ws;
}

intptr_t thread_suspend(int count) {
    intptr_t sum = 0;
    // WAIT TO BE WOKEN UP "count" TIMES
    for (int i = 0; i < count; i++) {
        wait_state ws;
        ws.val = -1;
        ws.other = pthread_self();
        RCHECK(pthread_mutex_init(&ws.lock, NULL));
        RCHECK(pthread_cond_init(&ws.cond, NULL));

        RCHECK(pthread_mutex_lock(&ws.lock));

        push_thread(&ws);

        while (ws.val < 0) {
            RCHECK(pthread_cond_wait(&ws.cond, &ws.lock));
        }

        assert(ws.other != pthread_self());
        RCHECK(pthread_join(ws.other, NULL));

        sum += ws.val;

        RCHECK(pthread_mutex_unlock(&ws.lock));
    }
    return sum;
}

void thread_signal(intptr_t x) {
    // wake up the suspended thread
    __sync_fetch_and_add(&active, -1);
    wait_state *ws = pop_thread();
    RCHECK(pthread_mutex_lock(&ws->lock));
    ws->val = x;
    ws->other = pthread_self();
    RCHECK(pthread_cond_signal(&ws->cond));
    RCHECK(pthread_mutex_unlock(&ws->lock));
}

void *fib(void *arg) {
    intptr_t n = (intptr_t)arg;
    if (n > 1) {
        pthread_t t1, t2;
        __sync_fetch_and_add(&active, 2);
        RCHECK(pthread_create(&t1, NULL, fib, (void *)(n - 1)));
        RCHECK(pthread_create(&t2, NULL, fib, (void *)(n - 2)));
        intptr_t sum = thread_suspend(2);
        thread_signal(sum);
    }
    else {
        thread_signal(n);
    }
    return NULL;
}

intptr_t pure_fib(intptr_t n) {
    if (n < 2) return n;
    return pure_fib(n-1) + pure_fib(n-2);
}

int main(int argc, char *argv[]) {
    printf("EXPECTED = %" PRIdPTR "\n", pure_fib(N));
    assert("START" && wait_list_head == NULL);

    active = 1;

    pthread_t t;
    RCHECK(pthread_create(&t, NULL, fib, (void *)N));

    while (active > 0) { usleep(100000); }
    intptr_t sum = thread_suspend(1);

    printf("SUM      = %" PRIdPTR "\n", sum);
    printf("DONE %p\n", wait_list_head);

    assert("END" && wait_list_head == NULL);

    return 0;
}

Update: This Gist contains a slight variation of the above code that uses a global mutex for all the thread push/pop operations, and thus avoids the possible ABA problem with the CAS above. This version of the code still segfaults regularly, but only about 30–50% of the time rather than 99% of the time like the code above.

Again, I feel like this must be an issue with the pthreads library running out of resources when threads aren't joined/destroyed quickly enough, but I don't know how to confirm that.

DaoWen
  • I ran your program under Valgrind, and it fails. Valgrind version 3.11.0 on Fedora 24, gcc 6.2.1. (Valgrind output too long to paste here as one comment.) ==7016== Thread 103: ==7016== Invalid read of size 1 ==7016== at 0x4E3FD67: pthread_create@@GLIBC_2.2.5 (pthread_create.c:700) ==7016== by 0x400EA4: fib (sss.c:95) ==7016== by 0x4E3F5C9: start_thread (pthread_create.c:333) ==7016== Address 0xe026d13 is not stack'd, malloc'd or (recently) free'd – Bjorn A. Oct 31 '16 at 20:06
  • Part 2: ==7016== Process terminating with default action of signal 11 (SIGSEGV) ==7016== Access not within mapped region at address 0xE026D13 ==7016== at 0x4E3FD67: pthread_create@@GLIBC_2.2.5 (pthread_create.c:700) ==7016== by 0x400EA4: fib (sss.c:95) ==7016== by 0x4E3F5C9: start_thread (pthread_create.c:333) – Bjorn A. Oct 31 '16 at 20:06
  • @BjornA. - Oh! That's interesting... I actually tried valgrind on an earlier version of this code, but apparently another bug in the code was preventing this segfault. Now I'm also seeing the segfault when running with valgrind. It looks like using valgrind on Linux perturbs the same schedule/resource issue that I'm seeing on OS X. – DaoWen Oct 31 '16 at 20:55
  • C is not Java or C#. `volatile` does not prevent data races or torn accesses; you need `_Atomic` for that. – EOF Nov 01 '16 at 00:20
  • @EOF - I'm well aware of that. `volatile` may not give those guarantees, but GCC's `__sync` atomic builtins do. If there's a specific line that you think is problematic on x86_64 with standard gcc/clang, then please point it out. – DaoWen Nov 01 '16 at 00:53
  • There's a timing issue. Running in lldb, I see a seg fault in the 2nd call to `pthread_create` in the fib function, faulting at a call to `OSSpinLockLock`. I wonder if the problem lies in `__sync_fetch_and_add` based on the answer [here](http://stackoverflow.com/questions/6786284/gcc-atomic-built-in-functions) – TheDarkKnight Nov 01 '16 at 10:12
  • @TheDarkKnight - What xcode version are you using? I'm just curious because I never see `OSSpinLockLock` in my backtraces. The question you link to is specifically about using a variable as the 2nd argument to `__sync_fetch_and_add`, whereas I'm using constants. Even if I had a weird race in my code, I still don't see how that could affect a call to `pthread_create` where all the arguments are local variables or constants—unless I have some global memory corruption going on somewhere (which, as far as I can tell, I don't). This really doesn't make sense to me, hence my posting here. – DaoWen Nov 01 '16 at 15:50
  • The seg fault happens, but not always, which points towards a race condition / timing issue. I am not using Xcode, but if you mean which version of clang is in use, I have Xcode 8.1. As for crashing on calling `OSSpinLockLock`, you can see the lldb output [here](http://pastebin.com/DZDHwuCV). The pc points to the call after the one that has been executed, at the bottom of the disassembly (line 97). Considering a seg fault means a reference to memory outside of the process's virtual memory map is occurring, its either a bad pointer dereference, referencing released mem, or mem corruption. – TheDarkKnight Nov 01 '16 at 16:53
  • @TheDarkKnight - Thanks for the trace. I'm not as familiar with lldb as I am with gdb, so I didn't know how to look at the disassembly. It's actually segfaulting just *after* the call to `OSSpinLockLock`. It looks like the call to `__bsdthread_create` is actually returning a bad pointer (in `%rax`, and copied to `%r13`), which is dereferenced on the next instruction after the call to `OSSpinLockLock`, and that causes the segfault... But I'm not sure what the jumps in there are doing, so I could be way off. Regardless, I can't find anything *in my code* that would be causing this problem... – DaoWen Nov 01 '16 at 17:33
  • Not familiar with lldb; [this](http://lldb.llvm.org/lldb-gdb.html) will be useful. I'm not sure why you think it's after the call to OSSpinLockLock, as the disassembly I posted shows the InstructionPointer at line 97, which means it has just executed the call to OSSpinLockLock. Perhaps memory corruption occurred earlier on, so this is just the after effect. As a test, putting in a mutex around the `if` section of fib() may be interesting. – TheDarkKnight Nov 02 '16 at 09:02
  • @TheDarkKnight - The disassembly shows the machine instructions near the current instruction address in memory, not a recent history of the instructions executed. The fact that it segfaults while executing the move just after the `call` to `OSSpinLockLock` means that either 1) the call just returned or 2) control jumped directly to that instruction (skipping the call). Since there are no nearby jumps with that target, I doubt that option #2 is the case. I did find a possible ABA problem with the CAS, and I added an assertion to catch that—but that doesn't seem to be the problem either... – DaoWen Nov 02 '16 at 16:13

1 Answer


I looked at this for several hours because I wanted to know the solution.

What I found is that the code is running over the stack and thread-private data, overwriting thread IDs in the process. The linked list in the code points to and uses the addresses of stack variables. The code only works because of the timing of the threads and the number of threads spawned.

If this spawns fewer than 20 or so threads, the linked-list memory does not step on other data; it all comes down to how memory is laid out and how threads are killed. As long as the program terminates before a thread that has been crushed wakes up, it is okay.

The reason it runs on Linux and not OS X is probably luck, combined with memory layout and the time spent spinning in the usleep() loops.

The use of usleep in multithreaded applications should be reviewed.

This is discussed at length in many sources:

https://computing.llnl.gov/tutorials/pthreads/#Overview

https://en.wikipedia.org/wiki/ABA_problem

along with W.R. Stevens, "UNIX Network Programming, Vol. 1", Chapter 23 specifically.

Reading these resources will explain why this code does not work and how it should be written.

Vanderdecken
  • Thanks for looking into this! I need you to clarify something though: *"What I found is the code is running over the stack and thread private data such that it is overwriting thread ids."* Are you saying that the threads' runtime stacks are overflowing and clobbering data, or that the pthread library's internal data structures are overflowing and clobbering other data? – DaoWen May 10 '17 at 19:00
  • @DaoWen I believe the linked list he is trying to implement is the problem, as I looked at the list and it was showing the node memory blocks as being in the sp stack register's address space. I did not look hard enough at the stack addresses. I am not sure of the address relationship between the thread stacks and the main stack space to give a good answer. For some reason the pthread_id was being set to zeros during the call to pthread_create. Also there were random crashes that suggest a stack overwrite by rogue pointers. – Vanderdecken May 10 '17 at 20:37
  • *"... the node memory blocks as being on the sp stack register address space"* That's the expected behavior. [See line 59 in the sample program gist.](https://gist.github.com/DaoWen/c2bb8c1b7ae75791255a46e2a0682d94#file-fib-threads-c-L59) *"For some reason the pthread_id was being set to zero's during the call to pthread_create."* It's been a long time since I was trying to debug this, but I seem to remember that happening because of the bad return value from `__bsdthread_create`. The discussion in the comments has some more details. – DaoWen May 10 '17 at 21:16
  • The thread failure would make sense to return the 0. I did not think of that. – Vanderdecken May 11 '17 at 20:46