I have a task scheduling code that I want to compare with a baseline that basically creates a new pthread for each task (I know that's not a great idea, but that's why this is just the baseline for comparison). However, for some reason the pthreads version keeps giving me segfaults on OS X1, but when I try running the same code on Linux2, everything works fine.
On OS X, it occasionally completes successfully, but it usually segfaults in pthread_create
, and sometimes segfaults in pthread_join
instead. I also found that if I call pthread_create
supplying the PTHREAD_CREATE_DETACHED
attribute, and skip the pthread_join
s, then the segfault problems goes away.
Included at the bottom of this question is a stripped-down version of the code that I tried to minimize as much as possible while still causing the problematic segfaults.
My question is the following:
Why is this crashing on OS X, but not on Linux?
Maybe there's a bug that I'm overlooking that happens to be benign on Linux. I'm pretty certain that the mutex and CAS operations are providing sufficient synchronization, so I don't think this is a data race issue.
Like I said, I can work around this by using PTHREAD_CREATE_DETACHED
, but I'm really curious about the root cause of the segfaults. My feeling is that I'm currently overwhelming some system resources limit that isn't being freed fast enough when I require threads to be joined, but the problem is fixed for detached pthreads because they can immediately be destroyed when the thread exits; however, I'm not familiar enough with the pthread internals to confirm/refute my hypothesis.
Here's a general outline of how the code works:
We have a stack of pthreads (accessed through
wait_list_head
) that are currently blocked awaiting a signal on a thread-specific condition variable.The
main
thread creates a single child-thread, and then waits for all the transitive children to complete (by checking for theactive
thread counter to reach zero).The child thread computes
Fibonacci(N=10)
by creating two sub-threads to computeFibonacci(N-1)
andFibonacci(N-2)
, and then joins two threads, summing their results and returning that sum as its own result. That's how all the child threads work as well, with a base case ofN<2
just returningN
.Note that the blocked-thread-stack semi-randomizes which threads are joined by the parent threads. I.e., a parent thread might join the children of one of its siblings rather than joining its own children; however, the final sum will still be the same thanks to the commutativity of integer addition. Removing this "randomizing" behavior by having each parent join its own children also eliminates the segfaults.
There's also a simple purely recursive Fibonacci implementation (
pure_fib
) that is used to compute the expected answer for validation.
Here's a some pseudo-code of the core behavior:
Fibonacci(N):
If N < 2:
signal_parent(N)
Else:
sum = 0
pthread_create(A, Fibonacci, N-1)
pthread_create(B, Fibonacci, N-2)
sum += suspend_and_join_child(); // not necessarily thread A
sum += suspend_and_join_child(); // not necessarily thread B
signal_parent(sum)
The minimal working example of the C code is included below.
1 Apple LLVM version 7.0.0 (clang-700.1.76), Target: x86_64-apple-darwin14.5.0
2 gcc (Ubuntu 5.4.0-6ubuntu1~16.04.2) 5.4.0 20160609
#include <assert.h>
#include <pthread.h>
#include <inttypes.h>
#include <stdio.h>
#include <stdlib.h>
#include <stdbool.h>
#include <unistd.h>
#define N 10
#define RCHECK(expr) \
do { \
int _rcheck_expr_return_value = expr; \
if (_rcheck_expr_return_value != 0) { \
fprintf(stderr, "FAILED CALL: " #expr "\n"); \
abort(); \
} \
} while (0);
typedef struct wait_state_st {
volatile intptr_t val;
pthread_t other;
pthread_mutex_t lock;
pthread_cond_t cond;
struct wait_state_st *next;
} wait_state;
wait_state *volatile wait_list_head = NULL;
volatile int active = 0;
static inline void push_thread(wait_state *ws) {
do {
ws->next = wait_list_head;
} while (!__sync_bool_compare_and_swap(&wait_list_head, ws->next, ws));
}
static inline wait_state *pop_thread(void) {
wait_state *ws, *next;
do {
ws = wait_list_head;
while (!ws) {
usleep(1000);
ws = wait_list_head;
}
next = ws->next;
} while (!__sync_bool_compare_and_swap(&wait_list_head, ws, next));
assert(ws->next == next); // check for ABA problem
ws->next = NULL;
return ws;
}
intptr_t thread_suspend(int count) {
intptr_t sum = 0;
// WAIT TO BE WOKEN UP "count" TIMES
for (int i = 0; i < count; i++) {
wait_state ws;
ws.val = -1;
ws.other = pthread_self();
RCHECK(pthread_mutex_init(&ws.lock, NULL));
RCHECK(pthread_cond_init(&ws.cond, NULL));
RCHECK(pthread_mutex_lock(&ws.lock));
push_thread(&ws);
while (ws.val < 0) {
RCHECK(pthread_cond_wait(&ws.cond, &ws.lock));
}
assert(ws.other != pthread_self());
pthread_join(ws.other, NULL);
sum += ws.val;
RCHECK(pthread_mutex_unlock(&ws.lock));
}
return sum;
}
void thread_signal(intptr_t x) {
// wake up the suspended thread
__sync_fetch_and_add(&active, -1);
wait_state *ws = pop_thread();
RCHECK(pthread_mutex_lock(&ws->lock));
ws->val = x;
ws->other = pthread_self();
RCHECK(pthread_cond_signal(&ws->cond));
RCHECK(pthread_mutex_unlock(&ws->lock));
}
void *fib(void *arg) {
intptr_t n = (intptr_t)arg;
if (n > 1) {
pthread_t t1, t2;
__sync_fetch_and_add(&active, 2);
RCHECK(pthread_create(&t1, NULL, fib, (void *)(n - 1)));
RCHECK(pthread_create(&t2, NULL, fib, (void *)(n - 2)));
intptr_t sum = thread_suspend(2);
thread_signal(sum);
}
else {
thread_signal(n);
}
return NULL;
}
intptr_t pure_fib(intptr_t n) {
if (n < 2) return n;
return pure_fib(n-1) + pure_fib(n-2);
}
int main(int argc, char *argv[]) {
printf("EXPECTED = %" PRIdPTR "\n", pure_fib(N));
assert("START" && wait_list_head == NULL);
active = 1;
pthread_t t;
RCHECK(pthread_create(&t, NULL, fib, (void *)N));
while (active > 0) { usleep(100000); }
intptr_t sum = thread_suspend(1);
printf("SUM = %" PRIdPTR "\n", sum);
printf("DONE %p\n", wait_list_head);
assert("END" && wait_list_head == NULL);
return 0;
}
Update: This Gist contains a slight variation of the above code that uses a global mutex for all the thread push/pop operations, and thus avoids the possible ABA problem with the CAS above. This version of the code still regularly segfaults regularly, but only about 30–50% of the time rather than 99% of the time like the above code.
Again, I feel like this must be an issue with the pthreads library running out of resources when threads aren't joined/destroyed quickly enough, but I don't know how to confirm that.