I have a custom implementation of detours on macOS and a test application that uses it. The test application is written in C, compiled for macOS x86_64, and runs on an Intel i9 processor.
The implementation works fine with a multitude of functions. However, if I detour pthread_create, I encounter strange behaviour: threads that have been spawned via a detoured pthread_create do not execute instructions. I can step through instructions one by one, but as soon as I continue, the thread does not progress. There are no mutexes or other synchronisation involved, and the result of the function is 0 (success). The exact same application with detours turned off works fine, so the application itself is unlikely to be the culprit.
This does not happen all the time: sometimes the threads are fine, but at other times the test application stalls in the following state:
(lldb) bt all
* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
* frame #0: 0x00007fff7296f55e libsystem_kernel.dylib`__ulock_wait + 10
frame #1: 0x00007fff72a325c2 libsystem_pthread.dylib`_pthread_join + 347
frame #2: 0x0000000100001186 DetoursTestApp`main + 262
frame #3: 0x00007fff7282ccc9 libdyld.dylib`start + 1
frame #4: 0x00007fff7282ccc9 libdyld.dylib`start + 1
thread #2
frame #0: 0x00007fff72a2cb7c libsystem_pthread.dylib`thread_start
Relevant memory pages have the executable flag set. The detour function that intercepts the thread creation looks like this:
static int pthread_create_detour(pthread_t* thread,
                                 const pthread_attr_t* attr,
                                 void* (*start_routine)(void*),
                                 void* arg)
{
    detour_count++;
    /* (void*)-1 is the value of RTLD_NEXT in macOS's <dlfcn.h> */
    pthread_fn original = (pthread_fn)detour_original(dlsym((void*)-1, "pthread_create"));
    return original(thread, attr, start_routine, arg);
}
Here, detour_original returns a pointer to [original function + size of the function's prologue].
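For anyone who wants to experiment with the shape of this detour without my detours library, here is a minimal self-contained sketch. It is not my actual implementation: detour_original is stubbed out as the identity function (nothing is patched here, so there is no prologue to skip), and worker and spawn_and_join are hypothetical names added only to exercise the call path. The fallback definition of RTLD_NEXT is an assumption that holds for macOS and glibc.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE /* exposes RTLD_NEXT on glibc; not needed on macOS */
#endif
#include <dlfcn.h>
#include <errno.h>
#include <pthread.h>
#include <stddef.h>

#ifndef RTLD_NEXT
#define RTLD_NEXT ((void *)-1) /* assumption: value used by macOS and glibc */
#endif

typedef int (*pthread_fn)(pthread_t *, const pthread_attr_t *,
                          void *(*)(void *), void *);

static int detour_count = 0;

/* HYPOTHETICAL stand-in. The real detour_original returns
 * [original function + prologue size]; this stub returns its argument
 * unchanged because no function is patched in this sketch. */
static void *detour_original(void *fn)
{
    return fn;
}

static int pthread_create_detour(pthread_t *thread,
                                 const pthread_attr_t *attr,
                                 void *(*start_routine)(void *),
                                 void *arg)
{
    detour_count++;
    void *sym = dlsym(RTLD_NEXT, "pthread_create");
    if (sym == NULL)
        return ENOSYS; /* pthread_create reports errors as error numbers */
    pthread_fn original = (pthread_fn)detour_original(sym);
    return original(thread, attr, start_routine, arg);
}

/* Hypothetical worker: writes a marker so the caller can tell it ran. */
static void *worker(void *arg)
{
    *(int *)arg = 42;
    return NULL;
}

/* Spawns one thread through the detour and joins it.
 * Returns the marker written by the worker, or -1 on failure. */
static int spawn_and_join(void)
{
    pthread_t t;
    int value = 0;
    if (pthread_create_detour(&t, NULL, worker, &value) != 0)
        return -1;
    if (pthread_join(t, NULL) != 0)
        return -1;
    return value;
}
```

With the identity stub this only confirms that the dlsym lookup and the call through the function pointer behave as expected; it does not reproduce the stall, which needs the real patched prologue.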
Tracing through the instructions, everything seems to work correctly and pthread_create returns successfully. Tracing the application's system calls with dtruss does show calls to bsdthread_create:
bsdthread_create(0x10DB964B0, 0x0, 0x7000080DB000) = 29646848 0
I have confirmed that these are the correct arguments.
This behaviour is only observed in release builds. Debug builds work fine, even though the disassembly and execution of a detoured pthread_create and the associated detours code seem to be identical in both cases.
Workarounds
I found a couple of odd workarounds for this issue that don't make much sense. Any one of the items listed after the code can be substituted into the marked line of the detour function:
static int pthread_create_detour(pthread_t* thread,
                                 const pthread_attr_t* attr,
                                 void* (*start_routine)(void*),
                                 void* arg)
{
    detour_count++;
    pthread_fn original = (pthread_fn)detour_original(dlsym((void*)-1, "pthread_create"));
    <...> <== SUBSTITUTE HERE
    return original(thread, attr, start_routine, arg);
}
- A cache flush:
  __asm__ __volatile__("" ::: "memory");
  _mm_clflush(real_pthread_create);
- A sleep of any duration, e.g. usleep(1).
- A printf statement.
- A memory allocation larger than 32768 bytes, e.g. void *data = malloc(40000);
Cache?
All of these seem to point to a stale instruction cache. However, the Intel manual states the following:
A write to a memory location in a code segment that is currently cached in the processor causes the associated cache line (or lines) to be invalidated. This check is based on the physical address of the instruction. In addition, the P6 family and Pentium processors check whether a write to a code segment may modify an instruction that has been prefetched for execution. If the write affects a prefetched instruction, the prefetch queue is invalidated. This latter check is based on the linear address of the instruction.
What's even more interesting is that these workarounds have to be executed for every new thread created, and they run on the main thread, so a stale cache is very unlikely to be the cause. I have also tried adding a cache flush after every memory write that writes instructions, but that did not help. I have also written a memcpy that bypasses the cache using Intel's _mm_stream_si32 intrinsic and used it for every instruction memory write in my implementation, without any success.
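To make the cache-bypassing copy concrete, here is a rough sketch of what such a memcpy can look like. It is an illustration under assumptions rather than the exact code from my implementation: the names stream_copy and stream_copy_roundtrip_ok are hypothetical, and the plain memcpy fallback exists only so the sketch compiles on targets without SSE2.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h> /* _mm_stream_si32, _mm_sfence */
#endif

/* Copies n bytes from src to dst, using non-temporal 32-bit stores
 * (MOVNTI via _mm_stream_si32) for the aligned middle of the buffer. */
static void stream_copy(void *dst, const void *src, size_t n)
{
    unsigned char *d = (unsigned char *)dst;
    const unsigned char *s = (const unsigned char *)src;
#if defined(__SSE2__)
    /* Byte-copy until the destination is 4-byte aligned. */
    while (n > 0 && ((uintptr_t)d & 3u) != 0) {
        *d++ = *s++;
        n--;
    }
    /* Stream whole 32-bit words; the source is read via memcpy so it
     * may be unaligned. */
    while (n >= 4) {
        int word;
        memcpy(&word, s, sizeof word);
        _mm_stream_si32((int *)d, word);
        d += 4;
        s += 4;
        n -= 4;
    }
    /* Order the non-temporal stores before any later stores. */
    _mm_sfence();
#endif
    /* Remaining tail bytes (or the whole buffer without SSE2). */
    memcpy(d, s, n);
}

/* Hypothetical self-check: copies an odd-sized, misaligned buffer and
 * returns 1 if the destination matches the source. */
static int stream_copy_roundtrip_ok(void)
{
    unsigned char src[64], dst[64];
    for (size_t i = 0; i < sizeof src; i++) {
        src[i] = (unsigned char)(i * 7u + 1u);
        dst[i] = 0;
    }
    stream_copy(dst + 1, src + 3, 37);
    return memcmp(dst + 1, src + 3, 37) == 0;
}
```

Note that non-temporal stores only change how the data is written; whether that has any effect on instruction fetch is exactly what I was trying to rule out.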
Race condition?
The next suspect in line is a race condition. However, it's not clear what would be racing, as initially there are no other threads. I also inserted a Fibonacci sequence calculation for a randomly generated number, and the newly spawned threads still stalled.
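For reference, the Fibonacci delay was along these lines. This is a sketch with hypothetical names (fib, busy_fib, fib_sink), and the bound of 90 and the volatile sink are assumptions made so that the result fits in 64 bits and the release-build optimiser does not remove the work. It runs entirely in user space and makes no system calls.

```c
#include <stdint.h>
#include <stdlib.h>

/* Iterative Fibonacci: fib(0) = 0, fib(1) = 1, fib(2) = 1, ... */
static uint64_t fib(unsigned n)
{
    uint64_t a = 0, b = 1;
    for (unsigned i = 0; i < n; i++) {
        uint64_t next = a + b;
        a = b;
        b = next;
    }
    return a;
}

/* Volatile sink so the compiler has to keep the computation. */
static volatile uint64_t fib_sink;

/* Computes fib(n) for a random n in [0, 89]; every intermediate value
 * stays within the range of uint64_t. */
static void busy_fib(void)
{
    unsigned n = (unsigned)(rand() % 90);
    fib_sink = fib(n);
}
```

Calling busy_fib() at the substitution point in the detour did not prevent the stall, unlike usleep, printf or the large malloc.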
The question
What is causing this issue? What other mechanisms could be responsible for this?
At this point I have run out of things to check so any suggestions will be welcome.