I'm working on a program that sends messages between threads. It checks which threads are busy; if one is free, it grabs the first free one (or in some cases several free ones), marks it as taken, sends work to it, does its own work, and then waits for the worker to complete. The bottleneck in all of this is coordinating between threads about which thread is taken. This seems like a problem others must have encountered. I have some solutions to share, but I also want to know whether you can do better than me.
My solution ultimately boils down to this: maintain a set of the indexes of free threads, with one operation that grabs an item from the set (yielding the index of a free thread) and one that adds an index back to the set, increasing its size by one. Order is unimportant. I know the fixed size of the set in advance.
I've tried a few ways of doing this:
Maintain a single unsigned long long int and use '__builtin_clzll' (interestingly, '__builtin_ffsll' was 10x slower; I suspect it isn't supported as a single instruction on my processor) to find a set bit in a single instruction and grab it, then use a lookup table of bitmasks to flip bits on and off, simultaneously claiming the corresponding thread number. I loved this version because I only needed to share a single atomic unsigned long long and could use a single atomic operation, but doing 'fetch_and' in a loop until it succeeds ended up slower than locking and doing the update non-atomically. The locking version was faster, probably because threads didn't get stuck in loops repeating the same operation while waiting for others to finish theirs.
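For reference, here is a minimal sketch of the single-atomic bitmask idea. It is not my benchmarked code: the class name FreeMask and its methods are made up for illustration, it uses a compare-exchange retry loop rather than 'fetch_and', and it assumes GCC or Clang for '__builtin_ctzll' and at most 64 threads.

```cpp
#include <atomic>

// Hypothetical helper: a set of up to 64 free thread indexes in one atomic word.
// Bit i set means thread i is free.
class FreeMask {
public:
    explicit FreeMask(unsigned count)
        : bits_(count >= 64 ? ~0ULL : ((1ULL << count) - 1)) {}

    // Claims the lowest free index, or returns -1 if nothing is free.
    int acquire() {
        unsigned long long cur = bits_.load(std::memory_order_relaxed);
        while (cur != 0) {
            unsigned long long lowest = cur & (~cur + 1);  // isolate lowest set bit
            // On failure, compare_exchange_weak reloads cur with the current value,
            // so the loop retries against fresh state.
            if (bits_.compare_exchange_weak(cur, cur & ~lowest,
                                            std::memory_order_acquire,
                                            std::memory_order_relaxed)) {
                return __builtin_ctzll(lowest);  // GCC/Clang builtin (assumption)
            }
        }
        return -1;
    }

    // Marks idx free again. A single fetch_or is enough, so no retry loop here.
    void release(int idx) {
        bits_.fetch_or(1ULL << idx, std::memory_order_release);
    }

private:
    std::atomic<unsigned long long> bits_;
};
```

Only acquire() needs a retry loop in this sketch, since release() is one read-modify-write. Under heavy contention the retries are presumably where the time goes.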
Use a linked list with all nodes allocated in advance. Maintain a head node and a list; if the head points to nullptr, we've reached the end of the list. I've only done this with a lock because it needs two simultaneous operations.
Maintain an array that holds all the thread indexes available to claim. To claim a thread, increment an array index and return the previous entry; to free one, swap the last taken thread with the one being freed and decrement the index. Check whether a thread is free before claiming.
Use the moodycamel queue, which provides a lock-free queue.
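A simplified sketch of the array version with a lock, to show the shape of it. This is an illustrative variant rather than my exact code: it treats the array as a stack of free indexes (push on free, pop on claim) instead of swapping entries, and the name FreeStack is hypothetical.

```cpp
#include <mutex>
#include <numeric>
#include <vector>

// Hypothetical helper: fixed-capacity stack of free thread indexes behind a mutex.
class FreeStack {
public:
    explicit FreeStack(int count) : slots_(count), top_(count) {
        std::iota(slots_.begin(), slots_.end(), 0);  // initially every index is free
    }

    // Pops a free index, or returns -1 if the stack is empty.
    int acquire() {
        std::lock_guard<std::mutex> lock(mu_);
        if (top_ == 0) return -1;
        return slots_[--top_];
    }

    // Pushes idx back. The caller must only release indexes it acquired,
    // so top_ can never exceed the capacity.
    void release(int idx) {
        std::lock_guard<std::mutex> lock(mu_);
        slots_[top_++] = idx;
    }

private:
    std::mutex mu_;
    std::vector<int> slots_;
    int top_;  // number of free indexes currently stored in slots_
};
```

The critical section is just a bounds check plus one array access, which may be part of why the locked versions hold up well against the retry-loop version.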
Happy to share C++ code; the post was getting quite long when I tried to include it, though.
The first three are all fast. __builtin_clzll is not universally supported, so even though it's a little faster, it's probably not enough to be worth it, and it's probably 10x slower on computers that don't support it natively, similar to how __builtin_ffsll was slow. The array and linked list are roughly as fast as each other; the array seems slightly faster when there's no contention. Moody is 3x slower.
Think you can do better and have a faster way to do this? It's still the slowest part of the process, and in some cases it's only just barely worth the cost.
Thoughts for directions to explore:
- Feels like there should be a way using a couple of atomics, maybe an array of atomics operated on one at a time. The integrity of the set has to be maintained with every operation, though, which makes this tricky. Most solutions at some point need two operations to be done simultaneously. In my benchmarking, atomics seem like they could provide a significantly faster solution than locking.
- Might be able to use a lock but remove the need to check if the list is empty or swap elements in the array.
- Maybe use a different data structure, for example, two arrays, add to one while emptying the other, then switch which one is being filled and which is emptied. This means no need to swap elements but rather just swap two pointers to arrays and only when one is empty.
- Could have the threads that hand out work add it to a list of work to be done, so another thread can grab it while the dispatching thread keeps going. Ultimately this still needs a similar thread-safe set.
- See if the brilliant people on stackoverflow see directions to explore that I haven't seen yet :)
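To make the "array of atomics" direction concrete, here is an untested-for-speed sketch of one way it might look: one atomic flag per thread, claimed with a single exchange. The name FlagSet is hypothetical and I have not benchmarked this. The point is only that each claim or release touches one slot with one atomic operation, so the set never needs two simultaneous updates; the cost is a linear scan to find a free slot.

```cpp
#include <atomic>
#include <cstddef>
#include <memory>

// Hypothetical helper: one atomic "taken" flag per thread index.
class FlagSet {
public:
    explicit FlagSet(std::size_t count)
        : count_(count), taken_(new std::atomic<bool>[count]) {
        for (std::size_t i = 0; i < count_; ++i)
            taken_[i].store(false, std::memory_order_relaxed);  // all free at start
    }

    // Scans for a free slot and claims it; returns -1 if every slot is taken.
    int acquire() {
        for (std::size_t i = 0; i < count_; ++i) {
            // Plain load first so taken slots are skipped without a write;
            // the exchange then claims the slot only if it was still free.
            if (!taken_[i].load(std::memory_order_relaxed) &&
                !taken_[i].exchange(true, std::memory_order_acquire)) {
                return static_cast<int>(i);
            }
        }
        return -1;
    }

    // Frees a slot with a single store.
    void release(int idx) {
        taken_[idx].store(false, std::memory_order_release);
    }

private:
    std::size_t count_;
    std::unique_ptr<std::atomic<bool>[]> taken_;
};
```

Adjacent flags share cache lines here, so padding each flag to its own cache line might matter under contention; that is a guess, not something I've measured.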