
I'm working on a program that sends messages between threads. It looks at which threads are busy; if one is free, it grabs the first free one (or in some cases several free ones), marks it as taken, sends work to it, does its own work, and then waits for the sent work to complete. The bottleneck in all of this is coordinating between threads about which thread is taken. This seems like a problem others have surely encountered; I have some solutions to share, but I also want to know whether you can do better.

My solution ultimately boils down to this: maintain a set representing the indexes of free threads, with the ability to grab an item from the set (obtaining the index of a free thread) or add an index back to the set (growing it by one). Order is unimportant, and I know the maximum size of the set in advance.

I've tried a few ways of doing this:

  1. Maintain a single atomic `unsigned long long` and use `__builtin_clzll` to locate a set bit in a single instruction (interestingly, `__builtin_ffsll` was 10x slower; I suspect it isn't a single instruction on my processor), then use a lookup table of bitmasks to flip bits on and off, simultaneously claiming that thread number. I loved this version because I only needed to share a single atomic `unsigned long long` and could use a single atomic operation, but doing `fetch_and` in a loop until your claim succeeds ended up being slower than taking a lock and doing it non-atomically. The locking version was probably faster because threads didn't get stuck in loops repeating the same operation while waiting for others to finish theirs. A sketch of this approach follows the list.

  2. Use a linked list of free indexes: allocate all nodes in advance and maintain a head pointer; if it points to nullptr, we've reached the end of the list. I've only done this with a lock, because it needs two operations to happen simultaneously.

  3. Maintain an array containing the indexes of all claimable threads. Either increment the top index and return the previous slot to claim a thread, or swap the last taken thread's index with the one being freed and decrement the index, checking first whether any thread is free. (A sketch of this one also follows the list.)

  4. Use the moodycamel queue, a lock-free queue implementation.
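
For concreteness, here is a minimal sketch of the shape of approach 1 (simplified, with illustrative names; it assumes at most 64 threads):

```cpp
#include <atomic>
#include <cstdint>

// Free-set for up to 64 threads: bit i set means thread i is free.
class BitmaskFreeSet {
    std::atomic<std::uint64_t> mask_;
public:
    explicit BitmaskFreeSet(unsigned n)
        : mask_(n >= 64 ? ~0ull : (1ull << n) - 1) {}

    // Claim a free thread; returns its index, or -1 if none are free.
    int try_claim() {
        std::uint64_t m = mask_.load(std::memory_order_relaxed);
        while (m != 0) {
            int idx = 63 - __builtin_clzll(m);   // pick a set (free) bit
            std::uint64_t bit = 1ull << idx;
            // Atomically clear the bit. If it was still set in the old
            // value, the claim succeeded; otherwise someone beat us, retry.
            std::uint64_t old = mask_.fetch_and(~bit, std::memory_order_acquire);
            if (old & bit) return idx;
            m = old & ~bit;
        }
        return -1;
    }

    // Mark thread idx as free again.
    void release(int idx) {
        mask_.fetch_or(1ull << idx, std::memory_order_release);
    }
};
```

The `fetch_and` retry loop here is exactly the part that, under contention, ended up slower than the locking variant.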
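
And the shape of approach 3, treating the array as a stack of free indexes under a mutex (again simplified; equivalent to the increment/swap scheme described above):

```cpp
#include <mutex>
#include <vector>

// Fixed-capacity set of free thread indexes, guarded by a mutex.
class ArrayFreeSet {
    std::vector<int> free_;   // free_[0..top_) holds the free indexes
    std::size_t top_;
    std::mutex mtx_;
public:
    explicit ArrayFreeSet(int n) : free_(n), top_(n) {
        for (int i = 0; i < n; ++i) free_[i] = i;
    }

    // Claim a free thread; returns its index, or -1 if none are free.
    int try_claim() {
        std::lock_guard<std::mutex> lock(mtx_);
        if (top_ == 0) return -1;
        return free_[--top_];
    }

    // Return a thread index to the set.
    void release(int idx) {
        std::lock_guard<std::mutex> lock(mtx_);
        free_[top_++] = idx;
    }
};
```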

Happy to share the full C++ code (the sketches above are simplified); the post was getting quite long when I tried to include it, though.

The first three are all fast. `__builtin_clzll` is not universally supported, so even though it's a little faster, it's probably not faster enough to be worth it, and it's probably 10x slower on processors that don't support it natively, similar to how `__builtin_ffsll` was slow for me. The array and the linked list are roughly as fast as each other, with the array slightly faster under no contention. Moodycamel is about 3x slower.

Think you can do better and have a faster way to do this? It's still the slowest part of this process, and in some cases only barely worth the cost.

Thoughts for directions to explore:

  • It feels like there should be a way using a couple of atomics, maybe an array of atomics operated on one at a time. Every operation has to maintain the integrity of the set, though, which makes this tricky; most solutions at some point need two operations to happen simultaneously. In my benchmarking, atomics look like they could be significantly faster than locking. (A sketch of one such shape follows this list.)
  • Might be able to keep using a lock but remove the need to check whether the list is empty or to swap elements in the array.
  • Maybe use a different data structure: for example, two arrays, adding to one while emptying the other, then switching which one is being filled and which is being emptied. That means no swapping of elements, just swapping two array pointers, and only when one array is empty.
  • Could have the threads that launch threads add work to a list of work to be done, so another thread can grab it while this thread keeps going. That still ultimately needs a similar thread-safe set.
  • See if the brilliant people on Stack Overflow see directions to explore that I haven't seen yet :)
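
On the first direction, a minimal sketch of what an array-of-atomics free set might look like (illustrative names, not tested under my workload; assumes C++17 for the over-aligned allocation):

```cpp
#include <atomic>
#include <vector>

// One atomic flag per thread: true means free. Each flag is padded to
// its own cache line to avoid false sharing between neighboring flags.
struct alignas(64) PaddedFlag {
    std::atomic<bool> free{true};
};

class FlagFreeSet {
    std::vector<PaddedFlag> flags_;
public:
    explicit FlagFreeSet(std::size_t n) : flags_(n) {}

    // Scan for a free thread and claim it with a single atomic exchange.
    int try_claim() {
        for (std::size_t i = 0; i < flags_.size(); ++i)
            if (flags_[i].free.load(std::memory_order_relaxed) &&
                flags_[i].free.exchange(false, std::memory_order_acquire))
                return static_cast<int>(i);
        return -1;   // none free right now
    }

    void release(int idx) {
        flags_[idx].free.store(true, std::memory_order_release);
    }
};
```
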
mczarnek
  • Have you considered inverting the problem: [work-stealing task schedulers](https://en.wikipedia.org/wiki/Work_stealing), fork-join parallelism, or just plain thread pools? – Botje Oct 06 '20 at 14:47
  • Work stealing is indeed an interesting idea to explore further, but it still needs to add/remove work from a similar set, so it has the same problem. Effectively this is a thread pool; I'm just making my own, and I have investigated how existing ones do it: these methods are faster than what I found. Interesting, I thought fork-join spawned an entire new thread. What I would really like, though, is to use real threads that truly run in parallel, which is what my methods allow. Will look into work stealing; it would allow threading in cases where threads aren't available this moment but will be very soon. – mczarnek Oct 06 '20 at 15:24
  • The crucial difference between work stealing and a thread pool is that worker threads "keep work for themselves" and only pull from a central pool or steal from other threads once they run out of work. That means there is drastically less inter-thread communication needed. This can be improved even further if the work queues are [lock-free](https://www.baeldung.com/lock-free-programming). – Botje Oct 06 '20 at 15:57
  • Still need some locking there when grabbing from and adding to a worker queue, right? Might be a way around that. I can definitely see advantages to this method, and that's an interesting point. Any reasons you know of to use a central pool vs. iterating through the worker threads and checking each when looking for work? Love the idea of using an unsigned long long plus clz: threads mark whether they have work left by flipping their own bit atomically, then work is grabbed with clz. But hardware support is an issue. – mczarnek Oct 06 '20 at 16:15
  • No, a lock-free queue is lock-free. [Here](https://github.com/cameron314/concurrentqueue) is the first implementation I found on GitHub. From a cursory glance it requires that `std::atomic` is lock-free. Be sure to check the linked blog post for details and benchmarks! – Botje Oct 06 '20 at 16:21
  • As for your question about work stealing: I vaguely recall a paper that proved work stealing is close to optimal in terms of core utilization, but it strongly depends on how large your work items are and how often you can split a work item into smaller parts. – Botje Oct 06 '20 at 16:23
  • Makes sense, I do love the idea of work stealing. And I know about that one; see the moodycamel queue (number 4 in my list of solutions). It's a good one, and I've looked through the code, but it's significantly slower than locking under contention. My gut tells me there is an even faster solution, especially with work stealing, where I only need to lock the head, but I could be wrong :) – mczarnek Oct 06 '20 at 16:38

1 Answer


All you need is a thread pool, a queue (a list, a deque, or a ring buffer), a mutex, and a `condition_variable` to signal when a new work item has been added to the queue.

Wrap work items in `packaged_task` if you need to wait on the result of each task.

When adding a new work item to the queue: 1) lock the mutex, 2) add the item, 3) release the lock, and 4) call `cv.notify_one()`, which will unblock the first available thread.
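
A minimal sketch of that setup (my own illustration of the steps above, with illustrative names; not a production-grade pool):

```cpp
#include <condition_variable>
#include <deque>
#include <functional>
#include <future>
#include <mutex>
#include <thread>
#include <vector>

class ThreadPool {
    std::vector<std::thread> workers_;
    std::deque<std::packaged_task<void()>> queue_;
    std::mutex mtx_;
    std::condition_variable cv_;
    bool done_ = false;

public:
    explicit ThreadPool(unsigned n) {
        for (unsigned i = 0; i < n; ++i)
            workers_.emplace_back([this] {
                for (;;) {
                    std::packaged_task<void()> task;
                    {
                        std::unique_lock<std::mutex> lock(mtx_);
                        cv_.wait(lock, [this] { return done_ || !queue_.empty(); });
                        if (done_ && queue_.empty()) return;
                        task = std::move(queue_.front());
                        queue_.pop_front();
                    }
                    task();   // run the work item outside the lock
                }
            });
    }

    // Enqueue work; the returned future lets the caller wait on completion.
    std::future<void> submit(std::function<void()> fn) {
        std::packaged_task<void()> task(std::move(fn));
        std::future<void> fut = task.get_future();
        {
            std::lock_guard<std::mutex> lock(mtx_);   // 1) lock, 2) add
            queue_.push_back(std::move(task));
        }                                             // 3) release the lock
        cv_.notify_one();                             // 4) wake one worker
        return fut;
    }

    ~ThreadPool() {
        {
            std::lock_guard<std::mutex> lock(mtx_);
            done_ = true;
        }
        cv_.notify_all();
        for (auto& t : workers_) t.join();
    }
};
```

A caller can then do `pool.submit(work).wait()`, or keep the future, do its own work first, and wait only when it needs the task to be finished.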


Once the basic setup is working, if tasks are too fine-grained, work stealing can be added to the solution to improve performance. It's also important to use the main thread to perform part of the work instead of just waiting for all tasks to complete. These simple (albeit clunky) optimizations often result in >15% overall improvement in performance due to reduced context switching.

Also don't forget to think about false sharing. It might be a good idea to pad task items to 64 bytes, just to be on the safe side.
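
For example (a sketch; 64 bytes is the common x86-64 cache-line size):

```cpp
#include <functional>

// Pad each task item to a 64-byte cache line so that adjacent items
// written by different threads never share a line (false sharing).
struct alignas(64) PaddedTask {
    std::function<void()> fn;   // the work item itself
};
static_assert(sizeof(PaddedTask) % 64 == 0,
              "alignas rounds the struct size up to a multiple of 64");
```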

rustyx
  • Using a pool to track future work instead of tracking currently free threads does seem like a really good idea; good thing I posted here. Will investigate `packaged_task`. Essentially what I'm trying to do is optimize the add-to/remove-from-queue step and come up with a more efficient method for it, as it is the bottleneck (because it requires locking), so I'd still like to improve efficiency there. Additionally, I have explored condition variables but kept running into the ABA problem, which is resolved by a work queue over a thread queue. – mczarnek Oct 06 '20 at 15:39