
I have a Linux application on an i7 using boost::thread, and currently about 10 threads (across the 8 cores) run simultaneously, doing image processing on images ranging from approximately 240 × 180 to 960 × 720 pixels, so naturally the smaller images finish faster than the larger ones. At some point in the future I may need to bump up the number of threads, so there will definitely be many more threads than cores.

So, is there a rule of thumb to decide which order to start and wait for threads; fastest first to get the small tasks out of the way, or slowest first to get it started sooner so finished sooner? Or is my synchronisation code wonky?

For reference, my code is approximately this, where the lower-numbered threads are slower:

Globals

static boost::mutex                  mutexes[MAX_THREADS];
static boost::condition_variable_any conditions[MAX_THREADS];

Main thread

// Wake threads
for (int thread = 0; thread < MAX_THREADS; thread++)
{
    mutexes[thread].unlock();
    conditions[thread].notify_all();
}
// Wait for them to sleep again
for (int thread = 0; thread < MAX_THREADS; thread++)
{
    boost::unique_lock<boost::mutex> lock(mutexes[thread]);
}

Processing thread

static void threadFunction(int threadIdx)
{
    while(true)
    {
        boost::unique_lock<boost::mutex> lock(mutexes[threadIdx]);

        // DO PROCESSING

        conditions[threadIdx].notify_all();
        conditions[threadIdx].wait(lock);
    }
}
Ken Y-N
    Why do you expect that using more threads than cores would favorably affect performance? – IInspectable Dec 16 '15 at 00:36
  • @IInspectable Actually that depends on the application: disk access for instance makes the thread wait a lot, and having other threads working during that wait time is beneficial. However if it's pure CPU work, as it could be the case here, multiplying the threads may have the opposite effect. – Déjà vu Dec 16 '15 at 00:58
  • 1
    Why don't you just use a `boost::asio::io_service` (a threadpool implementation from boost::asio)? You could dispatch your image processing tasks on a pool with max hardware threads, which would be more efficient. – Naios Dec 16 '15 at 01:14
  • 1
    Performing a `notify_all` then `wait`ing immediately after seems... odd. Usually, in a producer/consumer pattern, a given condition is notified by one set of threads, and waited on by another; having one thread both notify and wait on the same condition is nonsensical. For example, you might have a bounded queue, and when it's full, the producers `wait` on the `not_full` condition; when a consumer takes an item, it notifies the same condition. Similarly, when the queue is empty, the consumers `wait` on `not_empty`, and when a producer adds a value, it notifies on same. – ShadowRanger Dec 16 '15 at 01:16
  • @ringø: Even in case threads are waiting for disk accesses to run to completion, scheduling more threads than you have cores isn't recommended. This will trash cache coherence, and slow down everyone else. A better solution would be to prefetch data, and schedule no more threads than there are cores to execute them. – IInspectable Dec 16 '15 at 01:27
  • @IInspectable: Perhaps I didn't explain myself well enough! I want to avoid degrading performance too much. Note the tasks are pure compute tasks, no disk IO. – Ken Y-N Dec 16 '15 at 01:42
  • 1
    You already stopped caring about performance when you decided to use *n + x* threads (where *n* is the number of cores, and *x* is a non-zero, positive value). – IInspectable Dec 16 '15 at 01:46
  • @ShadowRanger: So, a preferred pattern would be one consumer per core, the producer places all jobs in the queue, then whichever consumer is idle can pick up a waiting job? That sounds much better to me. (The code I inherited actually creates 64 threads but only uses 10) – Ken Y-N Dec 16 '15 at 01:47
  • @KenY-N: Yeah, that's one option. Producer reads data, puts appropriate chunks in the queue, consumers pull chunks when they finish their last task. You need locks and conditions to do it properly (so threads aren't busy waiting for work if the queue empties, nor do you exhaust main memory populating an unbounded queue too quickly), but it's a fairly common pattern. Using an existing thread pool would likely save you some time though; thread pools typically are already organized around workers executing tasks, and would handle a lot of this for you. – ShadowRanger Dec 16 '15 at 01:52
  • @IInspectable: That's true if and only if the threads are *completely* compute-bound--but in this context, even reading from main memory can easily count as being I/O bound. In such a case, having at least a few extra threads can let a core do something useful while a thread waits for data from memory. Of course, that's generally only useful if you have something like hyperthreading that lets you switch to another thread very quickly--a full kernel-switch generally takes too long to gain much in such a situation. – Jerry Coffin Dec 16 '15 at 02:17
  • Just adding fuel to the fire, I answered http://stackoverflow.com/q/16268469/1553090 a while back, which was a case where performance was increased by upping the thread count and doing less in each thread. – paddy Dec 16 '15 at 02:28

1 Answer


Thanks to the hints from the commenters and much Googling, I've completely reworked my code, which now seems slightly faster without the per-thread mutexes.

Globals

// None now

Main thread

boost::asio::io_service ioService;
boost::thread_group threadpool;
{
    // The io_service::work object keeps run() from returning while jobs
    // are still being posted; when it goes out of scope at the end of
    // this block, run() exits once the queue has drained.
    boost::asio::io_service::work work(ioService);
    for (size_t i = 0; i < boost::thread::hardware_concurrency(); i++)
        threadpool.create_thread(boost::bind(&boost::asio::io_service::run, &ioService));

    // MAX_THREADS is now really the number of jobs, not threads
    for (int thread = 0; thread < MAX_THREADS; thread++)
        ioService.post(boost::bind(threadFunction, thread));
}
threadpool.join_all();

Processing thread

static void threadFunction(int threadIdx)
{
    // DO PROCESSING
}

(I've made this a Community Wiki as it's not really my answer.)
