1

Let's say I have a for loop with 9000+ iterations, and I want to somehow improve it with threads, say 10.

Function Something(){

    for ( i = 0; i < 9000; ++i ){
        DoStuff();
    }
}

What would be the best way to cover the 9000 iterations with my 10 threads? I'm currently working with pre-C++11 (C++98/03) and pthreads-win32, but I think this is a generic question.

Thanks in advance.

Edit: For this example, let's say DoStuff() handles heavy processing, independent of other iterations. Also, that there are shared resources but that those are covered with mutex variables.

Guj Mil
    just go through http://stackoverflow.com/questions/10792157/c-2011-stdthread-simple-example-to-parallelize-a-loop – AvmnuSng Jun 11 '13 at 16:56
  • Here's a naively simple parallel_for object: http://coliru.stacked-crooked.com/view?id=aa07823fc8598b49c19a1321505f9a72-e54ee7a04e4b807da0930236d4cc94dc – Mooing Duck Jun 11 '13 at 17:44
  • Have you considered using SIMD (i.e. SSE/AVX)? There are three ways to do parallelism with the CPU: Thread level parallelism (TLP), instruction level parallelism (ILP), and SIMD. AVX is eight wide for float (AVX2 for int as well). In fact I would recommend optimizing the code with SIMD (if possible) first before threading. –  Jun 12 '13 at 05:54

4 Answers

3

The answer REALLY depends on what DoStuff() actually does. If it's some large vector that you are multiplying with another large (or small) vector, then chopping it up into 10 sections is probably not that difficult. This works OK for any CPU intensive work where each calculation is independent of other calculations. Calculating the sum of all elements will also work OK, but you have to sum up a section, then store the result and when all threads finish, sum up the different sections.

There are also calculations which are completely useless to parallelize. Calculating Fibonacci numbers using the F(n) = F(n-1) + F(n-2) method won't work at all well in threads, since you need the result of the previous step before you can calculate the current step.

If, on the other hand, DoStuff is reading 10 million records from a single file, it's very unlikely that having more threads will help at all - since reading a file sequentially is a little faster than scattering reads all over the place, and the disk is MUCH slower than the processor, so you won't gain anything.

Mats Petersson
    If `DoStuff` is reading records from a file, splitting that into threads will probably make it _much slower_ – Mooing Duck Jun 11 '13 at 17:07
  • Yes, that's what I was trying to say - although I was conservative in saying that it won't give any benefit, rather than saying it will be much slower. – Mats Petersson Jun 11 '13 at 17:12
  • Actually, it dawns on me that it _is_ possible to usefully calculate fibonacci in parallel. Thread one calculates F(100) and F(101), Thread two calculates F(201) _in terms of_ F(100) and F(101), and then when both are done, put the results of thread one into the function generated by thread two, and you get a final result. – Mooing Duck Jun 11 '13 at 17:17
  • @MooingDuck, no it won't. Threading will only make the calculation of Fibonacci numbers slower http://stackoverflow.com/questions/17002380/fibonacci-numbers-with-openmp-tasks –  Jun 12 '13 at 05:48
0

Depends a lot on what is inside DoStuff(). If the work in there depends on other iterations, or updates external data that must be shared across DoStuff() runs, then threads may even slow things down. If DoStuff() can run independently and has its own memory to work in that doesn't conflict with other threads, and each run takes long enough to overcome the initial overhead of setting up the threads and joining them when complete, then create your 10 threads above the loop, give each thread a slice of, say, 900 iterations, and join/kill them when complete. Or use a thread-pool construct and let it do the division for you.

A generic answer for a generic question.

Michael Dorgan
    Let's say DoStuff() handles a heavy load of processing, and is independent of other iterations. It does use shared resources, but let's say access is all controlled with mutex variables. My initial approach was to divide the iterations between threads (send to each thread an initial and final iteration, until all iterations are covered by the total threads), and then join all threads. I was unsure if this was the best approach to solve this problem. Thanks for your answer. – Guj Mil Jun 11 '13 at 17:10
0

One approach would be to delegate portions of the loop to different threads. With 10 threads and 9000 iterations, let one thread handle the range 0-899, the second thread the range from 900-1799, and so on. The pseudocode would be as follows:

Function Thread(int start, int count){

    for ( i = start; i < (start + count); ++i ){
        DoStuff();
    }
}

Function Something(){

    for ( i = 0; i < 10; ++i ){
        SpawnThread(Thread, (i * 900), 900);
    }
    }

}
Ziffusion
  • Or just use a compiler that, given the right command line switches, will generate code that chooses to do that automatically at runtime. Intel and Sun's C compilers do that for you, don't think GCC does. – bazza Jun 22 '13 at 06:24
0

Based on your edit, it sounds to me like there may be a solution that doesn't involve explicit threads at all. You may well be able to use OpenMP to execute the code in parallel without writing any of the threading yourself. This could be about as simple as something like:

Function Something(){

    #pragma omp parallel for // ...
    for ( int i = 0; i < 9000; ++i ){
        DoStuff();
    }
}

Here, the ... means you might need (or want) to add some more annotations there. For example, you can specify which variables will be shared, which will be independent for each thread, etc.

This may be a lot less glamorous than writing your own threading code, but it can be quite effective. In particular, the OpenMP run-time will normally decide how many threads to use based on available processor resources, so the count of 10 threads wouldn't be hard-coded - in a couple of years, when you have a machine with 16 cores, you wouldn't have to rewrite anything to take advantage of them.

At the same time, it's true that OpenMP has limitations. For the situation you've described (executing loop iterations in parallel) it works quite nicely. It's not nearly as well suited to some other scenarios (e.g., creating an execution pipeline so one step of processing happens on one core, the next step on the next core, and so on).

Jerry Coffin