Optimal size of a parallel for loop body

Question

Suppose you have a parallel-for loop implementation, e.g. ConcRT's parallel_for. Is it always best to put all the work inside one loop body?

Take the following example:

for(size_t i = 0; i < size(); ++i)
{
    DoSomething(a[i], b[i]);
}
for(size_t i = 0; i < size(); ++i)
{
    DoSomethingElse(a[i], b[i]);
}

compared with

for(size_t i = 0; i < size(); ++i)
{
    DoSomething(a[i], b[i]);
    DoSomethingElse(a[i], b[i]);
}

The second variant would be the obvious way to go, but when it comes to parallel processing, might there be other considerations?

I just had a case where option 1 was faster than option 2 (~30 ms vs. ~38 ms on average) using parallel_for. But I'm not experienced at benchmarking parallel algorithms, so maybe I measured it wrong. Unfortunately, I cannot post the actual code for this observation.

Are there any rules of thumb or additional considerations, or is it just a matter of trying and benchmarking?

I think the second option has better locality, if `DoSomething` is not too messy. — Elazar, May 26 '13 at 17:47
Looks like you're trying to second-guess the compiler. Your issue is compiler dependent, especially on optimization levels. Some compilers may be smart enough to recognize the parallelism opportunity in the second example, some not. Research more about your compiler and how to help it recognize parallel code fragments. Perhaps a `#pragma` is involved? — Thomas Matthews, May 26 '13 at 17:56
Related: http://stackoverflow.com/questions/8547778/why-is-one-loop-so-much-slower-than-two-loops/8547993#8547993 — Stephan Dollberg, May 26 '13 at 21:28

score 0 · Answer 1 · answered May 26 '13 at 17:51

It very much depends on what you do in DoSomething and DoSomethingElse.

Let's say that DoSomething needs some data from memory. When you run it in its own loop, that data stays in the cache. But when you switch from DoSomething to DoSomethingElse, which also needs its own data from memory, the contents of the cache change, and the data DoSomething needs has to be fetched from memory again.

Again, I'm not sure this is the case; it depends very much on what you are doing in those functions. At first sight, there shouldn't be any difference in performance.
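
Since the outcome depends on the workload, measuring both variants is the practical route. Here is a minimal sketch of how that could look; it is not the asker's code. `parallel_for_chunked` is a hypothetical stand-in for ConcRT's `parallel_for` built on `std::thread`, the bodies of `DoSomething` and `DoSomethingElse` are placeholder arithmetic, and `run_split`, `run_fused` and `median_ms` are names made up for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical stand-in for ConcRT's parallel_for: splits [0, n) into
// contiguous chunks and runs each chunk on its own thread.
template <typename Body>
void parallel_for_chunked(std::size_t n, unsigned workers, Body body) {
    assert(workers > 0);
    std::vector<std::thread> pool;
    const std::size_t chunk = (n + workers - 1) / workers;
    for (unsigned w = 0; w < workers; ++w) {
        const std::size_t begin = std::min(n, w * chunk);
        const std::size_t end = std::min(n, begin + chunk);
        if (begin == end) break;  // no work left for further threads
        pool.emplace_back([=] {
            for (std::size_t i = begin; i < end; ++i) body(i);
        });
    }
    for (auto& t : pool) t.join();
}

// Placeholder workloads (assumption): the real functions are not shown
// in the question. Each call touches only element i.
inline void DoSomething(double& a, double& b) { a += b; }
inline void DoSomethingElse(double& a, double& b) { b += a; }

// Option 1: two parallel loops, one per function.
void run_split(std::vector<double>& a, std::vector<double>& b, unsigned workers) {
    parallel_for_chunked(a.size(), workers,
                         [&](std::size_t i) { DoSomething(a[i], b[i]); });
    parallel_for_chunked(a.size(), workers,
                         [&](std::size_t i) { DoSomethingElse(a[i], b[i]); });
}

// Option 2: one parallel loop whose body calls both functions.
void run_fused(std::vector<double>& a, std::vector<double>& b, unsigned workers) {
    parallel_for_chunked(a.size(), workers, [&](std::size_t i) {
        DoSomething(a[i], b[i]);
        DoSomethingElse(a[i], b[i]);
    });
}

// Runs `run` `repeats` times and returns the median wall-clock time in ms.
// The median is less sensitive to one-off scheduling hiccups than the mean.
template <typename F>
double median_ms(F run, int repeats) {
    std::vector<double> samples;
    for (int r = 0; r < repeats; ++r) {
        const auto t0 = std::chrono::steady_clock::now();
        run();
        const auto t1 = std::chrono::steady_clock::now();
        samples.push_back(
            std::chrono::duration<double, std::milli>(t1 - t0).count());
    }
    std::sort(samples.begin(), samples.end());
    return samples[samples.size() / 2];
}
```

Calling `median_ms([&] { run_split(a, b, workers); }, 20)` and the same with `run_fused`, on data of a realistic size and in an optimized build, gives two numbers to compare. With real workloads the ranking may well differ from what these placeholder bodies show, which is why the measurement has to be done on the actual functions.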
