5

I have found that it is sometimes faster to divide one loop into two or more:

for (i=0; i<AMT; i++) {
    a[i] += c[i];
    b[i] += d[i];
}
     ||
     \/
for (i=0; i<AMT; i++) {
    //a[i] += c[i];
    b[i] += d[i];
}
for (i=0; i<AMT; i++) {
    a[i] += c[i];
    //b[i] += d[i];
}

On my desktop (Windows 7, AMD Phenom(tm) X6 1055T), the two-loop version runs faster, taking about one third less time.
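
For reference, a minimal self-contained version of my test looks roughly like this (a sketch: AMT, the int arrays, the repeat count, and the clock()-based timing are illustrative choices, not my exact code):

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define AMT  (1 << 22)   /* 4M elements per array; an illustrative size */
#define REPS 10          /* repeat the work so the timing is measurable */

int main(void)
{
    int *a = calloc(AMT, sizeof *a);
    int *b = calloc(AMT, sizeof *b);
    int *c = calloc(AMT, sizeof *c);
    int *d = calloc(AMT, sizeof *d);
    int i, r;
    clock_t t0;

    if (!a || !b || !c || !d)
        return 1;

    t0 = clock();
    for (r = 0; r < REPS; r++)
        for (i = 0; i < AMT; i++) {      /* one fused loop */
            a[i] += c[i];
            b[i] += d[i];
        }
    printf("one loop:  %.3fs\n", (double)(clock() - t0) / CLOCKS_PER_SEC);

    t0 = clock();
    for (r = 0; r < REPS; r++) {
        for (i = 0; i < AMT; i++)        /* split into two loops */
            b[i] += d[i];
        for (i = 0; i < AMT; i++)
            a[i] += c[i];
    }
    printf("two loops: %.3fs\n", (double)(clock() - t0) / CLOCKS_PER_SEC);

    printf("%d\n", a[123] + b[456]);     /* observe results so the loops stay */

    free(a); free(b); free(c); free(d);
    return 0;
}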

But if I am dealing with assignment,

for (i=0; i<AMT; i++) {
    b[i] = rand()%100;
    c[i] = rand()%100;
}

dividing the assignments to b and c into two loops is no faster than doing them in one.

I think that the OS uses some rules to determine whether certain code can be run by multiple processors.

I want to ask if my guess is right, and if it is: under what rules or in what situations will multiple processors be used automatically (without thread programming) to speed up my programs?

Robert Bean
  • This is a question about the CPU cache. Here is an article about it: http://lwn.net/Articles/252125/ – MYMNeo Apr 02 '13 at 06:31
  • I believed that running a single-threaded application on multiple cores was not possible; however, here is a link that challenged my belief: http://www.axceleon.com/info/AxceleonIntelSolution_Profile.pdf – Kinjal Patel Apr 02 '13 at 06:33
  • Thanks for the links, I'm reading. – Robert Bean Apr 02 '13 at 06:38
  • The cache comment still applies to the second example, but I'm guessing that `rand()` is sufficiently slow for it not to matter? – Alex Chamberlain Apr 02 '13 at 06:39
  • possible duplicate of [Why is one loop so much slower than two loops?](http://stackoverflow.com/questions/8547778/why-is-one-loop-so-much-slower-than-two-loops) – Mysticial Apr 02 '13 at 06:53
  • 2
    Note that the example and the performance numbers in this question match those of the linked duplicate extremely closely. So while I'm not 100% sure it's the same cause, they do seem to be pointing in the same direction. So it's not because of automatic parallelization. It's about alignment and having too many access streams. The reason why you don't see a difference in the second case is because `rand()` is an expensive operation that covers up the initial problem. – Mysticial Apr 02 '13 at 06:54
  • I just tried removing the rand() part and simply assigning the a and b arrays with 100; this time, dividing the loop into two even slows the program down. So I think my guess is wrong. I'm reading and learning about the cache and aliasing things brought up in the various questions so that I can understand this. – Robert Bean Apr 02 '13 at 07:24
  • I think here the case is even stronger. In addition to the aliasing problems (which may be different on AMD processors), the Phenom you have has only 2-way L1 [cache associativity](http://en.wikipedia.org/wiki/CPU_cache#Associativity). So if all 4 of your arrays are large and freshly allocated, it is likely they will all align to the same "cache way" and fight for the same two slots. – Mysticial Apr 02 '13 at 07:27

3 Answers

4

It is possible that your compiler is vectorizing the simpler loops. In the assembler output you would see this as the compiled program using SIMD instructions (like Intel's SSE) to process larger chunks of data than one number at a time. Automatic vectorization is a hard problem, and it's plausible that the compiler could not vectorize the loop that updates both a and b at the same time. This would partially explain why breaking the complex loop into two is faster.
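
For illustration, here is roughly what a hand-vectorized version of one of the simpler loops could look like with SSE2 intrinsics. This is only a sketch of the idea, not what any particular compiler emits; it assumes int arrays, an AMT divisible by 4, and a GCC-style alignment attribute:

#include <emmintrin.h>   /* SSE2 intrinsics */

#define AMT 4096         /* must be a multiple of 4 for this sketch */

/* 16-byte alignment so the aligned loads/stores below are legal
   (GCC-style attribute; MSVC would use __declspec(align(16))) */
static int a[AMT] __attribute__((aligned(16)));
static int c[AMT] __attribute__((aligned(16)));

void add_vectorized(void)
{
    int i;
    for (i = 0; i < AMT; i += 4) {
        __m128i va = _mm_load_si128((__m128i *)&a[i]); /* 4 ints from a */
        __m128i vc = _mm_load_si128((__m128i *)&c[i]); /* 4 ints from c */
        va = _mm_add_epi32(va, vc);                    /* 4 adds at once */
        _mm_store_si128((__m128i *)&a[i], va);         /* store back to a */
    }
}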

In the "assignment" loops, each invocation to rand() depends on the output of the previous invocations, which means that vectorization is inherently impossible. Breaking the loop into two would not make it benefit from SIMD instructions like in the first case, so you wouldn't see it run any faster. Looking at the assembler code the compiler generates would tell you what optimizations the compiler performed and what instructions it used.

Even if the compiler is vectorizing the loop, the program is not using more than one CPU or thread; there is no concurrency. What happens is that the single CPU runs the single thread of execution on multiple data points in parallel. The distinction between parallel and concurrent programming is subtle but important.

Cache locality might also explain why breaking the first loop into two makes it run faster, but not why breaking the "assignment" loop into two doesn't. It is possible that b and c in the "assignment" loop are small enough to fit into the cache, in which case the loop already has optimal performance and breaking it further brings no benefit. If this were the case, making b and c larger would force the loop to start thrashing the cache, and breaking the loop into two would have the expected benefit.
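
One way to probe this hypothesis is to compare the fused and split assignment loops at growing array sizes and watch whether splitting starts to pay off once the arrays outgrow the cache. A minimal sketch, with the sizes and the clock()-based timing as illustrative assumptions:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

static double elapsed(clock_t t0)
{
    return (double)(clock() - t0) / CLOCKS_PER_SEC;
}

int main(void)
{
    size_t n;
    for (n = 1 << 12; n <= (size_t)1 << 24; n <<= 2) {
        int *b = malloc(n * sizeof *b);
        int *c = malloc(n * sizeof *c);
        size_t i;
        clock_t t0;
        double fused, split;

        if (!b || !c)
            return 1;

        t0 = clock();
        for (i = 0; i < n; i++) {        /* fused assignments */
            b[i] = rand() % 100;
            c[i] = rand() % 100;
        }
        fused = elapsed(t0);

        t0 = clock();
        for (i = 0; i < n; i++)          /* split assignments */
            b[i] = rand() % 100;
        for (i = 0; i < n; i++)
            c[i] = rand() % 100;
        split = elapsed(t0);

        printf("n = %9lu: fused %.3fs, split %.3fs\n",
               (unsigned long)n, fused, split);
        free(b);
        free(c);
    }
    return 0;
}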

Joni
  • I am not sure I follow the "just copying a block of memory" bit. – NPE Apr 02 '13 at 06:51
  • No, there's no copying at all. The use (or not) of SIMD instructions has nothing to do with this. Most compilers these days will prefer SIMD instructions over legacy floating-point instructions. The Intel compiler is one of the few compilers that can vectorise code such as this so that it uses SIMD to its maximum potential. The rest will just use the scalar versions of those instructions. – Apr 02 '13 at 07:05
  • Sorry, I misread the += as =, so the memory is not being copied in the first examples. – Joni Apr 02 '13 at 07:34
  • In the first case the compiler could be unsure whether it can use SIMD to do multiple additions at a time. For example, what if d and a point to the same memory? – Lefteris E Apr 02 '13 at 08:02
  • I've updated the answer to correct the misunderstanding and hopefully explain myself better :) – Joni Apr 02 '13 at 15:52
2

This kind of optimization is done by the compiler (see http://en.wikipedia.org/wiki/Loop_optimization). If you are using GCC, check this page for the list of available optimization options: http://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html

On the other hand, note that you are using the rand() function, which consumes a lot of CPU time.
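
To take rand() out of the measurement, you could substitute a cheap inline generator and re-run the comparison. A minimal sketch; the xorshift32 function here is an illustrative stand-in, not a drop-in replacement for rand():

#include <stdint.h>

/* Marsaglia's xorshift32: a few cycles per call, far cheaper than rand().
   Illustrative only; it is not a statistical replacement for rand(). */
static uint32_t xorshift32(uint32_t *state)
{
    uint32_t x = *state;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    return *state = x;
}

void fill(int *b, int *c, int n)
{
    uint32_t s = 2463534242u;    /* any nonzero seed */
    int i;
    for (i = 0; i < n; i++) {
        b[i] = xorshift32(&s) % 100;
        c[i] = xorshift32(&s) % 100;
    }
}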

Bechir
  • I agree that splitting the loop is an optimization (loop fission, http://en.wikipedia.org/wiki/Loop_fission), but are you sure the compiler is doing the optimization? The OP seemed to benefit from doing this optimization by hand (at least in the first example in the question)... – maditya Apr 02 '13 at 06:42
  • The optimization can be done by hand, as in the OP's example, and the compiler can do it as well (at least GCC supports it). For more detail about GCC optimization options, please refer to this page: http://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html – Bechir Apr 02 '13 at 07:19
  • From a low-level point of view, Mysticial has explained this in detail in the similar question (sorry for the duplicate), though I am currently unable to understand the cache and aliasing things. Bechir's answer and maditya's comment do explain things at a higher level: it's the fission and fusion of a loop. However, to understand why fission speeds things up in my case, I still need to study Mysticial's answer in the other post. Thanks to all of you; I've got a ton of pages to read now :-) – Robert Bean Apr 02 '13 at 08:09
  • The main things Mysticial pointed you at are "locality of reference in loop fission" (from http://en.wikipedia.org/wiki/Loop_transformation) in the first case, and that "rand() is an expensive operation that covers up the initial problem" in the second. – SChepurin Apr 02 '13 at 08:35
  • Yes, I've been trying to learn why loop fission can improve locality of reference... I'm lucky I came to this place; now I know how much more I still need to learn to be a good programmer. – Robert Bean Apr 02 '13 at 08:47
  • After reading the page that MYMNeo recommended, I think I've got the idea now. The CPU core loads from main memory through the cache, and the cache normally loads the target data together with its neighbours. If the a and b arrays are processed alternately (assuming only one cache line), then more cache loads and evictions are needed (hope those are the right words: load, evict :) ). – Robert Bean Apr 03 '13 at 01:43
0

I want to ask if my guess is right, and if it is: under what rules or in what situations will multiple processors be used automatically (without thread programming) to speed up my programs?

No, the guess is not right. In all three cases the code is run on a single core.

It is for some other reason that splitting the first loop into two makes it faster. Perhaps your compiler is able to generate better code, or the CPU is having an easier time prefetching the right data, etc. It is hard to tell without analysing the generated machine code.
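
One way to start that analysis is to put each version in its own function and inspect the compiler's output. A minimal sketch; the GCC and MSVC switches mentioned in the comment are the usual ones for producing assembly listings:

/* loops.c: one function per variant, so the generated code is easy to
   compare. With GCC, "gcc -O3 -S loops.c" writes assembly to loops.s;
   MSVC can produce a listing with /FA. */
#define AMT 100000
int a[AMT], b[AMT], c[AMT], d[AMT];

void fused(void)
{
    int i;
    for (i = 0; i < AMT; i++) {
        a[i] += c[i];
        b[i] += d[i];
    }
}

void split(void)
{
    int i;
    for (i = 0; i < AMT; i++)
        b[i] += d[i];
    for (i = 0; i < AMT; i++)
        a[i] += c[i];
}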

NPE
  • I'm inclined to agree (although I'm way out of my depth here). But curiously, in this article (http://en.wikipedia.org/wiki/Loop_fission), which describes the technique the OP is applying by hand, there is a statement that "[this] optimization is most efficient in multi-core processors that can split a task into multiple tasks for each processor". Are you saying the guess is not right in general, or just for the particular processor the OP mentioned? – maditya Apr 02 '13 at 06:47
  • That bit of that article is wrong. Loop fission can be useful as a precursor to splitting the two loops into separate threads, but no multi-core processor I've ever seen can detect this on its own. –  Apr 02 '13 at 06:50