3

Somebody asked me a question: Which one is the fastest among the below two scenario:

Case 1: assume int count = 0;

for (int i = 0; i < 10; i++)
{
    for (int j = 0; j < 5; j++)
    {
        count++;
    }
}

Case 2: assume int count = 0;

for (int i = 0; i < 5; i++)
{
    for (int j = 0; j < 10; j++)
    {
        count++;
    }
}

In both the scenario's the final valuse of count will be 50. But I am not sure which one will be faster? I think CASE II is faster but not sure...

It will be great if anyone can throw some light on it. which is faster and why?

Jatin
  • 1,857
  • 4
  • 24
  • 37
  • 6
    Probably exactly the same after the compiler is done with the code. The only way to know is to measure it. – juanchopanza Aug 16 '12 at 12:02
  • 7
    Both of these trivial loops are likely to be replaced with `count = 50` by the compiler. – Maxim Egorushkin Aug 16 '12 at 12:02
  • I agree. Most likely neither are measurably faster and they may get optimized to the same exact code depending on the compiler. – drescherjm Aug 16 '12 at 12:03
  • 4
    **What is the purpose of the question?** Is it a simplified production code? Is it a HW question? Is it for educational purposes? Providing an answer for this will yield better focused answers. – amit Aug 16 '12 at 12:05
  • 2
    You should have benchmarked both loops and posted your results as part of this question. As it stands, it does not make any sense. – Björn Pollex Aug 16 '12 at 12:06
  • @AMIT: I asked this question so that I can clear my doubt about the looping concepts. If compiler is going to take different time for both the cases, then we can certainly speed-up the computation by using the method which gives us faster computation. – Jatin Aug 16 '12 at 12:09
  • In this link: http://stackoverflow.com/questions/2550827/efficiency-of-nested-loo , the user is trying to print the time taken by each cases(in JAVA), I dont know what API will allow to do me the same thing in C/C++/ So that I can test by myself the time it takes for both the cases. – Jatin Aug 16 '12 at 12:12
  • A Down-vote just for trying to clear my doubt !!!!.... :-( – Jatin Aug 16 '12 at 12:14
  • @VikasChhipa For a simple loop as you show in the example it really doesn't matter. As already stated, the compiler just sets count = 50. Loop execution speed will depend on what is in the body of the loop in more complicated cases. – Wes Aug 16 '12 at 12:14
  • @WES: okay..i understood that in case of simply incrementing a counter variable inside the body, the compiler just does optimization and gives 50. But which one will be faster if the body of the loop is doing something complex computations. which one will be the faster, and why? – Jatin Aug 16 '12 at 12:22
  • @VikasChhipa If the loop is doing any complex computations, the time spent in loop management will be negligible, so who cares. – James Kanze Aug 16 '12 at 12:43
  • @James: I agree sir that WHO CARES if we are already doing some complex computation. But...I've to tell either case 1 or 2 along with my reason. :-( Anywayz...thanks for the information :-) – Jatin Aug 16 '12 at 12:49
  • If you are using some buffer onside (like image) going first by rows and then by columns will be significantly faster. – Dimitar Slavchev Aug 16 '12 at 14:18

5 Answers5

10

this is the only example i can think of where it matters which variable you iterate where

int array[n][m];
//fast
for (int i=0;i<n;i++)
{ for(int j=0;j<m;j++)
  { count+=array[i][j];
  }
}
//slow
for (int i=0;i<m;i++)
{ for(int j=0;j<n;j++)
  { count+=array[j][i];
  }
}

the second one is slower because you don't iterate the locations in memory one after the other, but because you jump by m locations at a time. the processor caches the memory locations positioned immediately after an accessed location.

titus
  • 5,512
  • 7
  • 25
  • 39
  • 1
    More info on this subject can be found in [this thread](http://stackoverflow.com/questions/9888154/which-of-these-two-for-loops-is-more-efficient-in-terms-of-time-and-cache-perfor) – amit Aug 16 '12 at 12:08
  • Note that unlike the original question, this deals with access to memory orders. The difference in performance is not because the outer/inner loops iterate more times but the order in which the elements inside the 2D matrix are accessed. – David Rodríguez - dribeas Aug 16 '12 at 12:22
  • This answers a different question. @Vikas is asking about the effects of the loops themselves, not memory access within them. – hcarver Aug 16 '12 at 12:35
4

Okay, tested this on my system. With full optimization, the compiler just made count = 50, with no questions asked. Without optimization, the second version usually was the slightest bit faster, but it was completely negligible.

The disassembly: Both loops have the precisely same code, except the compares are once with 100, once with 50 (I buffed the numbers up a bit to allow for longer execution time)

    for(int i = 0; i< 100; i++) {
00F9140B  mov         dword ptr [i],0  
00F91412  jmp         main+5Dh (0F9141Dh)  
00F91414  mov         eax,dword ptr [i]  
00F91417  add         eax,1  
00F9141A  mov         dword ptr [i],eax  
00F9141D  cmp         dword ptr [i],64h  
00F91421  jge         main+88h (0F91448h)  

        for(int j = 0; j< 50; j++)
00F91423  mov         dword ptr [j],0  
00F9142A  jmp         main+75h (0F91435h)  
00F9142C  mov         eax,dword ptr [j]  
00F9142F  add         eax,1  
00F91432  mov         dword ptr [j],eax  
00F91435  cmp         dword ptr [j],32h  
00F91439  jge         main+86h (0F91446h)  
        {
            count++;
00F9143B  mov         eax,dword ptr [count]  
00F9143E  add         eax,1  
00F91441  mov         dword ptr [count],eax  
        }
00F91444  jmp         main+6Ch (0F9142Ch)  
    }
00F91446  jmp         main+54h (0F91414h)  

The only difference between big loop outside, small loop inside, and small loop inside, and big loop outside is how often you have to do the jump from

00F91439  jge         main+86h (0F91446h)  
to
00F91446  jmp         main+54h (0F91414h)  

And the initialization for the loop variables:

00F91423  mov         dword ptr [j],0  
00F9142A  jmp         main+75h (0F91435h)  

for every new loop, while skipping below part.

00F9142C  mov         eax,dword ptr [j]  
00F9142F  add         eax,1  
00F91432  mov         dword ptr [j],eax  

Additional commands with each iteration of the inner loop: mov, add, mov, but no mov / jmp

Additional commands for each inner loop initialized: mov, jmp, and more often getting the JGE true.

Thus if you run the inner loop 50 times, you will have that JGE only come true 50 times, and thus do 50 jumps there, while with the inner loop running 100 times, you will have to jump 100 times. That's the ONLY difference in the code. With this case it's hardly any difference, and most of the times you will run into your memory access being the thing causing a slowdown a LOT more than your loop ordering. Only exception: if you know you can order your loops properly to avoid branch prediction. So two things are worthy of ordering your loop one way or the other:

-memory access

-branch prediction

For everything else the impact is completely negligible.

SinisterMJ
  • 3,425
  • 2
  • 33
  • 53
  • Thanks for testing it and letting me know about the result. I have understood that if compiler is allowed optimizing, then it will replace the nested for loops with count = 50. But I am more interested in knowing the reason for the case where no compiler optimization happens....and as you said, CASE II turns out to be the faster case. Can you please tell me WHY 2nd CASE is faster? – Jatin Aug 16 '12 at 12:27
  • Edited my original answer to better represent why one case is faster, and when to actually really worry about something like this. – SinisterMJ Aug 16 '12 at 21:31
1

Actually, it is in general not a good idea to try to optimize yourself for such details, because mostly the compiler is much better at it (as long as the algorithm is not changed).

The number of loops is equal.

It might be that the second is a bit faster, because the initialization of the second loop only happens 5 times instead of 10, but I doubt that would really gain some noticable change.

The best way is to use a profiler or even better: analyze the generated assembly code.

Michel Keijzers
  • 15,025
  • 28
  • 93
  • 119
1

If you really want to know, make your outer loop a function with a single parameter of the maximum loop count (5 or 10), your inner loop a function which takes a single parameter of the maximum inner loop count (10 or 5), then compile without any optimisation, run both a million times, and time them.

With any optimisation at all, your compiler will inline the functions, and expand the loops, and calculate count as part of the optimisation process. My guess is that they'll do exactly the same amount of work, and you'll see exactly the same times.

hcarver
  • 7,126
  • 4
  • 41
  • 67
  • 2
    Timing unoptimized code is like timing Usain Bolt walking - he's not fast at all?! – Bo Persson Aug 16 '12 at 12:07
  • @BoPersson I totally agree. But if the questioner really wants to know, this will at least let them find out. Plus if the 5 and 10 weren't hard-coded, the answer is less trivial. Timing the unoptimised code would let us guess at whether we could beat Bolt in a marathon (we probably still couldn't but it's fun to find out). – hcarver Aug 16 '12 at 12:13
  • My point was that the info is uninteresting. If you use option `-O0` to tell the compiler "don't make the code fast", then why measure it. – Bo Persson Aug 16 '12 at 12:23
  • The only reason the compiler can optimise the loops because the `5` and `10` are known at compile-time. If they're not known, then it's a potentially interesting result to know which loop ordering is faster. I *think* what I describe in the answer is a way of finding it out. – hcarver Aug 16 '12 at 12:27
  • At a guess, the only cost that changes will be loop initialisation, so you want the longer-count loop inside the smaller-count loop. This is slightly different to what I said in the answer, but I've had more time to think about it now :) – hcarver Aug 16 '12 at 12:28
  • @Bo Persson: please don't go by hard-coded values of 5 and 10. My intent to know which nested loop will be faster??? X & Y values can be anything in millions and are not known at compile time. – Jatin Aug 16 '12 at 12:32
  • @Vikas - There isn't a general answer, compilers are **much** smarter than most people believe and can do huge transformations on the code. See this [other answer](http://stackoverflow.com/a/11639305/597607) where 10+ lines of function calls are inlined to just 5 machine instructions. – Bo Persson Aug 16 '12 at 12:38
  • 1
    @BoPersson True but not necessarily relevant -- there's no way for a compiler to unroll loops where the loop limit isn't known at compile time. His trivial example could well be optimised to `count += max(0, ij)`, but his question is an interesting one if the outer loop does a small calculation and the contents of the inner loop are non-trivial. You can imagine some pretty minor modifications to the question and you get a situation where the code can't be optimised and the answer isn't as obvious as many posters seem to have assumed. – hcarver Aug 16 '12 at 12:43
  • @Hbcdev Not just loop initialization. For n, m (where m is the count of the inner loop), you do `n*m` tests in the inner loop, and `n` initializations and `n` tests in the outer loop. If there is a very great difference in the two, *and* you the loop operation is very small, *and* the locality issues mentioned elsewhere don't intervene, then using the smaller value for the outer loop might be marginally faster. (In practice, I doubt you'll encounter a real case where it makes a difference. Unlike the locality issues.) – James Kanze Aug 16 '12 at 12:49
1

Have you actually tried it?

I would assume theoretically Case II would be marginally faster since there would only be half of the stack-based variable creation / destruction in the outer loop than there would be for Case I, but even better (and clearer) would be:

for(int k = 0; k < 50; k++)
{
     count++;
}

To be precise, your examples are so abstracted that the answer is probably of little use. It very much depends on the context.

Component 10
  • 10,247
  • 7
  • 47
  • 64