14

Possible Duplicate:
Is it faster to count down than it is to count up?

Which loop has better performance? I have learned from somewhere that the second is better, but I want to know the reason why.

  for(int i=0;i<=10;i++)
  {
      /* Is this better? */
  }

  for(int i=10;i>=0;i--)
  {
      /* Is this better? */
  }
Rasmi Ranjan Nayak

6 Answers

23

The second "may" be better, because it's easier to compare i with 0 than to compare i with 10 but I think you can use any one of these, because compiler will optimize them.

PhoenixS
  • Why is it easier to compare `i` with 0 than with 10? – 一二三 Jan 23 '12 at 08:30
  • 2
    +1 cutting the newbie some slack, since his answer is on the right lines – Stuart Golodetz Jan 23 '12 at 08:33
  • 11
    Because most processors have an instruction for comparing to zero? So they don't need to load `i` and 10, subtract them, and then test the result, but can just compare `i` to zero immediately. – PhoenixS Jan 23 '12 at 08:36
  • 4
    +1 because he's been -1ed with no reason – Unai Vivi Jan 23 '12 at 08:40
  • 3
    @funnyGraphicsThingy - because an explicit compare is usually not necessary to detect zero as an arithmetic or logical result - the CPU zero flag is already set. That, and many instruction sets offer loop/repeat operations that auto-countdown a register to zero. +1 for OP – Martin James Jan 23 '12 at 10:38
  • What is the difference between using `i++` and `++i`? – I have just found http://stackoverflow.com/questions/467322/is-there-any-performance-difference-between-i-and-i-in-c which answers my question – Luke T O'Brien Sep 02 '16 at 15:09
  • @LukeTO'Brien Post-incrementation – i++ will use i's value in the expression and then increase it by 1. Pre-incrementation will first increase i's value by 1 and then use the result in the expression. – PhoenixS Sep 03 '16 at 07:47
8

I do not think there is much difference between the performance of both loops.

I suppose it becomes a different situation when the loops look like this:

for(int i = 0; i < getMaximum(); i++)
{
}

for(int i = getMaximum() - 1; i >= 0; i--)
{
}

The difference is that the getMaximum() function is called on every iteration of the first loop but only once in the second (assuming it is not an inline function and the compiler cannot hoist the call out of the loop).
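If the call cannot be hoisted automatically, the up-counting form can be made equivalent by caching the bound yourself. A minimal sketch, assuming getMaximum() is the same non-inline function as above:

int getMaximum();   // assumed non-inline, as in the loops above

void loopUpCached()
{
    const int max = getMaximum();   // evaluated once, not on every iteration
    for (int i = 0; i < max; i++)
    {
        // loop body
    }
}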

Veger
6

Loops that decrement down to zero can sometimes be faster if testing against zero is optimised in hardware. But it's a micro-optimisation, and you should profile to see whether it's really worth doing. The compiler will often make the optimisation for you, and given that the decrementing loop is arguably a worse expression of intent, you're often better off just sticking with the 'normal' approach.
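If you do want to check in a given context, a rough timing sketch along the following lines can serve as a starting point (the doWork() function, the iteration count and the volatile sink are placeholders, not taken from the question):

#include <chrono>
#include <iostream>

volatile int sink = 0;                  // volatile write keeps the loops from being optimized away

void doWork(int i) { sink = sink + i; }

int main()
{
    using clock = std::chrono::steady_clock;
    using std::chrono::duration_cast;
    using std::chrono::microseconds;

    auto t0 = clock::now();
    for (int i = 0; i <= 10000000; i++) doWork(i);      // counting up

    auto t1 = clock::now();
    for (int i = 10000000; i >= 0; i--) doWork(i);      // counting down

    auto t2 = clock::now();
    std::cout << "up:   " << duration_cast<microseconds>(t1 - t0).count() << " us\n"
              << "down: " << duration_cast<microseconds>(t2 - t1).count() << " us\n";
}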

Stuart Golodetz
5

Incrementing and decrementing (INC and DEC, when translated into assembler commands) have the same speed of 1 CPU cycle.

However, the second can theoretically be faster on some architectures (e.g. SPARC) because no 10 has to be fetched from memory (or cache): most architectures have instructions that handle comparisons with the special value 0 in an optimized fashion (usually a special hardwired zero register to use as an operand), so no register has to be "wasted" to store the 10 for each iteration's comparison.

A smart compiler (especially if the target instruction set is RISC) will detect this itself and (if your counter variable is not used inside the loop) apply the second "decrement down to 0" form.

Please see answers https://stackoverflow.com/a/2823164/1018783 and https://stackoverflow.com/a/2823095/1018783 for further details.
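As an illustration of that transformation (a sketch only; doWork() is a placeholder, and real compilers differ), a loop whose counter is not used in the body can be rewritten as a countdown:

void doWork();   // placeholder for whatever the loop body does

void example()
{
    // Original up-counting loop; the value of i is never used in the body.
    for (int i = 0; i < 10; ++i)
    {
        doWork();
    }

    // What an optimizing compiler may effectively turn it into:
    // only the trip count matters, and testing against zero is cheap.
    for (int n = 10; n != 0; --n)
    {
        doWork();
    }
}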

Unai Vivi
  • Speaking of the execution time of one instruction doesn't make sense on a superscalar architecture. Desktop processors have been superscalar for about the last 15 years (the Pentium in 1993 was superscalar). – AProgrammer Jan 23 '12 at 08:54
  • Not all modern hardware is pipelined. Embedded processors will have much more relevance to this sort of micro-optimization. Furthermore, the OP's question didn't address modern desktop processors in particular, so I believe the answer should be generic and also address 20-year-old processor designs (which oftentimes still serve as embedded CPUs). – Unai Vivi Jan 23 '12 at 09:06
  • The problem is that the answer is "it depends". It depends on whether the processor ISA has better support for one kind or the other, whether the micro-architecture is able to take advantage of the ISA support, and whether the compiler is able to take advantage as well (and doesn't instead take counter-measures such as keeping the 10 in a register, or, seeing that the loop executes a known small number of times, unrolling it entirely or using an auxiliary decremented counter). There are far too many variables. Measuring, and measuring in context, is the only answer, and that is valid for all applications. – AProgrammer Jan 23 '12 at 09:32
  • You are right. However, given that no combination of hardware type, compiler type or version has been specified, I'd rather point out something generic that is faster on a few such combinations (while being slower on none) than focus on the fact that, unquestionably, it makes little sense to do these out-of-context micro-benchmarks in the first place. – Unai Vivi Jan 23 '12 at 09:42
2

The compiler should optimize both loops to the same assembly, so it doesn't make a difference; both take the same time.

A more valid discussion would be whether

  for(int i=0;i<10;++i)  //preincrement
  {
  }

would be faster than

  for(int i=0;i<10;i++)  //postincrement
  {
  }

Because, theoretically, post-increment does an extra operation (it returns a copy of the old value). However, even this should be optimized to the same assembly for a built-in type like int (see the sketch after the disassembly below for where it can matter).

Without optimizations, the code would look like this:

   for ( int i = 0; i < 10 ; i++ )
0041165E  mov         dword ptr [i],0 
00411665  jmp         wmain+30h (411670h) 
00411667  mov         eax,dword ptr [i] 
0041166A  add         eax,1 
0041166D  mov         dword ptr [i],eax 
00411670  cmp         dword ptr [i],0Ah 
00411674  jge         wmain+68h (4116A8h) 

   for ( int i = 0; i < 10 ; ++i )
004116A8  mov         dword ptr [i],0 
004116AF  jmp         wmain+7Ah (4116BAh) 
004116B1  mov         eax,dword ptr [i] 
004116B4  add         eax,1 
004116B7  mov         dword ptr [i],eax 
004116BA  cmp         dword ptr [i],0Ah 
004116BE  jge         wmain+0B2h (4116F2h) 

   for ( int i = 9; i >= 0 ; i-- )
004116F2  mov         dword ptr [i],9 
004116F9  jmp         wmain+0C4h (411704h) 
004116FB  mov         eax,dword ptr [i] 
004116FE  sub         eax,1 
00411701  mov         dword ptr [i],eax 
00411704  cmp         dword ptr [i],0 
00411708  jl          wmain+0FCh (41173Ch) 

so even in this case, the speed is the same.
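For built-in int both forms optimize to the same code, but for class types with a user-defined operator++ (iterators, for example) post-increment has to construct a copy. A minimal sketch (the BigIterator type is made up for illustration):

struct BigIterator
{
    int data[100];
    int pos = 0;

    BigIterator& operator++()       // pre-increment: modify and return *this
    {
        ++pos;
        return *this;
    }

    BigIterator operator++(int)     // post-increment: copy, modify, return the old value
    {
        BigIterator old = *this;    // extra copy of the whole object
        ++pos;
        return old;
    }
};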

Luchian Grigore
1

Again, the answer to all micro-performance questions is measure, measure in context of use and don't extrapolate to other contexts.

Estimating run time by counting instruction execution times hasn't been possible without extraordinary sophistication for quite a long time.

The mismatch between processor and memory speed, and the introduction of caches to hide part of the latency (but not the bandwidth), make the execution time of a group of instructions very sensitive to memory access patterns. That is something you can still optimize for with quite high-level thinking. But it also means that something that looks worse when one doesn't take the memory access pattern into account can turn out better once it is taken into account.

Then superscalar execution (the fact that the processor can do several things at once) and out-of-order execution (the fact that the processor can execute an instruction before a previous one in the flow) make basic counting meaningless even if you ignore memory access. You have to know which instructions need to be executed (so ignoring part of the structure isn't wise) and how the processor can group instructions if you want a good a priori estimate.

AProgrammer