TL;DR: the first loop runs ~18% faster on a Haswell CPU. Why? The loops are gcc -O0 (un-optimized) output for ptr++ vs ++ptr, but the question is why the resulting asm performs differently, not anything about how to write better C.
Let's say we have these two loops:
movl $0, -48(%ebp) //Loop counter set to 0
movl $_data, -12(%ebp) //Pointer to the data array
movl %eax, -96(%ebp)
movl %edx, -92(%ebp)
jmp L21
L22:
// ptr++
movl -12(%ebp), %eax //Get the current address
leal 4(%eax), %edx //Calculate the next address
movl %edx, -12(%ebp) //Store the new (next) address
// rest of the loop is the same as the other
movl -48(%ebp), %edx //Get the loop counter to edx
movl %edx, (%eax) //Move the loop counter value to the CURRENT address; note that -12(%ebp) already contains the next one
addl $1, -48(%ebp) //Increase the counter
L21:
cmpl $999999, -48(%ebp)
jle L22
and the second one:
movl %eax, -104(%ebp)
movl %edx, -100(%ebp)
movl $_data-4, -12(%ebp) //Address of the data - 1 element (4 bytes)
movl $0, -48(%ebp) //Set the loop counter to 0
jmp L23
L24:
// ++ptr
addl $4, -12(%ebp) //Calculate the CURRENT address by adding one element (sizeof(int) == 4 bytes)
movl -12(%ebp), %eax //Store in eax the address
// rest of the loop is the same as the other
movl -48(%ebp), %edx //Store in edx the current loop counter
movl %edx, (%eax) //Move the loop counter value to the current stored address location
addl $1, -48(%ebp) //Increase the loop counter
L23:
cmpl $999999, -48(%ebp)
jle L24
These loops do exactly the same thing, just in a slightly different manner; please refer to the comments for the details.
This asm code is generated from the following two C++ loops:
//FIRST LOOP:
for (; index < size; index++) {
    *(ptr++) = index;
}

//SECOND LOOP:
ptr = data - 1;
for (index = 0; index < size; index++) {
    *(++ptr) = index;
}
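For context, these are the declarations I'm assuming around the loops (the names data, ptr, index and size come from the snippets above; the concrete types and the size of 1,000,000, matching the cmpl $999999 bound in the asm, are my guess):
int  data[1000000];   // backing array, 1,000,000 ints to match cmpl $999999
int *ptr   = data;    // pointer written through in both loops, kept at -12(%ebp)
int  index = 0;       // loop counter, kept at -48(%ebp)
int  size  = 1000000; // upper bound of the loop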
Now, the first loop is about ~18% faster than the second one: no matter in which order the loops are executed, the one with ptr++ is faster than the one with ++ptr.
To run my benchmarks I just collected the running times of these loops for different sizes, executing them both nested inside outer loops so that the operation is repeated many times.
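Roughly, the harness looks like the sketch below; only the measured loop bodies come from the code above, while the repetition count, the use of std::chrono and the printing are my assumptions:
#include <chrono>
#include <cstdio>

static int data[1000000];

int main() {
    const int size = 1000000;
    const int repetitions = 100;                 // assumed repetition count

    auto t0 = std::chrono::steady_clock::now();
    for (int rep = 0; rep < repetitions; rep++) {
        int *ptr = data;                         // first loop: ptr++
        for (int index = 0; index < size; index++) {
            *(ptr++) = index;
        }
    }
    auto t1 = std::chrono::steady_clock::now();

    for (int rep = 0; rep < repetitions; rep++) {
        int *ptr = data - 1;                     // second loop: ++ptr
        for (int index = 0; index < size; index++) {
            *(++ptr) = index;
        }
    }
    auto t2 = std::chrono::steady_clock::now();

    using ms = std::chrono::duration<double, std::milli>;
    std::printf("ptr++: %f ms\n", ms(t1 - t0).count());  // compiled with gcc -O0
    std::printf("++ptr: %f ms\n", ms(t2 - t1).count());
    return 0;
}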
ASM analysis
Looking at the ASM code, the second loop contains fewer instructions: we have 3 movl and 2 addl, whereas in the first loop we have 4 movl, one addl and one leal. So the first loop has one movl more, and a leal instead of an addl.
Is it correct that the LEA operation for computing the new address is much faster than the ADD (+4) method? Is this the reason for the difference in performance?
As far as I know, once a new address is calculated, some clock cycles must elapse before the memory can be referenced, so the second loop, after the addl $4, -12(%ebp), needs to wait a little before proceeding, whereas in the first loop we can reference the memory immediately while leal computes the next address in the meantime (some kind of better pipeline performance here).
Is there some reordering going on here? I'm not sure about my explanation for the performance difference between these loops; can I have your opinion?