
Consider the following test programs:

Loop value on the stack

int main( void ) {
    int iterations = 1000000000;

    while ( iterations > 0 )
        -- iterations;
}

Loop value on the stack (dereferenced)

int main( void ) {
    int iterations = 1000000000;
    int * p = & iterations;

    while ( * p > 0 )
        -- * p;
}

Loop value on the heap

#include <stdlib.h>

int main( void ) {
    int * p = malloc( sizeof( int ) );
    * p = 1000000000;

    while ( * p > 0 )
        -- * p;
}

By compiling them with -O0, I get the following execution times:

case1.c
real    0m2.698s
user    0m2.690s
sys     0m0.003s

case2.c
real    0m2.574s
user    0m2.567s
sys     0m0.000s

case3.c
real    0m2.566s
user    0m2.560s
sys     0m0.000s

[edit] Following are the averages over 10 executions:

case1.c
2.70364

case2.c
2.57091

case3.c
2.57000

Why is the execution time larger for the first test case, which seems to be the simplest?

My current architecture is an x86 virtual machine (Archlinux). I get these results both with gcc (4.8.0) and clang (3.3).

[edit 1] The generated assembly is almost identical, except that the second and third cases contain more instructions than the first.

[edit 2] These results are reproducible (on my system): each execution has the same order of magnitude.

[edit 3] I don't really care about the performance of an unoptimized program, but I don't understand why it would be slower, and I'm curious.

Maël Nison
  • Did you try looking at the generated code? Why do you care about performance of unoptimized code anyway? – Carl Norum Jul 16 '13 at 23:00
  • Have you tried running them in a different order? Are these single-shot times, or averages over a significant number of runs? – user207421 Jul 16 '13 at 23:03
  • @CarlNorum Almost the same generated code, except that there are more instructions (move & load) in the last two examples. I don't care about performance, but I'm still curious :) – Maël Nison Jul 16 '13 at 23:11
  • @EJP These results are reproducible. Each execution has the same order of magnitude (tested multiple times). – Maël Nison Jul 16 '13 at 23:12
  • @MaëlNison can you post the asm code to make our lives easier – aaronman Jul 16 '13 at 23:14
  • @aaronman I've added the generated code in my question. – Maël Nison Jul 16 '13 at 23:18
  • Agree with @interjay, there's not much to learn here. Any real differences will be revealed by looking at the ASM. – Oliver Charlesworth Jul 16 '13 at 23:19
  • Aside: please [don't typecast the return value from `malloc`](http://stackoverflow.com/questions/605845/do-i-cast-the-result-of-malloc). – Philip Kendall Jul 16 '13 at 23:19
  • Wouldn't optimized code be "meaningless" or at least compiler- or platform-specific? – Alex Reynolds Jul 16 '13 at 23:19
  • @interjay I don't understand why it would be meaningless. Even with optimisation disabled, I don't understand why more instructions would mean faster. – Maël Nison Jul 16 '13 at 23:20
  • @MaëlNison: On a modern CPU, number of instructions correlates weakly with performance. Due to other effects like memory stalls, cache performance, register dependencies, branch prediction, ... – Oliver Charlesworth Jul 16 '13 at 23:21
  • You haven't specified your architecture, but assuming it's x86 or x64, instructions don't take a constant time to execute. Hence number of instructions is *definitely* a meaningless measure of speed. – Philip Kendall Jul 16 '13 at 23:21
  • You are also timing the execution of your executable, making the assumption that the load times of the executables are roughly identical. This may not be the case. Time the loop itself inside your code. – jxh Jul 16 '13 at 23:27
  • @MaëlNison: You say running the programs repeatedly gives times of the same order of magnitude, but since the times are all of the same order of magnitude as each other in the first place, that's pretty meaningless. How do the times of case 1 compare to those of cases 2 & 3? Is the average time for case 1 more than one standard deviation above the other averages? – jwodder Jul 16 '13 at 23:27
  • @jwodder 0.12s is a noticeable difference when the difference between cases 2 & 3 is 0.008s. – Maël Nison Jul 16 '13 at 23:30
  • @MaëlNison: That's not what I asked. You said you ran the programs repeatedly, but you haven't given us any information about the times beyond the first sample. Give us more timing information. – jwodder Jul 16 '13 at 23:32
  • @MaëlNison please specify your CPU. This has to do with processor technologies; confirmed same results on Sandy Bridge. – snf Jul 16 '13 at 23:41
  • @jwodder I have added the average time over 10 executions – Maël Nison Jul 17 '13 at 00:02
  • @snf It's x86 Archlinux. Updated. – Maël Nison Jul 17 '13 at 00:02
  • This question has a common problem, any half-decent compiler will completely remove the code when the optimizer is turned on. It has no observable side-effects. The expected result is therefore *0* seconds. Contemplating where non-optimized code might spend its time is completely useless. – Hans Passant Jul 17 '13 at 00:10
  • At first this might look like a premature optimization question, but to me it reads more like: "I don't understand this behavior - can anyone explain it?". I think it's reasonable to want to understand why the non-optimized code sequences in the post have a measurable, consistent, significant (about 5%) difference in timing - especially when the code you'd probably expect to be slightly faster is slower. I think that the fact that the example code would optimize to nothing is irrelevant here - example code is often simplified to an extreme to focus results on a particular issue. – Michael Burr Jul 17 '13 at 06:05

1 Answer


It's hard to say whether this is the reason, since I'm doing some guessing and you haven't given specifics (like which target you're using). But what I see when I compile without optimizations for an x86 target is the following sequences for decrementing the `iterations` variable:

Case 1:

L3:
    sub DWORD PTR [esp+12], 1
L2:
    cmp DWORD PTR [esp+12], 0
    jg  L3

Case 2:

L3:
    mov eax, DWORD PTR [esp+12]
    mov eax, DWORD PTR [eax]
    lea edx, [eax-1]
    mov eax, DWORD PTR [esp+12]
    mov DWORD PTR [eax], edx
L2:
    mov eax, DWORD PTR [esp+12]
    mov eax, DWORD PTR [eax]
    test    eax, eax
    jg  L3

One big difference you see in case 1 is that the instruction at `L3` both reads and writes the memory location. It is followed immediately by an instruction that reads the same memory location that was just written. This sort of instruction sequence (the same memory location written and then immediately used by the next instruction) often causes some sort of pipeline stall in modern CPUs.

You'll note that the write followed immediately by a read of the same location is not present in case 2.

Again - this answer is a bit of informed speculation.

Michael Burr
  • Did you test yours? Your asm code is different from the OP's – aaronman Jul 16 '13 at 23:38
  • OP's seems to be the one from Clang. I got this same asm with gcc 4.8. – snf Jul 16 '13 at 23:45
  • @snf the truth is, I doubt this question will have an answer interesting enough for anyone to be satisfied with – aaronman Jul 16 '13 at 23:47
  • @aaronman: I didn't - I thought that even if I saw a difference from the OP's results it could easily be accounted for by various other system effects. I am using a gcc 4.8.0 compiler - I thought that the code gen for an x86 target with `-O0` would be pretty darn similar. The details of how a processor's pipeline might handle read/writes to the same location in very quick succession (even if not precisely the next instruction) are largely a mystery to me and probably most people on the planet. – Michael Burr Jul 16 '13 at 23:51
  • @MichaelBurr I never said you were wrong just wanted to spread awareness – aaronman Jul 16 '13 at 23:52
  • @aaronman: I agree it will be hard to come up with a satisfactory answer other than maybe "the processor may stall when accessing memory in particular patterns". – Michael Burr Jul 16 '13 at 23:52
  • I think everyone wants to be the next [branch prediction](http://stackoverflow.com/questions/11227809/why-is-processing-a-sorted-array-faster-than-an-unsorted-array) – aaronman Jul 16 '13 at 23:53
  • I didn't even notice until now that the question had links to the disassembly; I was wondering why many comments were talking about the OPs generated code. – Michael Burr Jul 16 '13 at 23:54
  • How could you find the corresponding assembly codes of some specific C codes? – Sayakiss Jul 17 '13 at 00:25
  • @Sayakiss: with gcc I use the `-S` option to generate a `.s` file. With MSVC I use the `-FAcs` option to generate a `.cod` file. – Michael Burr Jul 17 '13 at 00:42
  • @Sayakiss You can also use the [GCC Explorer](http://gcc.godbolt.org/) online, which is *awesome*. – Cody Gray - on strike Jul 17 '13 at 04:52