
First, let me pose a seemingly silly question. Assuming that the compiler does not optimize the code, which of the following runs faster?

// Let's call it "direct version"
int a = 1, b = 1;
for (int i = 0; i < 150000000; ++i) {
    a += b;
}

or

// Let's call it "indirect version"
int a = 1, b = 1;
int *pa = &a, *pb = &b;
for (int i = 0; i < 150000000; ++i) {
    *pa += *pb;
}

I guess most would say the "direct version" runs faster. But when I actually ran it, the result was surprising: the "indirect version" is faster on newer Intel CPUs. My program comprises three source files.

// File: main.c
#include <stdio.h>
#include <time.h>

extern void direct(void);
extern void indirect(void);

int main()
{
    direct(); // Warm up cache
    clock_t direct_start = clock();
    direct();
    clock_t direct_end = clock();

    indirect(); // Warm up cache
    clock_t indirect_start = clock();
    indirect();
    clock_t indirect_end = clock();

    printf("direct duration = %ld\n", (long)(direct_end - direct_start));
    printf("indirect duration = %ld\n", (long)(indirect_end - indirect_start));
    return 0;
}

# File: direct.s
    .text
    .global     direct # should be _direct on MacOS
    .align      4
direct: # should be _direct on MacOS
    pushq   %rbp
    pushq   %rbx
    subq    $32, %rsp
    movl    $1, 8(%rsp)         # int a = 1;
    movl    $1, 12(%rsp)        # int b = 1;

    movl    $0, %eax            # int i = 0;
    movl    $150000000, %r8d

    .align 4
__loop_condition:
    cmpl    %r8d, %eax
    jge     __after_loop        # if (i >= ROUND) break;

    movl    8(%rsp), %ecx
    addl    12(%rsp), %ecx
    movl    %ecx, 8(%rsp)       # a += b;

    addl    $1, %eax            # i++
    jmp __loop_condition
__after_loop:
    addq    $32, %rsp
    popq    %rbx
    popq    %rbp
    ret

# File: indirect.s
    .text
    .global     indirect # should be _indirect on MacOS
    .align      4
indirect: # should be _indirect on MacOS
    pushq   %rbp
    pushq   %rbx
    subq    $32, %rsp
    movl    $1, 8(%rsp)         # int a = 1;
    movl    $1, 12(%rsp)        # int b = 1;

    leaq    8(%rsp), %rdx
    movq    %rdx, 16(%rsp)      # int *pa = &a;

    leaq    12(%rsp), %rdx
    movq    %rdx, 24(%rsp)      # int *pb = &b;

    movl    $0, %eax            # int i = 0;
    movl    $150000000, %r8d

    .align 4
__loop_condition:
    cmpl    %r8d, %eax
    jge     __after_loop        # if (i >= ROUND) break;

    movq    16(%rsp), %rbx
    movl    (%rbx), %ecx
    movq    24(%rsp), %rdx
    addl    (%rdx), %ecx
    movl    %ecx, (%rbx)        # *pa += *pb;

    addl    $1, %eax            # i++
    jmp __loop_condition
__after_loop:
    addq    $32, %rsp
    popq    %rbx
    popq    %rbp
    ret

After compiling and linking them together, I ran the program on several machines. It turned out that the "indirect version" was faster on the newer Intel CPUs.

+----------------+------------------+--------------------------+-------------------------------------+
| Direct (ticks) | Indirect (ticks) |           CPU            |                 OS                  |
+----------------+------------------+--------------------------+-------------------------------------+
|         297200 |           297136 | Core i3 540 @ 3.07GHz    | Linux 4.18.8-041808-generic x86_64  |
|         298989 |           297024 | Core i7-4720HQ @ 2.60GHz | Linux 5.0.0-36-generic x86_64       |
|         281250 |           265625 | Core i7-6700HQ @ 2.60GHz | Linux 4.4.0-18362-Microsoft x86_64  |
|         531250 |           500000 | Core i7-8550U @ 1.80GHz  | Linux 4.4.0-18362-Microsoft x86_64  |
|         191926 |           181627 | Core i7-8850H @ 2.60GHz  | Darwin Kernel Version 19.0.0 x86_64 |
+----------------+------------------+--------------------------+-------------------------------------+
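
A note on units: these durations, like the program's output, are raw clock() ticks rather than seconds. Below is a minimal, self-contained sketch (not one of the three benchmark files) showing how to convert ticks to seconds with the standard CLOCKS_PER_SEC macro; on POSIX systems CLOCKS_PER_SEC is 1000000, so 297200 ticks is roughly 0.3 s:

// File: ticks.c -- illustrative only, not part of the benchmark
#include <stdio.h>
#include <time.h>

// Print a clock() tick delta in both ticks and seconds.
static void print_duration(const char *label, clock_t ticks)
{
    printf("%s = %ld ticks (%.3f s)\n",
           label, (long)ticks, (double)ticks / CLOCKS_PER_SEC);
}

int main(void)
{
    clock_t start = clock();
    for (volatile int i = 0; i < 150000000; ++i)
        ;                       // dummy work to time
    print_duration("dummy loop", clock() - start);
    return 0;
}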

The "indirect version" has longer machine code in the loop body, so its spacial locality is worse. It also has more memory addressing. According to these, "indirect version" should be slower, or at least not faster than "direct version" if we taking the out-of-order and speculative execution into account.

So why is the indirect version faster?

Edit:

  1. The command for compiling is `gcc -O2 -o main main.c direct.s indirect.s`.

  2. Putting the "indirect version" before the "direct version" does not change the result on my Mac.

  3. Weather Vane said "I would like to place both versions in the same source file, and run each version twice, alternately, showing the timing for each." The result is:

direct duration 1st = 196822
indirect duration 1st = 185616
direct duration 2nd = 193951
indirect duration 2nd = 182993
  4. dohashi guessed that "the final movl in the direct case requires a computation relative to the stack pointer, whereas in the indirect case it does not". I added a line `leaq 8(%rsp), %rdi` just before `__loop_condition` and replaced `movl %ecx, 8(%rsp)` with `movl %ecx, (%rdi)` in the "direct" version (see the sketch below), but there was no significant change.
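
For concreteness, here is a minimal sketch of the modified direct.s loop described in point 4, reconstructed from the instructions quoted there; only the leaq line and the final store differ from the original:

    leaq    8(%rsp), %rdi       # compute &a once, outside the loop

    .align 4
__loop_condition:
    cmpl    %r8d, %eax
    jge     __after_loop        # if (i >= ROUND) break;

    movl    8(%rsp), %ecx
    addl    12(%rsp), %ecx
    movl    %ecx, (%rdi)        # a += b; store no longer uses an %rsp-relative address

    addl    $1, %eax            # i++
    jmp     __loop_condition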
  • @EugeneSh. 150000000 is smaller than 2147483647, it's safe :). – Zhiyao Nov 18 '19 at 16:35
  • Oh, counted an extra zero – Eugene Sh. Nov 18 '19 at 16:38
  • How exactly are you compiling the code for each of these tests? Are you using the same executable for all of the "Linux" tests? – Ross Ridge Nov 18 '19 at 16:38
  • Try changing your program to test the “indirect” code first. This looks a lot like it is an issue of warmup more than of performance. – fuz Nov 18 '19 at 16:39
  • Very interesting question! – fuz Nov 18 '19 at 16:46
  • @RossRidge No, they were compiled from source each time. I edited the question and put the command at the end. – Zhiyao Nov 18 '19 at 16:47
  • I would like to place both versions in the same source file, and run each version twice, alternately, showing the timing for each. – Weather Vane Nov 18 '19 at 16:48
  • @WeatherVane Switching the orders does not change the timings significantly. I was able to reproduce this. – fuz Nov 18 '19 at 16:54
  • @WeatherVane Result was added above. – Zhiyao Nov 18 '19 at 16:55
  • Just a guess: the final movl in the direct case requires a computation relative to the stack pointer, whereas in the indirect case it does not. – dohashi Nov 18 '19 at 16:56
  • @dohashi A new result was added. – Zhiyao Nov 18 '19 at 17:03
  • You should really change the body of your question to ask why the indirect version of your handcrafted assembly code is faster on certain CPUs. There's a lot of different ways compilers will generate "unoptimized" code for your C examples, none of which are likely to be the same as your assembly code. For example they won't keep `i` in a register, nor the constant 150000000. – Ross Ridge Nov 18 '19 at 18:22
  • @RossRidge It is pretty clear to me that the question asks just that. The C code in the beginning is merely a motivation for what follows. – fuz Nov 18 '19 at 18:27
  • @fuz This post opens with a clearly stated question that differs significantly from the less clearly stated question at the end. So, no, it's not at all clear. It doesn't help that the post is tagged `[c]` when it's not actually about C code or C compilers. – Ross Ridge Nov 18 '19 at 18:44
  • @fuz: I tend to agree with Ross; I was expecting this to be a duplicate of the usual Sandybridge store-forwarding with `-O0`, but then saw `-O2` in the compile options. Then I had to read the `.s` files: it looks like they are `gcc -O0 -fomit-frame-pointer` compiler output. They keep everything in memory but use RSP-relative addressing for locals. So yes, like I expected, yet another "discovery" of SnB store-forwarding being faster if you don't try to reload right away, speeding up convoluted `-O0` code. – Peter Cordes Nov 19 '19 at 05:09
