First, let me pose a seemingly silly question. Assuming that the compiler does not optimize the code, which of the following runs faster?
// Let's call it "direct version"
int a = 1, b = 1;
for (int i = 0; i < 150000000; ++i) {
a += b;
}
or
// Let's call it "indirect version"
int a = 1, b = 1;
int *pa = &a, *pb = &b;
for (int i = 0; i < 150000000; ++i) {
*pa += *pb;
}
I guess most would say the "direct version" runs faster. But when I actually ran it, the result was surprising: the "indirect version" is faster on newer Intel CPUs. My program comprises 3 source files.
// File: main.c
#include <stdio.h>
#include <time.h>
extern void direct(void);
extern void indirect(void);
int main()
{
direct(); // Warm up cache
clock_t direct_start = clock();
direct();
clock_t direct_end = clock();
indirect(); // Warm up cache
clock_t indirect_start = clock();
indirect();
clock_t indirect_end = clock();
printf("direct duration = %ld\n", (long)(direct_end - direct_start));
printf("indirect duration = %ld\n", (long)(indirect_end - indirect_start));
return 0;
}
#File: direct.s
.text
.global direct # should be _direct on MacOS
.align 4
direct: # should be _direct on MacOS
pushq %rbp
pushq %rbx
subq $32, %rsp
movl $1, 8(%rsp) # int a = 1;
movl $1, 12(%rsp) # int b = 1;
movl $0, %eax # int i = 0;
movl $150000000, %r8d
.align 4
__loop_condition:
cmpl %r8d, %eax
jge __after_loop # if (i >= ROUND) break;
movl 8(%rsp), %ecx
addl 12(%rsp), %ecx
movl %ecx, 8(%rsp) # a += b;
addl $1, %eax # i++
jmp __loop_condition
__after_loop:
addq $32, %rsp
popq %rbx
popq %rbp
ret
#File: indirect.s
.text
.global indirect # should be _indirect on MacOS
.align 4
indirect: # should be _indirect on MacOS
pushq %rbp
pushq %rbx
subq $32, %rsp
movl $1, 8(%rsp) # int a = 1;
movl $1, 12(%rsp) # int b = 1;
leaq 8(%rsp), %rdx
movq %rdx, 16(%rsp) # int *pa = &a;
leaq 12(%rsp), %rdx
movq %rdx, 24(%rsp) # int *pb = &b;
movl $0, %eax # int i = 0;
movl $150000000, %r8d
.align 4
__loop_condition:
cmpl %r8d, %eax
jge __after_loop # if (i >= ROUND) break;
movq 16(%rsp), %rbx
movl (%rbx), %ecx
movq 24(%rsp), %rdx
addl (%rdx), %ecx
movl %ecx, (%rbx) # *pa += *pb;
addl $1, %eax # i++
jmp __loop_condition
__after_loop:
addq $32, %rsp
popq %rbx
popq %rbp
ret
After compiling and linking them together, I ran the program on several machines. It turned out the "indirect version" is faster on newer Intel CPUs.
+--------+----------+--------------------------+--------------------------------------------+
| Direct | Indirect | CPU                      | OS                                         |
+--------+----------+--------------------------+--------------------------------------------+
| 297200 | 297136   | Core i3 540 @ 3.07GHz    | Linux 4.18.8-041808-generic x86_64         |
| 298989 | 297024   | Core i7-4720HQ @ 2.60GHz | Linux 5.0.0-36-generic x86_64              |
| 281250 | 265625   | Core i7-6700HQ @ 2.60GHz | Linux 4.4.0-18362-Microsoft x86_64         |
| 531250 | 500000   | Core i7-8550U @ 1.80GHz  | Linux 4.4.0-18362-Microsoft x86_64         |
| 191926 | 181627   | Core i7-8850H @ 2.60GHz  | Darwin Darwin Kernel Version 19.0.0 x86_64 |
+--------+----------+--------------------------+--------------------------------------------+
The "indirect version" has longer machine code in the loop body, so its spatial locality is worse, and it performs more memory accesses per iteration. By these measures the "indirect version" should be slower, or at least no faster than the "direct version", even taking out-of-order and speculative execution into account.
So why is it faster?
Edit:
The command for compiling is
gcc -O2 -o main main.c direct.s indirect.s
Putting the "indirect version" before the "direct version" does not change the result on my Mac.
Weather Vane suggested: "I would like to place both versions in the same source file, and run each version twice, alternately, showing the timing for each." The result is:
direct duration 1st = 196822
indirect duration 1st = 185616
direct duration 2nd = 193951
indirect duration 2nd = 182993
dohashi guessed that "the final movl in the direct case requires a computation relative to the stack pointer, whereas in the indirect case it does not". I added the line
leaq 8(%rsp), %rdi
just before __loop_condition, and substituted movl %ecx, 8(%rsp) with movl %ecx, (%rdi) in the "direct" version, but there was no significant change.
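For reference, the loop in direct.s after that change looks like this (reconstructed from the edits just described; everything else is unchanged):

```
leaq 8(%rsp), %rdi # hoist the address of a into a register, once, before the loop
.align 4
__loop_condition:
cmpl %r8d, %eax
jge __after_loop
movl 8(%rsp), %ecx
addl 12(%rsp), %ecx
movl %ecx, (%rdi) # store through the register instead of 8(%rsp)
addl $1, %eax
jmp __loop_condition
```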