The only difference between the two tested loops below is i += shift versus i += shifts[str[i]]. They use the same alphabet size, the same stride per shift, and the same number of comparisons. Branch prediction and the cache hit ratio are also almost identical. What I find hard to understand is why Test1 is several times faster than Test2. Test2 uses a small array (length 8) for the shifts, so it should hardly be affected by the cache.
Is accessing an array really that much slower than accessing a variable? I know that an array access is slightly slower than a plain variable access (negligibly so if cache misses are ignored), but it should not make this much difference. If that is not the reason, what does cause it?
Test environment: macOS, Core i9 CPU, Clang 11, compiled with -O3 and -O0.
The C code:
#include <stdio.h>
#include <time.h>
#include <stdlib.h>
#include <string.h>

#define n 134217728
#define Alpha 8

int c1 = 0, c2 = 0;
// Fixed shift assigned during execution.
int shift;
// Table of shifts for Test2.
int shifts[Alpha];

void func1(char *str, char ch);
void func2(char *str, char ch);

int main(int argc, const char *argv[]) {
    srandom((unsigned)time(NULL));
    printf("shift = "); scanf("%d", &shift);
    // All elements of shifts hold the same fixed value (shift).
    for (int i = 0; i < Alpha; ++i) shifts[i] = shift;
    // Create a random string of length 134217728 over the alphabet 0 ~ 7.
    char *str = malloc(n);
    for (int i = 0; i < n; ++i) str[i] = random() % Alpha;
    // The character to search for (picked at random, 0 ~ 7, in each test).
    char ch;
    // Test1 ============================================================
    double duration1 = 0;
    for (int i = 0; i < 128; ++i) {
        ch = random() % Alpha;       // random char
        clock_t begin = clock();     // time1 begin ------------------
        func1(str, ch);
        clock_t end = clock();       // time1 end --------------------
        duration1 += ((double)(end - begin)) / CLOCKS_PER_SEC;
    }
    printf("1-st: %lf s, and count1: %d\n", (duration1 / 128.0), c1 / 128);
    // Test2 ============================================================
    double duration2 = 0;
    for (int i = 0; i < 128; ++i) {
        ch = random() % Alpha;       // random char
        clock_t begin = clock();     // time2 begin ------------------
        func2(str, ch);
        clock_t end = clock();       // time2 end --------------------
        duration2 += ((double)(end - begin)) / CLOCKS_PER_SEC;
    }
    printf("2-st: %lf s, and count2: %d\n", (duration2 / 128.0), c2 / 128);
    return 0;
}

void func1(char *str, char ch) {
    for (int i = 0; i < n; i += shift)
        if (str[i] == ch) c1++;
}

void func2(char *str, char ch) {
    for (int i = 0; i < n; i += shifts[str[i]])
        if (str[i] == ch) c2++;
}
The outputs below are averages over 128 runs. The tests for a short shift (16), a medium shift (64), and a long shift (1024) were compiled with the -O3 and -O0 options respectively. I want to ensure that shift is a runtime variable (so the compiler cannot fold it into a constant and then apply additional, special optimizations), so I read its value during execution.
- The -O3 compilation option
shift = 16
1-st: 0.015870 s, and count1: 1048746
2-st: 0.036174 s, and count2: 1048650
shift = 64
1-st: 0.009782 s, and count1: 262195
2-st: 0.016102 s, and count2: 262160
shift = 1024
1-st: 0.001339 s, and count1: 16375
2-st: 0.009361 s, and count2: 16402
- The -O0 compilation option
shift = 16
1-st: 0.026950 s, and count1: 1048651
2-st: 0.045186 s, and count2: 1048541
shift = 64
1-st: 0.010594 s, and count1: 262129
2-st: 0.018006 s, and count2: 262140
shift = 1024
1-st: 0.001665 s, and count1: 16359
2-st: 0.007348 s, and count2: 16381
The following is the assembly code of the two functions:
func1:
        mov     ecx, DWORD PTR shift[rip]
        mov     rax, rdi
        lea     rdx, 134217728[rdi]
        cmp     ecx, 1
        jne     .L13
.L7:
        cmp     sil, BYTE PTR [rax]
        jne     .L6
        add     DWORD PTR c1[rip], 1
.L6:
        add     rax, 1
        cmp     rdx, rax
        jne     .L7
        ret
.L13:
        movsx   rdx, ecx
        xor     eax, eax
.L4:
        cmp     sil, BYTE PTR [rdi]
        jne     .L3
        add     DWORD PTR c1[rip], 1
.L3:
        add     eax, ecx
        add     rdi, rdx
        cmp     eax, 134217727
        jle     .L4
        ret

func2:
        xor     edx, edx
        lea     r8, shifts[rip]
.L18:
        movsx   rax, edx
        add     rax, rdi
        movsx   rcx, BYTE PTR [rax]
        cmp     sil, cl
        je      .L19
        add     edx, DWORD PTR [r8+rcx*4]
        cmp     edx, 134217727
        jle     .L18
        ret
.L19:
        add     DWORD PTR c2[rip], 1
        movsx   rax, BYTE PTR [rax]
        add     edx, DWORD PTR [r8+rax*4]
        cmp     edx, 134217727
        jle     .L18
        ret
This is the result after I defined Alpha as 2 (that is, the size of shifts is 2), with -O3:
shift = 16
1-st: 0.033265 s, and count1: 4194253
2-st: 0.051381 s, and count2: 4194253
shift = 64
1-st: 0.012764 s, and count1: 1048683
2-st: 0.019852 s, and count2: 1048576
shift = 1024
1-st: 0.001424 s, and count1: 65544
2-st: 0.007971 s, and count2: 65537
The performance difference is still just as large, so the main cause evidently cannot be a difference in register versus cache performance. (PS: perhaps the reason is related to the CPU pipeline; I would be very grateful for a detailed explanation from a hardware expert.)