
The only difference between the two tested loops below is i += shift versus i += shifts[str[i]]. They have the same alphabet size, the same stride per shift, and the same number of comparisons. The branch prediction and cache hit ratio are also almost the same. What I find hard to understand is why Test1 is several times faster than Test2. Test2 uses a tiny array (length 8) for the shift, so it should hardly be affected by the cache.

Is accessing an array really that much slower than accessing a variable? I know that an array access is a little slower than a plain variable access (if cache effects are ignored, the difference is negligible), but surely not by this much. If that is not the reason, what causes it?

The test environment: macOS, Core i9 CPU, Clang 11, compiled with the -O3 and -O0 options.

The C code:

#include <stdio.h>
#include <time.h>
#include <stdlib.h>
#include <string.h>

#define n     134217728
#define Alpha 8
int c1 = 0, c2 = 0;
// Fixed shift assigned during execution.
int shift;
// Create a table for Test2 shift.
int shifts[Alpha];

void func1(char *str, char ch);
void func2(char *str, char ch);

int main(int argc, const char * argv[]) {
    srandom((unsigned)time(NULL));
    printf("shift = "); scanf("%d", &shift);
    
    // Elements of the shifts are all fixed values(shift).
    for (int i = 0; i < Alpha; ++i) shifts[i] = shift;
    
    // Create a random string with length 134217728 and alphabet 0 ~ 7.
    char *str = malloc(n);
    for (int i = 0; i < n; ++i) str[i] = random()%Alpha;

    // Create a character(not assigned, random 0 ~ 7 in the test).
    char ch;

    // Test1 ============================================================
    double duration1 = 0;
    for (int i = 0; i < 128; ++i) {
        ch = random()%Alpha; // random char
        clock_t begin = clock();        // time1 begin ------------------
        func1(str, ch);
        clock_t end  = clock();         // time1 end --------------------
        duration1 += ((double)(end - begin)) / CLOCKS_PER_SEC;
    }
    printf("1-st: %lf s, and count1: %d\n", (duration1/128.0), c1/128);

    // Test2 ============================================================
    double duration2 = 0;
    for (int i = 0; i < 128; ++i) {
        ch = random()%Alpha; // random char
        clock_t begin = clock();        // time2 begin ------------------
        func2(str, ch);
        clock_t end  = clock();         // time2 end --------------------
        duration2 += ((double)(end - begin)) / CLOCKS_PER_SEC;
    }
    printf("2-st: %lf s, and count2: %d\n", (duration2/128.0), c2/128);

    return 0;
}

void func1(char *str, char ch){
    for (int i = 0; i < n; i += shift)
        if (str[i] == ch) c1++;
}

void func2(char *str, char ch){
    for (int i = 0; i < n; i += shifts[str[i]])
        if (str[i] == ch) c2++;
}

The output below is the average over 128 test iterations. The short-shift (16), medium-shift (64) and long-shift (1024) tests were each compiled with the -O3 and -O0 options. I want to make sure that shift is a genuine runtime variable (one the compiler cannot turn into a constant and then apply extra, special optimizations to), so its value is read in during execution; a minimal alternative that achieves the same thing is sketched right below, before the measured results.
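For example, reading the shift from the command line works just as well and is easier to script. This is only a sketch of that alternative (not the program I actually timed above):

#include <stdio.h>
#include <stdlib.h>

int shift;

int main(int argc, const char *argv[]) {
    if (argc < 2) { fprintf(stderr, "usage: %s shift\n", argv[0]); return 1; }
    shift = atoi(argv[1]);          // still unknown at compile time, so it
                                    // cannot be folded into the loops
    printf("shift = %d\n", shift);
    // ... run the same Test1/Test2 loops as in the full program above ...
    return 0;
}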

  1. The -O3 compilation option
    shift = 16
    1-st: 0.015870 s, and count1: 1048746
    2-st: 0.036174 s, and count2: 1048650
    
    shift = 64
    1-st: 0.009782 s, and count1: 262195
    2-st: 0.016102 s, and count2: 262160
    
    shift = 1024
    1-st: 0.001339 s, and count1: 16375
    2-st: 0.009361 s, and count2: 16402
    
  2. The -O0 compilation option
    shift = 16
    1-st: 0.026950 s, and count1: 1048651
    2-st: 0.045186 s, and count2: 1048541
    
    shift = 64
    1-st: 0.010594 s, and count1: 262129
    2-st: 0.018006 s, and count2: 262140
    
    shift = 1024
    1-st: 0.001665 s, and count1: 16359
    2-st: 0.007348 s, and count2: 16381
    

The following is the assembly code of the two functions:

func1:
        mov     ecx, DWORD PTR shift[rip]
        mov     rax, rdi
        lea     rdx, 134217728[rdi]
        cmp     ecx, 1
        jne     .L13
.L7:
        cmp     sil, BYTE PTR [rax]
        jne     .L6
        add     DWORD PTR c1[rip], 1
.L6:
        add     rax, 1
        cmp     rdx, rax
        jne     .L7
        ret
.L13:
        movsx   rdx, ecx
        xor     eax, eax
.L4:
        cmp     sil, BYTE PTR [rdi]
        jne     .L3
        add     DWORD PTR c1[rip], 1
.L3:
        add     eax, ecx
        add     rdi, rdx
        cmp     eax, 134217727
        jle     .L4
        ret
func2:
        xor     edx, edx
        lea     r8, shifts[rip]
.L18:
        movsx   rax, edx
        add     rax, rdi
        movsx   rcx, BYTE PTR [rax]
        cmp     sil, cl
        je      .L19
        add     edx, DWORD PTR [r8+rcx*4]
        cmp     edx, 134217727
        jle     .L18
        ret
.L19:
        add     DWORD PTR c2[rip], 1
        movsx   rax, BYTE PTR [rax]
        add     edx, DWORD PTR [r8+rax*4]
        cmp     edx, 134217727
        jle     .L18
        ret

These are the results after defining Alpha as 2 (i.e. shifts has length 2), compiled with -O3:

shift = 16
1-st: 0.033265 s, and count1: 4194253
2-st: 0.051381 s, and count2: 4194253

shift = 64
1-st: 0.012764 s, and count1: 1048683
2-st: 0.019852 s, and count2: 1048576

shift = 1024
1-st: 0.001424 s, and count1: 65544
2-st: 0.007971 s, and count2: 65537

The performance difference is still just as large, so the main cause cannot be the difference between register and cache access performance. (P.S.: Perhaps the reason is related to the CPU pipeline; I would be very grateful for a detailed explanation from a hardware expert.)

  • `shift` is an `int` that can probably be stored in a register during the whole loop and `shifts[str[i]]` requires two array lookups. The `-O0` compilation isn't interesting so I think you can just remove that. – Ted Lyngmo Oct 17 '21 at 11:37
  • @Ted Lyngmo Yep, but even so, that should not have such a big impact on performance (several times slower). – Linke Oct 17 '21 at 11:46
  • I'm confused. The performance diff in that tight loop can very well be explained by what I mentioned. – Ted Lyngmo Oct 17 '21 at 12:02
  • There are many time-consuming parts in each iteration; for example, branch prediction is very expensive, in addition to some operations on other variables. So the small difference between the two accesses should not be enough to affect the overall loop efficiency this much. – Linke Oct 17 '21 at 12:08
  • Did you look into the generated assembly? You should be able to find some reasons quite easily. – the busybee Oct 17 '21 at 12:16
  • @Linke As far as I can see, the only diff between these loops is the double array lookup that _will_ be much more expensive than the unchanging `shift` unless the optimizer manages to figure out that the double lookup will result in the same value. – Ted Lyngmo Oct 17 '21 at 12:17
  • @Ted Lyngmo Under -O3, at shift=1024 they even differ by a factor of 9361/1339 ≈ 7. Also, this shouldn't really be regarded as a "double lookup", but rather as a single access to a small array, because the value of `str[i]` in `shifts[str[i]]` is already in a register from the preceding comparison. And `shifts` is just a small array of length 8. – Linke Oct 17 '21 at 12:33
  • Please read [this post](https://stackoverflow.com/questions/69211522/reduce-the-number-of-executions-by-3-times-but-the-execution-efficiency-is-almo/69214698#69214698). The difference here is that all the values in `shifts` are the same, so the result is less random, but most processors cannot perform speculative branching here because the `shifts` array needs to be read (and most processors cannot know that all the values in the array are the same and left unmodified at runtime; the compiler could optimize this but actually does not). A stall is about 12-14 cycles on a modern x86 CPU. – Jérôme Richard Oct 17 '21 at 13:21
  • @Linke, I'm having no more luck than Ted in understanding what you find surprising here. In your `func1`, the same shift is applied at every iteration of the loop. Even if the compiler can't inline the value, it only needs to read the value from memory once. On the other hand, your `func2` has to read the shift from memory at every iteration. Even with those reads being served from cache, that's a lot more expensive than using a value that's available from a register. The loop body is so tight that it doesn't surprise me that the relative performance impact is significant. – John Bollinger Oct 17 '21 at 14:53
  • @Linke `shifts[str[i]]` is equivalent to `*(shifts + *(str + i))`. Compare that to keeping the same value in a register to do the addition. That's what's making it expensive. Sidenote: When comparing stuff, make sure they use the same data. You use different random sequences for the loops. Not that it matters much for the time here but it's generally a good idea. [Example](https://godbolt.org/z/c99ja8o7c) – Ted Lyngmo Oct 17 '21 at 14:53
  • @Jérôme Richard I still don't quite understand. If the reason is what you said, then why does the performance ratio of the two first decrease going from shift=16 to 64, and then increase again at 1024? – Linke Oct 17 '21 at 15:01
  • So far I think Jérôme Richard's comment is the closest to an accurate answer. The cause should not be the difference between register and cache access performance (or at least that has little impact), but should be related to the CPU pipeline. The details are still not clear to me; I hope someone can give a detailed explanation, thank you very much. – Linke Oct 17 '21 at 15:13
  • If you want to see whether it's branch prediction, you could try to rewrite it to be branchless. The problem currently is that `c2` is global and so `c2++` requires an actual load and store, which the code must only do if the condition is true (the compiler can't invent loads and stores), so it has to actually jump over that code if the condition is false. However, if you rewrite it using a local variable in place of `c2`, which you load once and the beginning of the function and store back at the end, you have a hope of getting it compiled as a conditional move, with no branch. – Nate Eldredge Oct 17 '21 at 15:39
  • Oh yeah, see https://godbolt.org/z/5xKxWo1Pe. Here `func1` is not only made branchless but also vectorized (when `shift == 1`). `func2` can't be vectorized but is still branchless (`sete` to get either 0 or 1 according to the truth of the condition, and then add that 0 or 1 unconditionally). – Nate Eldredge Oct 17 '21 at 15:43
  • Having `shifts[str[i]]` added to the loop variable creates a loop-carried dependency (and a slow one at that), you could try doing other things with that quantity to disentangle *that* effect, and the effect of merely accessing the array. IDK exactly what, but it would be useful if you figure something like that out and benchmark it so we can compare the times of all three versions – harold Oct 17 '21 at 15:43
  • [What considerations go into predicting latency for operations on modern superscalar processors and how can I calculate them by hand?](https://stackoverflow.com/q/51607391) is the general case of this question. Like Harold says, in this case the bottleneck is 2x L1d load-use latency plus ALU `add` latency for `i += shifts[str[i]]`, vs. just `add` latency. – Peter Cordes Oct 17 '21 at 16:08
  • @Peter Cordes Thank you... – Linke Oct 17 '21 at 17:22
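Following Nate Eldredge's suggestion in the comments, this is only a sketch of the branchless variant I could try next (func2_branchless is a name I made up here, and I have not benchmarked it): the comparison result is added to a local counter unconditionally, so the compiler is free to use a flag-setting instruction instead of a branch around the global increment.

void func2_branchless(char *str, char ch){
    // Local copy of the global counter: with no conditional store to skip,
    // the compiler may compile the comparison as sete + add instead of a branch.
    int local = c2;
    for (int i = 0; i < n; i += shifts[str[i]])
        local += (str[i] == ch);   // adds 0 or 1, no conditional jump needed
    c2 = local;
}

Note that this only removes the branch; the loop-carried dependency through shifts[str[i]] is still there, so this variant would mainly show how much of the gap is due to branching.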
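Per harold's and Peter Cordes' comments, the expensive part of func2 seems to be that the next value of i depends on two back-to-back loads (str[i], then shifts[...]) plus an add: very roughly 2 × ~5 cycles of L1d load-use latency plus 1 cycle for the add, about 11 cycles per iteration on this class of core, versus a single 1-cycle add in func1 (rough assumed numbers, not measurements). A hypothetical third version could keep the same two loads per iteration but take them off the loop-carried chain by advancing i with the register-resident shift; func3, c3 and sink3 are names I invented for this sketch, and I have not benchmarked it:

int c3 = 0;     // counter for the hypothetical third version
int sink3 = 0;  // keeps the shifts[str[i]] result observable

void func3(char *str, char ch){
    int sum = 0;
    for (int i = 0; i < n; i += shift) {   // advance by the register value, as in func1
        sum += shifts[str[i]];             // same two loads as func2, but their result
                                           // is not needed to compute the next i
        if (str[i] == ch) c3++;
    }
    sink3 = sum;                           // so the loads cannot be optimized away
}

If func3 runs about as fast as func1, the gap comes from the dependency (load latency on the critical path) rather than from the extra array accesses themselves; if it runs about as slow as func2, the accesses are the problem after all.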
