0

I think the code speaks for itself:

➜ project cat branch_prediction_victim.cpp

#include <algorithm>
#include <cstdlib>
#include <ctime>
#include <iostream>

int main(int argc, char* argv[])
{
    // Benchmark: effect of branch prediction on a conditional-sum loop
    // over random values in [0, 256), before and after sorting.
    const size_t array_length = 32768;
    int array_aka[array_length];
    // Explicit cast: std::srand takes unsigned, std::time returns time_t.
    std::srand(static_cast<unsigned>(std::time(nullptr)));
    for (size_t i = 0; i < array_length; ++i)
        array_aka[i] = std::rand() % 256;

    // One trial: sum every element >= threshold, repeated 100000 times
    // so the run is long enough to time meaningfully with clock().
    const auto run_trial = [](const int* a, size_t len, int threshold) {
        long long sum = 0;
        for (int i = 0; i < 100000; ++i)
            for (size_t j = 0; j < len; ++j)
                if (a[j] >= threshold)
                    sum += a[j];
        return sum;
    };

    // Same trial, but with GCC's "branch is usually taken" hint.
    const auto run_trial_hint = [](const int* a, size_t len, int threshold) {
        long long sum = 0;
        for (int i = 0; i < 100000; ++i)
            for (size_t j = 0; j < len; ++j)
                if (__builtin_expect((a[j] >= threshold), 1))
                    sum += a[j];
        return sum;
    };

    // Print one result line: label, elapsed CPU seconds since 'start',
    // then the computed sum (to prove all variants agree).
    const auto report = [](const char* label, long long sum, clock_t start) {
        const double elapsed_seconds =
            static_cast<double>(clock() - start) / CLOCKS_PER_SEC;
        std::cout << label << elapsed_seconds << std::endl;
        std::cout << "sum =                 \t" << sum << std::endl;
    };

    clock_t start;
    long long sum;

    // Unsorted: ">= 128" is a 50/50 branch (hard to predict);
    // ">= 64" is taken ~75% of the time (easier to predict).
    start = clock();
    sum = run_trial(array_aka, array_length, 128);
    report("unsorted(>= 128):     \t", sum, start);

    start = clock();
    sum = run_trial(array_aka, array_length, 64);
    report("unsorted(>= 64):      \t", sum, start);

    start = clock();
    sum = run_trial_hint(array_aka, array_length, 64);
    report("unsorted(>= 64)(hint):\t", sum, start);

    // After sorting, each branch direction changes only once across the
    // whole array, so the predictor is nearly perfect.
    std::sort(array_aka, array_aka + array_length);

    start = clock();
    sum = run_trial(array_aka, array_length, 128);
    report("sorted(>= 128):       \t", sum, start);

    start = clock();
    sum = run_trial(array_aka, array_length, 64);
    report("sorted(>= 64):        \t", sum, start);

    start = clock();
    sum = run_trial_hint(array_aka, array_length, 64);
    report("sorted(>= 64)(hint):  \t", sum, start);
}

➜ project g++ -o run branch_prediction_victim.cpp && ./run

unsorted(>= 128):       19.3221
sum =                   312708500000
unsorted(>= 64):        13.5077         <-- branch prediction friendly there is majority
sum =                   391463200000
unsorted(>= 64)(hint):  13.6219         <-- __builtin_expect has no effect
sum =                   391463200000
sorted(>= 128):         6.38736         <-- branch prediction friendly with sorted array
sum =                   312708500000
sorted(>= 64):          6.35014
sum =                   391463200000
sorted(>= 64)(hint):    6.18703
sum =                   391463200000

➜ project g++ -O3 -o run branch_prediction_victim.cpp && ./run

unsorted(>= 128):       0.787082
sum =                   314110000000
unsorted(>= 64):        0.777937
sum =                   392066300000
unsorted(>= 64)(hint):  0.783885
sum =                   392066300000
sorted(>= 128):         0.782302
sum =                   314110000000
sorted(>= 64):          0.782116
sum =                   392066300000
sorted(>= 64)(hint):    0.785842
sum =                   392066300000

As we can see: 1. with -O3 we need neither a majority-taken branch nor a sorted array; 2. __builtin_expect has no effect here. It seems we need to examine the assembly to find out why.

➜ project g++ -O3 -Wa,-a -g branch_prediction_victim.cpp > branch_prediction_victim.S

 834                    .L74:
 835 0080 660FEFD2              pxor    %xmm2, %xmm2
 836 0084 4889EB                movq    %rbp, %rbx
 837                    .LVL108:
 838 0087 660F1F84              .p2align 4,,10
 838      00000000
 838      00
 839                            .p2align 3
 840                    .L73:
 841                    .LBB547:
 842                    .LBB548:
  22:branch_prediction_victim.cpp ****     for (int i = 0; i < 100000; ++i)
  23:branch_prediction_victim.cpp ****         for (size_t j = 0; j < array_length; ++j)
  24:branch_prediction_victim.cpp ****             if (array_aka[j] >= 128)
 843                            .loc 5 24 0
 844 0090 660F6F03              movdqa  (%rbx), %xmm0
 845 0094 4883C310              addq    $16, %rbx
 846 0098 4939DC                cmpq    %rbx, %r12
 847 009b 660F6FC8              movdqa  %xmm0, %xmm1
 848 009f 660F66CC              pcmpgtd %xmm4, %xmm1
 849 00a3 660FDBC1              pand    %xmm1, %xmm0
 850 00a7 660F6FCD              movdqa  %xmm5, %xmm1
 851 00ab 660F6FD8              movdqa  %xmm0, %xmm3
 852 00af 660F66C8              pcmpgtd %xmm0, %xmm1
 853 00b3 660F62D9              punpckldq       %xmm1, %xmm3
 854 00b7 660F6AC1              punpckhdq       %xmm1, %xmm0
 855 00bb 660FD4D3              paddq   %xmm3, %xmm2
 856 00bf 660FD4D0              paddq   %xmm0, %xmm2
 857 00c3 75CB                  jne     .L73
 858 00c5 660F6FC2              movdqa  %xmm2, %xmm0
 859 00c9 660F73D8              psrldq  $8, %xmm0
 859      08
 860 00ce 660FD4C2              paddq   %xmm2, %xmm0
 861 00d2 66480F7E              movq    %xmm0, %rax
 861      C0
 862 00d7 4901C6                addq    %rax, %r14
 863                    .LVL109:
 864                    .LBE548:
  22:branch_prediction_victim.cpp ****     for (int i = 0; i < 100000; ++i)
 865                            .loc 5 22 0
 866 00da 83EA01                subl    $1, %edx
 867 00dd 75A1                  jne     .L74
 868 00df 0F292424              movaps  %xmm4, (%rsp)
 869                    .LBE547:
  25:branch_prediction_victim.cpp ****                 sum += array_aka[j];

I also checked the assembly of the without -O3

 174                .LBB3:
  22:branch_prediction_victim.cpp ****     for (int i = 0; i < 100000; ++i)
 175                    .loc 3 22 0
 176 00a4 C78580FF      movl    $0, -131200(%rbp)
 176      FDFF0000 
 176      0000
 177                .L15:
 178                    .loc 3 22 0 is_stmt 0 discriminator 1
 179 00ae 81BD80FF      cmpl    $99999, -131200(%rbp)
 179      FDFF9F86 
 179      0100
 180 00b8 7F55          jg  .L11
 181                .LBB4:
 182                .LBB5:
  23:branch_prediction_victim.cpp ****         for (size_t j = 0; j < array_length; ++j)
 183                    .loc 3 23 0 is_stmt 1
 184 00ba 48C785A8      movq    $0, -131160(%rbp)
 184      FFFDFF00 
 184      000000
 185                .L14:
 186                    .loc 3 23 0 is_stmt 0 discriminator 1
 187 00c5 4881BDA8      cmpq    $32767, -131160(%rbp)
 187      FFFDFFFF 
 187      7F0000
 188 00d0 7734          ja  .L12
  24:branch_prediction_victim.cpp ****             if (array_aka[j] >= 128)
 189                    .loc 3 24 0 is_stmt 1
 190 00d2 488B85A8      movq    -131160(%rbp), %rax
 190      FFFDFF
 191 00d9 8B8485F0      movl    -131088(%rbp,%rax,4), %eax
 191      FFFDFF
 192 00e0 83F87F        cmpl    $127, %eax
 193 00e3 7E17          jle .L13
  25:branch_prediction_victim.cpp ****                 sum += array_aka[j];
 194                    .loc 3 25 0
 195 00e5 488B85A8      movq    -131160(%rbp), %rax
 195      FFFDFF
 196 00ec 8B8485F0      movl    -131088(%rbp,%rax,4), %eax
 196      FFFDFF
 197 00f3 4898          cltq
 198 00f5 480185A0      addq    %rax, -131168(%rbp)
 198      FFFDFF
 199                .L13:
  23:branch_prediction_victim.cpp ****         for (size_t j = 0; j < array_length; ++j)
 200                    .loc 3 23 0 discriminator 2
 201 00fc 488385A8      addq    $1, -131160(%rbp)
 201      FFFDFF01 
 202 0104 EBBF          jmp .L14
 203                .L12:
 204                .LBE5:
 205                .LBE4:
  22:branch_prediction_victim.cpp ****         for (size_t j = 0; j < array_length; ++j)
 206                    .loc 3 22 0 discriminator 2
 207 0106 838580FF      addl    $1, -131200(%rbp)
 207      FDFF01
 208 010d EB9F          jmp .L15
 209                .L11:
 210                .LBE3:

There is a huge difference between the two.

  1. What is the logic here, with all these xmm instructions?
  2. Why do such instructions speed up performance?
  3. It seems __builtin_expect has hardly any effect, so why is likely (aka __builtin_expect) used everywhere in the Linux kernel?
  4. Without -O3, we can see that the for (0..100k) and for (0..arr_len) loops are separate, so I'm sure there are 100k * arr_len iterations.
  5. In the -O3 version, for (0..100k) and for (0..arr_len) are combined, so are there only 100k iterations?
http8086
  • 1,306
  • 16
  • 37
  • 2
    It's auto-vectorizing with SSE2 SIMD to compare 4 elements at once. (But then unpacks that 4x32-bit vector to two vectors of 2x64-bit for two `paddq`). Notice the `addq $16, %rbx` pointer increment. If you want to look at optimized scalar code, use `-O2` or `-O3 -fno-tree-vectorize`. Of course all the `-O3` times are the same; they all compile the same way, to branchless SIMD asm. Unless you stop it, GCC decides that vectorizing is even better than taking advantage of the hint that the branch is predictable. (That might barely be true the clunky way it vectorizes, though...) – Peter Cordes Apr 05 '20 at 12:14
  • @PeterCordes, thank you, awesome, for the `jne .L73` thing, there was `100000 * array_length` cycles, but now there are only `100000` cycles?, it processed `array_length` so many elements at once or just 4? – http8086 Apr 05 '20 at 12:38
  • xmm registers are 128-bit long, so it can contain only 4 `int`s at once. If you compile for AVX2 or AVX512 then you'll be able to operate on 8 or 16 `int`s at the same time – phuclv Apr 05 '20 at 12:49
  • @workplaylifecycle: What do you mean exactly? What two things are you comparing to see `100k` cycles vs. `100k * arr_len`? The loop trip count is only a factor of 4 lower than it would be for scalar; notice the pointer increment amount. And it's not doing loop interchange to defeat your repeat loop; that loop over the array is still the inner loop. Are you sure you're not just looking at the outer loop branch which is reached repeat-count times? – Peter Cordes Apr 05 '20 at 12:53
  • you can't use floating-point or vectorized code in kernel (which uses those [xmm/ymm/zmm registers](https://stackoverflow.com/q/52932539/995714)) therefore scalar code must be used, which means `__builtin_expect` is necessary, except in a few situations where [CMOVcc](https://www.felixcloutier.com/x86/cmovcc) or [SWAR](https://en.wikipedia.org/wiki/SWAR) can be used – phuclv Apr 05 '20 at 12:54
  • @PeterCordes, thanks, I just need to make it clear first, without `-O3` there was `100k * array_len` loops, how many now? – http8086 Apr 05 '20 at 13:01
  • @phuclv: Usually `__builtin_expect` is *not* necessary for cases like this; the compiler can ideally figure out whether to use CMOV or a branch on its own (at least with PGO, which is a problem for a kernel...). The main use in the kernel is to stop the compiler from guessing wrong about which path is the common fast path, and which is the error-handling rare case that doesn't need to be fast. A *lot* of code in the kernel needs to check permissions or whatever, but working programs very rarely make system calls that return errors, compared to the amount that succeed. – Peter Cordes Apr 05 '20 at 13:05
  • 1
    @workplaylifecycle: There's no point adding the un-optimized `-O0` asm to your question. Like I answered on [How to understand macro \`likely\` affecting branch prediction?](https://stackoverflow.com/q/61030543), un-optimized code is useless and in this specific case just added more instructions before doing the same branch. The main speedup with `-O3` is just from not making useless debug code that keeps everything in memory instead of registers. Vectorization is maybe giving a factor of 2 at best, probably less, because GCC doesn't do a good job. – Peter Cordes Apr 05 '20 at 13:45
  • All of your tests have either been branchless vectorized code, or garbage unoptimized code that branches the same way, and probably doesn't benefit at all from `likely` because you told the compiler not to optimize. (`-O0` is the default, and you left that setting at the default.) So it's no surprise that `likely` has negligible effect. – Peter Cordes Apr 05 '20 at 13:49
  • ya, just did some research about automatic vectorization, I am wondering compare 4 element at once is like compare `array[0]-[1]-[2]-[3]` to `128-128-128-128`, is something like this?, but in the 4 elements, what if some of them less than 128 but others greater than 128? – http8086 Apr 05 '20 at 15:10
  • 1
    Look at compiler output to see what vector constants the compiler is putting in memory. But yes, something like that for `array[i + 0..3]` against a vector of 127 probably. SSE/AVX before AVX512 only has two integer compare choices: equal or signed greater-than, not greater-or-equal. *but in the 4 elements, what if some of them less than 128 but others greater than 128?* That's why it ANDs with the compare result, to mask those elements to `0`, which conveniently is the additive identity element. So it's like doing `sum += x >= 128 ? x : 0`, always doing the add but maybe with 0. – Peter Cordes Apr 05 '20 at 18:03
  • And BTW, I had thought GCC was being inefficient but actually it's not that bad. I thought it was just *counting* numbers that meet a condition, but it's actually summing them. It has to sign-extend them without SSE4.1 `pmovsx`. That's why it needs a 2nd `pcmpgtd` (against `0` I assume) to get high halves to unpack with. It would be more efficient if you had an array of unsigned, and much more efficient with a 32-bit `unsigned` sum that could wrap. – Peter Cordes Apr 05 '20 at 18:06
  • Thought about this again; it actually is a missed optimization. The condition is `x >= 128` so we're never adding negative integers. So sign-extension simplifies to zero-extension, but it seems GCC's value-range propagation doesn't take advantage of it :/ Also, even better than unpacking might be extended-precision using the a 2nd vector of 32-bit elements as the high halves of the counter vector. If we can cheaply detect carry-out from `paddd` and add that 0 / 1 (or more likely subtract that 0 / -1) into another vector, we can wait until the end to unpack and do 64-bit addition. – Peter Cordes Apr 05 '20 at 20:10

0 Answers0