9

Does anyone have any thoughts on how to calculate the mode (statistic) of a vector of 8-bit integers in SSE4.x? To clarify, this would be 16x8-bit values in a 128-bit register.

I want the result as a vector mask which selects the mode-valued elements, i.e. the result of _mm_cmpeq_epi8(v, set1(mode(v))), as well as the scalar value itself.
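
For clarity, here is a plain scalar reference of what I'm asking for (not meant to be fast; the helper name is just for illustration):

#include <stdint.h>
#include <x86intrin.h>

/* Scalar reference only. Returns the mode value and writes the requested
   element mask; on a tie, the value encountered first wins. */
static uint8_t mode_reference(__m128i v, __m128i *mask_out)
{
    uint8_t e[16];
    _mm_storeu_si128((__m128i *)e, v);

    int best = 0;
    uint8_t mode = e[0];
    for (int i = 0; i < 16; i++) {
        int count = 0;
        for (int j = 0; j < 16; j++)
            count += (e[j] == e[i]);
        if (count > best) { best = count; mode = e[i]; }
    }
    *mask_out = _mm_cmpeq_epi8(v, _mm_set1_epi8((char)mode));
    return mode;
}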


To provide some additional context: while the above problem is an interesting one to solve in its own right, I have already been through most algorithms I can think of with linear complexity, and that class will wipe out any gains I can get from calculating this number.

I hope to engage you all in searching for some deep magic here. It's possible that an approximation may be necessary to break this bound, such as "select a frequently occurring element" (note the difference against the most frequently occurring one), which would still be of merit. A probabilistic answer would be usable, too.
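
To illustrate the kind of approximation I mean, something along these lines would be acceptable (a sketch only; the candidate choice and the threshold are arbitrary placeholders):

#include <x86intrin.h>

/* Approximate "pick a frequently occurring element": test a single candidate
   position (here simply element 0 -- any sampling policy would do) and accept
   it if it covers at least 'threshold' of the 16 lanes. */
static int frequent_element_guess(__m128i v, int threshold, __m128i *mask_out)
{
    __m128i cand = _mm_shuffle_epi8(v, _mm_setzero_si128()); /* broadcast element 0 */
    __m128i eq   = _mm_cmpeq_epi8(v, cand);
    int hits     = __builtin_popcount(_mm_movemask_epi8(eq));
    if (hits >= threshold) {
        *mask_out = eq;          /* candidate accepted as "frequently occurring" */
        return 1;
    }
    return 0;                    /* caller falls back to an exact method */
}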

SSE and x86 have some very interesting semantics. It may be worth exploring a superoptimization pass.

awdz9nld
  • To find the mode, one needs to calculate a histogram first. But it's hard to vectorize histogram calculation effectively due to memory collisions and dependencies. Possible approach: https://stackoverflow.com/questions/12985949/ – MBo Aug 03 '17 at 06:47
  • Hm... Perhaps you could compute how often each element occurs by broadcasting that element to each member of a new vector, then `pcmp` with the original vector and a horizontal sum to get the element counts. Now find the maximum of that using `pmaxub`. Doesn't seem to be too fast though. – fuz Aug 03 '17 at 09:24
  • @fuz Fortunately the horizontal sum is not necessary. A much more efficient vertical sum can be used here. See my answer below. – wim Aug 03 '17 at 22:12
  • @MBo The set of values is very small in this case (only 16 elements) and fits within an SSE register. This simplifies the problem a lot; we don't have to worry about memory collisions and dependencies, as long as the computations are done within the SSE registers, see my [answer](https://stackoverflow.com/a/45492545/2439725) below. – wim Aug 04 '17 at 21:17
  • @Martin Källman Interesting question. Note that a less mathematical title of your question might help a bit to attract more attention to your question. For example: 'How to calculate efficiently the 8-bit integer that appears most often within a 16x8-bit SSE register'. – wim Aug 04 '17 at 21:19
  • Where do you want the result? Value in an integer register? Value broadcast to bytes in an XMM register? Value zero-extended to 32 bits in the bottom of an XMM register? Or do you want an index (position) instead of the actual value? – Peter Cordes Aug 06 '17 at 04:31
  • @PeterCordes in my specific case I will use it to create a mask for the mode-valued elements, for other purposes – awdz9nld Aug 06 '17 at 18:14
  • You should have said that in the first place. @wim's answer can get a mode-mask more efficiently than it can actually get the mode *value*, with a packed-compare against the match-count vectors. – Peter Cordes Aug 06 '17 at 18:24
  • I know I sound like a broken record, but what is the distribution of the input elements and the expected mode? Are all 256 possible answers equally likely, or is the mode strongly biased towards a few values? "Equally likely" seems ... unlikely ... since the result would almost always be a tie with frequency == 1, and I don't have any special insight for the "equally likely" version anyway beyond what has already been written here. – BeeOnRope Aug 06 '17 at 20:53
  • ... if there is a strong enough bias, on the other hand, there are fast approaches available, such as explicitly comparing against the few likely modes and counting the number of hits. If the highest frequency in the checked values > `16 - sum(checked frequencies)` then you have found your mode quickly and can splat it out and do a `cmpeq` to get your mask. Depending on the bias in the answers you might want to do this branchy, or branch free. If the mode is not one of the expected values you can fall back to a slower approach. – BeeOnRope Aug 06 '17 at 20:55
  • @BeeOnRope the distribution is unknown. either way I need to be able to find the mode in such a way that the procedure can be repeatedly applied, in order to preserve correctness – awdz9nld Aug 06 '17 at 21:45
  • Indeed, what I describe above can be repeatedly applied and is correct. I should clarify that you don't need to know the actual distribution (i.e., which values are common), but just the general "shape", i.e., is it skewed or not. In any case, if you know nothing about your input data (highly unusual?) you can always determine this at runtime - use the fallback algorithm initially and simply periodically check if the modes are highly concentrated, and at that point switch to the concentrated distribution algorithm. @MartinKällman – BeeOnRope Aug 06 '17 at 21:47
  • @BeeOnRope the distribution can be completely arbitrary - it's hard to make any assertions about it. to be able to guarantee correctness with an explicit search, we would have to resort to either exhaustive enumeration or branching. either way, I'm not able to say anything about the distribution, I'm afraid :( – awdz9nld Aug 06 '17 at 21:51
  • @MartinKällman You don't need to make any particular assertion about the distribution for _correctness_, but only for performance. Certainly you have actually run your application and know something about the data? Even if not, as I mentioned above you can simply accommodate the case where the data is skewed with the occasional cheap switch at runtime based on the observed distribution. Those are only _heuristics_ and **don't affect correctness**. For example, I'm not following what you mean about exhaustive enumeration? – BeeOnRope Aug 06 '17 at 21:54
  • To clarify what I mean, let's say you have 3 values, `a, b, c` which are commonly the mode, and their occurrence counts in some 16 byte sample are `f(a)`, `f(b)` and `f(c)`, and then let `m = max(f(a), f(b), f(c))`. If `m >= 16 - (f(a) + f(b) + f(c))`, then you are guaranteed that whatever of a, b, c had `m` occurrences is the mode (or tied for the mode) and you don't need to do any more work (other than generating your mask, which is easy). Otherwise, fall back to the other solutions below. This is always _correct_... – BeeOnRope Aug 06 '17 at 21:58
  • ... but only fast if the distribution is skewed enough (see the sketch after these comments). – BeeOnRope Aug 06 '17 at 21:58
  • @BeeOnRope yep, got it :) – awdz9nld Aug 06 '17 at 22:00
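
A hedged sketch of the early-exit check BeeOnRope describes above (the candidate values a, b, c -- assumed to be three distinct, likely mode values -- and the fallback routine are placeholders, not part of any answer below):

#include <x86intrin.h>

__m128i exact_mode_mask(__m128i v);   /* placeholder for any exact method below */

/* Early exit for a skewed distribution: count the hits for three distinct
   likely mode candidates a, b, c. If the best candidate cannot be beaten by
   anything hiding in the remaining lanes, it is the mode (or tied for it). */
static __m128i mode_mask_with_guess(__m128i v, char a, char b, char c)
{
    __m128i eq_a = _mm_cmpeq_epi8(v, _mm_set1_epi8(a));
    __m128i eq_b = _mm_cmpeq_epi8(v, _mm_set1_epi8(b));
    __m128i eq_c = _mm_cmpeq_epi8(v, _mm_set1_epi8(c));
    int fa = __builtin_popcount(_mm_movemask_epi8(eq_a));
    int fb = __builtin_popcount(_mm_movemask_epi8(eq_b));
    int fc = __builtin_popcount(_mm_movemask_epi8(eq_c));
    int m  = fa > fb ? (fa > fc ? fa : fc) : (fb > fc ? fb : fc);
    if (m >= 16 - (fa + fb + fc)) {   /* nothing else can occur more often */
        if (m == fa) return eq_a;
        if (m == fb) return eq_b;
        return eq_c;
    }
    return exact_mode_mask(v);        /* fall back to a slower exact method */
}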

3 Answers

5

Sort the data in the register. An insertion sort can be done in 16 (15) steps, by initializing the register to "Infinity" (all 0xFF, which stands in for a monotonically decreasing array) and inserting each new element into all possible places in parallel:

// 'my_array' holds the 16 input bytes; 'const_FFFFF00' has 0xFF in every byte
// except the lowest, so 'sorted' starts out as:
// e.g. FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF 78
__m128i sorted = _mm_or_si128(_mm_loadu_si128((const __m128i *)my_array),
                              const_FFFFF00);

for (int i = 1; i < 16; ++i)
{
    // Trying to insert e.g. A0, we must shift all the FF's to left
    // e.g. FF FF FF FF FF FF FF FF FF FF FF FF FF FF 78 00
    __m128i shifted = _mm_bslli_si128(sorted, 1);

    // Taking the MAX of shifted and 'A0 on all places'
    // e.g. FF FF FF FF FF FF FF FF FF FF FF FF FF FF A0 A0
    shifted = _mm_max_epu8(shifted, _mm_set1_epi8(my_array[i]));

    // and minimum of the shifted + original --
    // e.g. FF FF FF FF FF FF FF FF FF FF FF FF FF FF A0 78
    sorted = _mm_min_epu8(sorted, shifted);
}

Then calculate a mask for vec[n+1] == vec[n], move the mask to a GPR and use that to index a 32768-entry LUT for the best index location.
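
A hedged sketch of that step, continuing from the sorting loop above (the 32768-entry table mode_pos_lut is hypothetical and would be precomputed offline):

// Build the 15-bit "equal to the next element" pattern over the sorted
// vector and use it to look up the position of the longest run, i.e. the
// position of a mode-valued element.
extern const unsigned char mode_pos_lut[1 << 15];   // hypothetical precomputed table

__m128i eq   = _mm_cmpeq_epi8(sorted, _mm_bsrli_si128(sorted, 1));
int     runs = _mm_movemask_epi8(eq) & 0x7FFF;  // bit n set iff sorted[n] == sorted[n+1]
int     pos  = mode_pos_lut[runs];              // index of a mode-valued element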

In a real case one probably wants to sort more than just one vector; i.e. sort 16 16-entry vectors at once:

__m128i input[16];      // not 1, but 16 vectors
transpose16x16(input);  // in-place vector transpose
sort(input);            // a 60-stage sorting network exists for 16 inputs
// linear search -- result in 'mode'
__m128i mode = input[0];
__m128i previous = mode;
__m128i count = _mm_setzero_si128();
__m128i max_count = _mm_setzero_si128();
for (int i = 1; i < 16; i++)
{
   __m128i current = input[i];
   // histogram count is off by one
   // if (current == previous) count++;
   //    else count = 0;
   // if (count > max_count)
   //    mode = current, max_count = count
   __m128i eq = _mm_cmpeq_epi8(previous, current);      // -1 where equal, 0 elsewhere
   count = _mm_and_si128(_mm_sub_epi8(count, eq), eq);  // count+1 where equal, reset to 0 elsewhere
   __m128i max_so_far = _mm_cmplt_epi8(max_count, count);
   mode = _mm_blendv_epi8(mode, current, max_so_far);
   max_count = _mm_max_epi8(max_count, count);
   previous = current;
}

The inner loop has an amortized cost of 7-8 instructions per result. Sorting typically takes 2 instructions per stage -- i.e. about 8 instructions per result, since 16 results need 60 stages or 120 instructions. (This still leaves the transpose as an exercise -- but I think it should be vastly faster than sorting?)

So, this should be in the ball park of 24 instructions per 8-bit result.

Aki Suihkonen
  • Probably a better balance of operations is possible with 7 insertion stages (to the low 8 and high 8 values simultaneously) followed by bitonic merges of 8, 4, 2 and 1... – Aki Suihkonen Aug 03 '17 at 13:31
  • that's a pretty neat insertion sort :) sorting was the first thing that came to my mind, but its linear complexity is going to nullify any gains in my specific case, unfortunately :( – awdz9nld Aug 03 '17 at 13:59
  • A byte-element transpose of 16 vectors is going to be expensive (non-trivial compared to sorting), I think. With only 16 registers (until AVX512), you're going to have to spill some. Besides that, you have a lot of work to do with 16 elements per vector. `shufps` is a nice 2-input shuffle, but only works in chunks of 4 elements. `pblendvb` is 2 uops that can only run on port5 on Intel HSW/BDW. (Not as bad on other uarches; Skylake can run the non-AVX version (blend control in `xmm0`) in 1 uop for any port, or the AVX-128 version in 2 uops for any port.) – Peter Cordes Aug 06 '17 at 04:48
5

Probably a relatively simple brute-force SSEx approach is suitable here; see the code below. The idea is to byte-rotate the input vector `v` by 1 to 15 positions and compare each rotated vector with the original `v` for equality. To shorten the dependency chain and to increase the instruction-level parallelism, two counters are used to count (vertical sum) these equal elements: `sum1` and `sum2`, because there might be architectures that benefit from this. Equal elements are counted as -1, so `sum = sum1 + sum2` contains the total count with values between -1 and -16. `min_brc` contains the horizontal minimum of `sum` broadcast to all elements, and `mask = _mm_cmpeq_epi8(sum, min_brc)` is the mask for the mode-valued elements requested as an intermediate result by the OP. In the next few lines of the code the actual mode is extracted.

This solution is certainly faster than a scalar solution. Note that with AVX2 the upper 128-bit lanes can be used to speed up the computation further.

It takes 20 cycles (throughput) to compute only the mask for the mode-valued elements. With the actual mode broadcast across the SSE register as well, it takes about 21.4 cycles.

Note the behaviour in the next example: [1, 1, 3, 3, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16] returns mask=[-1,-1,-1,-1,0,0,...,0] and the mode value is 1, although 1 occurs as often as 3.

The code below is tested, but not thoroughly.

#include <stdio.h>
#include <x86intrin.h>
/*  gcc -O3 -Wall -m64 -march=nehalem mode_uint8.c   */
int print_vec_char(__m128i x);

__m128i mode_statistic(__m128i v){
    __m128i  sum2         = _mm_set1_epi8(-1);                    /* Each integer occurs at least one time */
    __m128i  v_rot1       = _mm_alignr_epi8(v,v,1);
    __m128i  v_rot2       = _mm_alignr_epi8(v,v,2);
    __m128i  sum1         =                   _mm_cmpeq_epi8(v,v_rot1);
             sum2         = _mm_add_epi8(sum2,_mm_cmpeq_epi8(v,v_rot2));

    __m128i  v_rot3       = _mm_alignr_epi8(v,v,3);
    __m128i  v_rot4       = _mm_alignr_epi8(v,v,4);
             sum1         = _mm_add_epi8(sum1,_mm_cmpeq_epi8(v,v_rot3));
             sum2         = _mm_add_epi8(sum2,_mm_cmpeq_epi8(v,v_rot4));

    __m128i  v_rot5       = _mm_alignr_epi8(v,v,5);
    __m128i  v_rot6       = _mm_alignr_epi8(v,v,6);
             sum1         = _mm_add_epi8(sum1,_mm_cmpeq_epi8(v,v_rot5));
             sum2         = _mm_add_epi8(sum2,_mm_cmpeq_epi8(v,v_rot6));

    __m128i  v_rot7       = _mm_alignr_epi8(v,v,7);
    __m128i  v_rot8       = _mm_alignr_epi8(v,v,8);
             sum1         = _mm_add_epi8(sum1,_mm_cmpeq_epi8(v,v_rot7));
             sum2         = _mm_add_epi8(sum2,_mm_cmpeq_epi8(v,v_rot8));

    __m128i  v_rot9       = _mm_alignr_epi8(v,v,9);
    __m128i  v_rot10      = _mm_alignr_epi8(v,v,10);
             sum1         = _mm_add_epi8(sum1,_mm_cmpeq_epi8(v,v_rot9));
             sum2         = _mm_add_epi8(sum2,_mm_cmpeq_epi8(v,v_rot10));

    __m128i  v_rot11      = _mm_alignr_epi8(v,v,11);
    __m128i  v_rot12      = _mm_alignr_epi8(v,v,12);
             sum1         = _mm_add_epi8(sum1,_mm_cmpeq_epi8(v,v_rot11));
             sum2         = _mm_add_epi8(sum2,_mm_cmpeq_epi8(v,v_rot12));

    __m128i  v_rot13      = _mm_alignr_epi8(v,v,13);
    __m128i  v_rot14      = _mm_alignr_epi8(v,v,14);
             sum1         = _mm_add_epi8(sum1,_mm_cmpeq_epi8(v,v_rot13));
             sum2         = _mm_add_epi8(sum2,_mm_cmpeq_epi8(v,v_rot14));

    __m128i  v_rot15      = _mm_alignr_epi8(v,v,15);
             sum1         = _mm_add_epi8(sum1,_mm_cmpeq_epi8(v,v_rot15));
    __m128i  sum          = _mm_add_epi8(sum1,sum2);                      /* Sum contains values such as -1, -2 ,...,-16                                    */
                                                                          /* The next three instructions compute the horizontal minimum of sum */
    __m128i  sum_shft     = _mm_srli_epi16(sum,8);                        /* Shift right 8 bits, while shifting in zeros                                    */
    __m128i  min1         = _mm_min_epu8(sum,sum_shft);                   /* sum and sum_shft are considered as unsigned integers. sum_shft is zero at the odd positions and so is min1 */ 
    __m128i  min2         = _mm_minpos_epu16(min1);                       /* Byte 0 within min2 contains the horizontal minimum of sum                      */
    __m128i  min_brc      = _mm_shuffle_epi8(min2,_mm_setzero_si128());   /* Broadcast horizontal minimum                                                   */

    __m128i  mask         = _mm_cmpeq_epi8(sum,min_brc);                  /* Mask = -1 at the byte positions where the value of v is equal to the mode of v */

    /* comment next 4 lines out if there is no need to broadcast the mode value */
    int      bitmask      = _mm_movemask_epi8(mask);
    int      indx         = __builtin_ctz(bitmask);                            /* Index of mode                            */
    __m128i  v_indx       = _mm_set1_epi8(indx);                               /* Broadcast indx                           */
    __m128i  answer       = _mm_shuffle_epi8(v,v_indx);                        /* Broadcast mode to each element of answer */ 

/* Uncomment lines below to print intermediate results, to see how it works. */
//    printf("sum         = ");print_vec_char (sum           );
//    printf("sum_shft    = ");print_vec_char (sum_shft      );
//    printf("min1        = ");print_vec_char (min1          );
//    printf("min2        = ");print_vec_char (min2          );
//    printf("min_brc     = ");print_vec_char (min_brc       );
//    printf("mask        = ");print_vec_char (mask          );
//    printf("v_indx      = ");print_vec_char (v_indx        );
//    printf("answer      = ");print_vec_char (answer        );

             return answer;   /* or return mask, or return both ....    :) */
}


int main() {
    /* To test throughput set throughput_test to 1, otherwise 0    */
    /* Use e.g. perf stat -d ./a.out to test throughput           */
    #define throughput_test 0

    /* Different test vectors  */
    int i;
    char   x1[16] = {5, 2, 2, 7, 21, 4, 7, 7, 3, 9, 2, 5, 4, 3, 5, 5};
    char   x2[16] = {5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5};
    char   x3[16] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16};
    char   x4[16] = {1, 2, 3, 2, 1, 6, 7, 8, 2, 2, 2, 3, 3, 2, 15, 16};
    char   x5[16] = {1, 1, 3, 3, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16};

    printf("\n15...0      =   15  14  13  12    11  10  9   8     7   6   5   4     3   2   1   0\n\n");

    __m128i  x_vec  = _mm_loadu_si128((__m128i*)x1);

    printf("x_vec       = ");print_vec_char(x_vec        );

    __m128i  y      = mode_statistic (x_vec);

    printf("answer      = ");print_vec_char(y         );


    #if throughput_test == 1
    __m128i  x_vec1  = _mm_loadu_si128((__m128i*)x1);
    __m128i  x_vec2  = _mm_loadu_si128((__m128i*)x2);
    __m128i  x_vec3  = _mm_loadu_si128((__m128i*)x3);
    __m128i  x_vec4  = _mm_loadu_si128((__m128i*)x4);
    __m128i  x_vec5  = _mm_loadu_si128((__m128i*)x5);
    __m128i  y1, y2, y3, y4, y5;
    __asm__ __volatile__ ( "vzeroupper" : : : );   /* Remove this line on non-AVX processors */
    for (i=0;i<100000000;i++){
        y1       = mode_statistic (x_vec1);
        y2       = mode_statistic (x_vec2);
        y3       = mode_statistic (x_vec3);
        y4       = mode_statistic (x_vec4);
        y5       = mode_statistic (x_vec5);
        x_vec1   = mode_statistic (y1    );
        x_vec2   = mode_statistic (y2    );
        x_vec3   = mode_statistic (y3    );
        x_vec4   = mode_statistic (y4    );
        x_vec5   = mode_statistic (y5    );
     }
    printf("mask mode   = ");print_vec_char(y1           );
    printf("mask mode   = ");print_vec_char(y2           );
    printf("mask mode   = ");print_vec_char(y3           );
    printf("mask mode   = ");print_vec_char(y4           );
    printf("mask mode   = ");print_vec_char(y5           );
    #endif

    return 0;
}



int print_vec_char(__m128i x){
    char v[16];
    _mm_storeu_si128((__m128i *)v,x);
    printf("%3hhi %3hhi %3hhi %3hhi | %3hhi %3hhi %3hhi %3hhi | %3hhi %3hhi %3hhi %3hhi | %3hhi %3hhi %3hhi %3hhi\n",
           v[15],v[14],v[13],v[12],v[11],v[10],v[9],v[8],v[7],v[6],v[5],v[4],v[3],v[2],v[1],v[0]);
    return 0;
}

Output:

15...0      =   15  14  13  12    11  10  9   8     7   6   5   4     3   2   1   0

x_vec       =   5   5   3   4 |   5   2   9   3 |   7   7   4  21 |   7   2   2   5
sum         =  -4  -4  -2  -2 |  -4  -3  -1  -2 |  -3  -3  -2  -1 |  -3  -3  -3  -4
min_brc     =  -4  -4  -4  -4 |  -4  -4  -4  -4 |  -4  -4  -4  -4 |  -4  -4  -4  -4
mask        =  -1  -1   0   0 |  -1   0   0   0 |   0   0   0   0 |   0   0   0  -1
answer      =   5   5   5   5 |   5   5   5   5 |   5   5   5   5 |   5   5   5   5

The horizontal minimum is computed with Evgeny Kluev's method.

wim
  • If you start with set1(255), and add the `-1` compare result for each match, I think you can do the bit at the end with `phminposuw`. It operates on words, so you need to `pmovzxbw` the low half, and `punpckhbw` the high half with zeros. Then `phminposuw` will give you the index of the lowest counter from each half. Err, then you run into the problem of deciding which half to take. `pminuw` and use `pcmpeqw` to find out which vector's count value was already the minimum, then use that compare result to blend `index_low` and `index_high+8`. That will be an index into your original vector. – Peter Cordes Aug 06 '17 at 05:00
  • That didn't end up as efficient as I thought; it may not be much better than your sequence after the printf, which I think is solving the same problem. – Peter Cordes Aug 06 '17 at 05:01
  • You're effectively starting your counters at 0, which is fine for your end code, but `phminposuw` sees 0 as less than `-3` = unsigned `253`. You skip comparing each element against itself, which would give singleton elements a count of -1. But of course, `pcmpeqb same,same` is a recognized dependency-breaking idiom for generating all-ones. :P Anyway, the `phminposuw` cleanup adds 2 instructions to overall cost for initializing the counters to all-ones. – Peter Cordes Aug 06 '17 at 05:10
  • @PeterCordes Indeed `_mm_minpos_epu16` is more suitable for 8-bit integers than I initially thought. See also [here](https://stackoverflow.com/a/22268350/2439725). I have to think about it. Note that the `min4_brc = _mm_shuffle_epi8(min4,_mm_setzero_si128())` in the code above is redundant because the minimum is already in each of the 16 elements. – wim Aug 06 '17 at 09:00
  • Turns out what the OP actually needs is a mask of the mode-valued elements. I think we can get that more efficiently than the actual value, by broadcasting the mode's `sum` and comparing it to the sum vector. – Peter Cordes Aug 06 '17 at 18:22
  • This makes `_mm_minpos_epu16` much more attractive: just minpos to get the lowest count value, then broadcast (the low byte) + pcmp that against the non-unpacked sum vector. None of that `index_low` vs. `index_high+8` selection is needed. With AVX2, you could use `vpbroadcastb` to save the need for an all-zero mask for `pshufb`, but that's pretty trivial. – Peter Cordes Aug 06 '17 at 18:33
  • @PeterCordes I do need to obtain the value, too, which is obviously doable with the mask. either way, the mask and the value are equivalent in terms of the end result. I doubt that the difference will allow for gains that will be able to help (me) – awdz9nld Aug 06 '17 at 21:49
  • @MartinKällman: But you don't need the mode extracted or broadcast, or anything, right? `v & mode_mask(v)` is only one more instruction (or maybe two if it takes an extra `movdqa` to not destroy the original `v`). – Peter Cordes Aug 06 '17 at 22:00
  • @PeterCordes the mode will be extracted, yes, but that's of lesser concern at the moment ;) – awdz9nld Aug 06 '17 at 22:01
  • @MartinKällman: Then please edit your question to say what results you do need. My edit left out the scalar value, since I thought you just needed the mask. The 16x `palignr`+`pcmpeqb`+`paddb` are most of the cost, but an efficient way to get a mask + scalar result is definitely useful. You could have the mask ready with lower latency than the scalar result, or vice versa, so the best choice could even depend on which one is on the critical path. – Peter Cordes Aug 06 '17 at 22:07
  • @PeterCordes Indeed `phminposuw` is faster than the shift/min sequence. It saves about 2 cycles per function call. Note that the throughput is about 21.4 cycles. That looks reasonable with 18 port 5 operations: 3.71 instructions per cycle on Skylake. – wim Aug 06 '17 at 23:00
  • @PeterCordes Note that extracting the mode value takes 1.4 cycle (throughput) more than computing the mask only. I don't know if the OP is interested in low latency or in fast throughput, though. – wim Aug 06 '17 at 23:22
0

For performance comparison, here is scalar code: the main part is not vectorized, but the table clearing and the tmp store are. It takes about 168 cycles per f() call on an FX-8150 (22M calls complete in 1.0002 seconds at 3.7 GHz).

#include <x86intrin.h>

alignas(16) unsigned char tmp[16]; // extracted values land here (a single aligned 16-byte store)
unsigned char table[256];          // counter table containing zeroes (re-zeroed at the end of each call)
char f(__m128i values)
{
    _mm_store_si128((__m128i *)tmp,values);
    int maxOccurence=0;
    int currentValue=0;
    for(int i=0;i<16;i++)
    {
        unsigned char ind=tmp[i];
        unsigned char t=table[ind];
        t++;
        if(t>maxOccurence)
        {
             maxOccurence=t;
             currentValue=ind;
        }
        table[ind]=t;
    }
    for(int i=0;i<256;i++)
        table[i]=0;
    return currentValue;
}

g++ 6.3 output:

f:                                      # @f
        movaps  %xmm0, tmp(%rip)
        movaps  %xmm0, -24(%rsp)
        xorl    %r8d, %r8d
        movq    $-15, %rdx
        movb    -24(%rsp), %sil
        xorl    %eax, %eax
        jmp     .LBB0_1
.LBB0_2:                                # %._crit_edge
        cmpl    %r8d, %esi
        cmovgel %esi, %r8d
        movb    tmp+16(%rdx), %sil
        incq    %rdx
.LBB0_1:                                # =>This Inner Loop Header: Depth=1
        movzbl  %sil, %edi
        movb    table(%rdi), %cl
        incb    %cl
        movzbl  %cl, %esi
        cmpl    %r8d, %esi
        cmovgl  %edi, %eax
        movb    %sil, table(%rdi)
        testq   %rdx, %rdx
        jne     .LBB0_2
        xorps   %xmm0, %xmm0
        movaps  %xmm0, table+240(%rip)
        movaps  %xmm0, table+224(%rip)
        movaps  %xmm0, table+208(%rip)
        movaps  %xmm0, table+192(%rip)
        movaps  %xmm0, table+176(%rip)
        movaps  %xmm0, table+160(%rip)
        movaps  %xmm0, table+144(%rip)
        movaps  %xmm0, table+128(%rip)
        movaps  %xmm0, table+112(%rip)
        movaps  %xmm0, table+96(%rip)
        movaps  %xmm0, table+80(%rip)
        movaps  %xmm0, table+64(%rip)
        movaps  %xmm0, table+48(%rip)
        movaps  %xmm0, table+32(%rip)
        movaps  %xmm0, table+16(%rip)
        movaps  %xmm0, table(%rip)
        movsbl  %al, %eax
        ret
huseyin tugrul buyukisik
  • Even on an AMD CPU where int and SIMD ALUs don't compete for the same execution ports (like they do on Intel), the front-end is still a significant bottleneck for this version. If @wim's 21.4c per call throughput on Skylake is anywhere near what you'd get on your AMD Piledriver(?) , you'd need to interleave about 8 SIMD vectors into one scalar function. That's plausible: `palignr` throughput is the same 1 per clock on skylake and Bulldozer-family, and `paddb`/`pcmpeqb` run on 2 different ports. But the SIMD version still uses up 3/4 of your front-end bandwidth, except in the cleanup. – Peter Cordes Aug 10 '17 at 00:35
  • Anyway, maybe interleave @wim's SIMD version into the inner loop of this, so you do 16 SIMD per 1 integer. Or maybe unroll the inner loop by 2 before interleaving, and do 8 SIMD per 1 integer. That would only be good on Steamroller with multiple threads, where each core of a pair (that share a SIMD unit) can decode 4 instructions per clock. On Bulldozer/Piledriver, two threads per "cluster" get alternate cycles on one decoder, so there's little extra front-end bandwidth to keep the integer ALUs fed without starving the SIMD ALUs. – Peter Cordes Aug 10 '17 at 00:41
  • So you mean the algorithmic complexity is too much for too little gain (if any, and only on AMD)? – huseyin tugrul buyukisik Aug 10 '17 at 00:42
  • CPUs have more execution units than they can use all at the same time, because of front-end limitations. But whatever workload you are running, there are lots of execution units for that workload, often enough to use most of the front-end issue/rename capacity. See also https://en.wikipedia.org/wiki/Dark_silicon. – Peter Cordes Aug 10 '17 at 00:47
  • Bulldozer/Piledriver probably gain almost as little as Haswell/Skylake. Or without AVX, there are probably some extra MOVDQA instructions that will eat up the remaining front-end bandwidth in @wim's version without needing ALU uops. Steamroller is actually interesting, though, because with 2 threads on a pair of cores, you have a lot more front-end bandwidth than the shared SIMD execution units can handle. – Peter Cordes Aug 10 '17 at 00:49
  • More registers, a bigger instruction window (to re-order/rename)? Why is the Bulldozer/Steamroller cache so much slower than Intel's? Latency? When I use AVX, it's stacking up more instructions than with SSE since they come from the same AVX unit, right? – huseyin tugrul buyukisik Aug 10 '17 at 00:51
  • The issue/rename stage takes decoded uops and inserts them into the out-of-order part of the core, where they wait for their inputs to be ready. See http://www.realworldtech.com/haswell-cpu/3/. Making this part wider has significant power costs, and a lot of code doesn't have enough instruction-level parallelism for that to be worth it. (Ryzen is 6-wide, but decodes 256b AVX instructions to 2 uops each.) As for Bulldozer-family's slow cache, there's no simple answer. I forget if David Kanter mentions it in http://www.realworldtech.com/bulldozer/ – Peter Cordes Aug 10 '17 at 00:57
  • Wouldn't it be better if SIMD units had their own schedulers and we just showed them the area to be computed? That way they would execute N operations implicitly with 1 instruction (while locking that memory area against other threads). I mean, we could give 256 elements with a hypothetical _mm8192_add_ps() instead of 4, and it would do it with 4-wide SIMD. How much memory bandwidth could the instruction cache save? Also, I wouldn't say no to an extra 16 AVX registers instead of physically getting wider. Some particle-interaction functions can't be interleaved together because of the memory consumed by too many particle properties. Is Bulldozer = P4? – huseyin tugrul buyukisik Aug 10 '17 at 01:01
  • The source you linked says "TSX also supports multiple nested transactions, which requires hardware resources to track and disambiguate different levels of nesting. ". Is this about guessing branches and loading memory before reaching there? – huseyin tugrul buyukisik Aug 10 '17 at 01:12
  • No, TSX is about tracking what has actually been read and written since the start of a transaction, regardless of branching or prefetching. – Peter Cordes Aug 10 '17 at 01:14
  • What you're suggesting is [like old-style Cray vector machines](https://en.wikipedia.org/wiki/Vector_processor#Supercomputers) (where you specify the length, a bit like x86's `rep movs`), rather than the modern short-vector-register SIMD style. Related: [discussion on Agner Fog's blog forum](http://agner.org/optimize/blog/read.php?i=421#442) about designing a new ISA that could potentially replace x86. Various ideas about how to expose vectors for use by software were proposed, including some mention of Cray-style long vectors. – Peter Cordes Aug 10 '17 at 01:28
  • [AVX-512 4FMAPS and 4VNNIW](https://en.wikipedia.org/wiki/AVX-512#New_instructions_in_AVX-512_4FMAPS_and_4VNNIW) are taking steps in that direction, though: One instruction that does 4 load & FMA operations to 4 different destination registers (e.g. zmm8..zmm11). See [the details for `V4FMADDPS`](https://github.com/HJLebbink/asm-dude/wiki/V4FMADDPS). This is probably aimed for Xeon Phi, where instruction-decode is a huge bottleneck. This can get 4 uops of work into the pipeline for only one x86 instruction. (But that only helps with decode and I-cache space, not issue/rename bandwidth.) – Peter Cordes Aug 10 '17 at 01:35
  • `Packed/scalar single-precision floating point fused multiply-add and negate (4-iterations)` - yes. This is a better way imo. Nearly equivalent to 16-wide. Also, some more AVX registers wouldn't hurt. Why do only 32 of them exist? Is that the most expensive part? Not scalable? xmm[dynamic-index up to 64] would be awesome. – huseyin tugrul buyukisik Aug 10 '17 at 01:38
  • Unless some future x86 CPU can decode `V4FMADDPS` to fewer than 4 uops, which would be a MAJOR change from how things work now. Currently, a single uop can only write one output register. – Peter Cordes Aug 10 '17 at 01:38
  • 32 registers takes 5 bits in the instruction, for each of dst, src1, and src2. Doubling that again would take another 3 bits per instruction. AVX512 already has 4-byte EVEX prefixes, but uses some of those bits for rounding-mode overrides, broadcast-loads, and vector-length (128/256/512). Also, 64 regs would double the size of the FPU architectural state that needs to be saved/restored on context switches. – Peter Cordes Aug 10 '17 at 01:43
  • But if it is working in a "vector processing mode", it would be completely independent from other code that uses the same register names. While the vector processor is using ymm[20], some other code running in parallel could use the 20th register too, since it is locked and meant for vector processing, so the two ymm20s would have different meanings (ymm20 and ymm42) until the vector operation ends. I mean, the same names for different vector operations. Possible? – huseyin tugrul buyukisik Aug 10 '17 at 01:46
  • There's no such thing as "vector processing mode" in current CPUs. That would be a MASSIVE change to how they work internally. IDK how you think it could be independent of other code that uses the same architectural registers. – Peter Cordes Aug 10 '17 at 01:49
  • Yes, I think I was talking nonsense. I was accidentally dreaming in SQL stored-procedure style: somehow storing procedures in the CPU and then sending requests (instructions) to them in parallel, explicitly through its instruction cache. – huseyin tugrul buyukisik Aug 10 '17 at 01:51
  • That is a little bit like how Cray vector machines worked, I think. One instruction requests a big operation, and hardware with more internal parallelism / wider internal ALUs can run it faster. But they didn't have out-of-order execution, and *only* the vector stuff was fast. Letting future CPUs run the same vectorized code faster without recompiling is a hard problem (and a big part of the discussion on Agner Fog's forum about have a vector-length register or something in a hypothetical new ISA). – Peter Cordes Aug 10 '17 at 02:04