19

I'm trying to compute the dot product between a float vector and a bit vector in the most efficient manner on an i7. In reality, I'm doing this operation on either 128- or 256-dimensional vectors, but for illustration, here is the code for the 64-dimensional case:

// a has 64 elements. b is a bitvector of 64 dimensions.
float dot(float *restrict a, uint64_t b) {
    float sum = 0;
    for(int i=0; b && i<64; i++, b>>=1) {
        if (b & 1) sum += a[i];
    }
    return sum;
}

This works, of course, but the problem is that this is the time-critical spot of the whole program (it eats up 95% of the CPU time of a 50-minute run), so I desperately need to make it faster.

My guess is that the branching above is the game killer (it prevents out-of-order execution and causes bad branch prediction). I'm not sure whether vector instructions could be used here, or whether they would help. Using gcc 4.8 with -std=c99 -march=native -mtune=native -Ofast -funroll-loops, I'm currently getting this output:

    movl    $4660, %edx
    movl    $5, %ecx
    xorps   %xmm0, %xmm0
    .p2align 4,,10
    .p2align 3
.L4:
    testb   $1, %cl
    je  .L2
    addss   (%rdx), %xmm0
.L2:
    leaq    4(%rdx), %rax
    shrq    %rcx
    testb   $1, %cl
    je  .L8
    addss   4(%rdx), %xmm0
.L8:
    shrq    %rcx
    testb   $1, %cl
    je  .L9
    addss   4(%rax), %xmm0
.L9:
    shrq    %rcx
    testb   $1, %cl
    je  .L10
    addss   8(%rax), %xmm0
.L10:
    shrq    %rcx
    testb   $1, %cl
    je  .L11
    addss   12(%rax), %xmm0
.L11:
    shrq    %rcx
    testb   $1, %cl
    je  .L12
    addss   16(%rax), %xmm0
.L12:
    shrq    %rcx
    testb   $1, %cl
    je  .L13
    addss   20(%rax), %xmm0
.L13:
    shrq    %rcx
    testb   $1, %cl
    je  .L14
    addss   24(%rax), %xmm0
.L14:
    leaq    28(%rax), %rdx
    shrq    %rcx
    cmpq    $4916, %rdx
    jne .L4
    ret

Edit: It's okay to permute the data (as long as the permutation is the same for all parameters); the ordering doesn't matter.

I'm wondering if there is something that will work at >3x speed of Chris Dodd's SSE2 code.

New note: AVX/AVX2 code is also welcome!

Edit 2: Given a bitvector, I have to multiply it with 128 (or 256, if it's 256-bit) different float vectors (so it's also okay to involve more than a single float vector at a time). This is the whole process. Anything that speeds up the whole process is also welcome!
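
To make the shape of the whole process concrete, here is a rough sketch (the names and layout are mine, reusing dot() from above on each 64-float half of a 128-float vector):

#include <stdint.h>

float dot(float *restrict a, uint64_t b);   /* the 64-element function above */

/* Sketch: one 128-bit mask (two uint64_t halves) applied to 128 different
 * 128-float vectors. */
void dot_all(float out[128], float vecs[128][128], const uint64_t mask[2]) {
    for (int v = 0; v < 128; v++)
        out[v] = dot(&vecs[v][0], mask[0]) + dot(&vecs[v][64], mask[1]);
}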

John Smith
  • 1
    This seems like good code to rewrite for and execute on a GPU, or at the very least multithread – Patashu Apr 17 '13 at 04:26
  • 1
    GPU solutions are also welcome :) But dot() needs to be executed many many times and fed back to the big CPU-side algorithm (can't imagine it could be written to run on GPU), meaning too many data transfer between GPU and CPU, which may cause a huge penalty. The program is already multi-threaded, running dot() in parallel BTW. – John Smith Apr 17 '13 at 04:34
  • 3
    I was thinking about a flow: `xmm0=0; xmm1 = *next_4_floats++; xmm0 = conditional_copy(xmm1,mask); xmm2+=xmm0; mask<<=1;` In the end one has to do a single horizontal sum over xmm2 elements. Copying based on the mask would probably be one instruction shorter than forming 32-bit masks of all ones/zeros and using parallel and. OTOH one would need to preprocess the mask to order `vec_128 = |048c.....|159d....|26ae....|37bf....|` – Aki Suihkonen Apr 17 '13 at 06:01
  • 1
    I'm sorry that I'm not really familiar with SSE, but what is conditional_copy and mask here? And does mask have anything to do with b? – John Smith Apr 17 '13 at 06:09
  • 1
    The mask would be based on `b`; and the conditional copy instruction would be most likely `blendvps` from SSE4.1 – Aki Suihkonen Apr 17 '13 at 08:02
  • It sounds like it would be even faster than the SSE2 solution below :) I'm a bit stuck about the mask though. – John Smith Apr 17 '13 at 08:35
  • I have updated my solution with AVX2 code using VBLENDVPS. There are now 3 instructions in the kernel (the body of the "loop"). – Iwillnotexist Idonotexist Nov 11 '13 at 00:42
  • Another way to describe this is a masked reduction (sum). AVX512 has masking built-in, e.g. with [`_mm256_mask_add_ps`](https://software.intel.com/sites/landingpage/IntrinsicsGuide/#techs=SSE,SSE2,SSE3,SSSE3,SSE4_1,SSE4_2,AVX,AVX2,AVX_512&expand=6150,6150,401,2424,4728,2796,131&othertechs=ADX&text=_mm256_mask_add_ps). It's literally one asm instruction to `vaddps` from a memory source, leaving elements unchanged where the bit in the mask register was unset. e.g. `vaddps zmm0{k1}, zmm0, [rdi]` / `kshift k1, k1, 16` is your loop body. – Peter Cordes Apr 03 '19 at 11:30
  • See also [is there an inverse instruction to the movemask instruction in intel avx2?](//stackoverflow.com/q/36488675) for bitmask to vector mask. – Peter Cordes Nov 07 '19 at 18:18

8 Answers

16

The best bet is going to be to use SSE ps instructions that operate on 4 floats at a time. You can take advantage of the fact that a float 0.0 is all 0 bits to use an andps instruction to mask off the undesired elements:

#include <stdint.h>
#include <xmmintrin.h>

union {
    uint32_t i[4];
    __m128   xmm;
} mask[16] = {
 {  0,  0,  0,  0 },
 { ~0,  0,  0,  0 },
 {  0, ~0,  0,  0 },
 { ~0, ~0,  0,  0 },
 {  0,  0, ~0,  0 },
 { ~0,  0, ~0,  0 },
 {  0, ~0, ~0,  0 },
 { ~0, ~0, ~0,  0 },
 {  0,  0,  0, ~0 },
 { ~0,  0,  0, ~0 },
 {  0, ~0,  0, ~0 },
 { ~0, ~0,  0, ~0 },
 {  0,  0, ~0, ~0 },
 { ~0,  0, ~0, ~0 },
 {  0, ~0, ~0, ~0 },
 { ~0, ~0, ~0, ~0 },
};

float dot(__m128 *a, uint64_t b) {
    __m128 sum = { 0.0 };
    for (int i = 0; i < 16; i++, b>>=4)
        sum += _mm_and_ps(a[i], mask[b&0xf].xmm);
    return sum[0] + sum[1] + sum[2] + sum[3];
}

If you expect there to be a lot of 0s in the mask, it might be faster to short-circuit the 0s:

for (int i = 0; b; i++, b >>= 4)
    if (b & 0xf)
        sum += _mm_and_ps(a[i], mask[b&0xf].xmm);

but if b is random, this will be slower.
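
For the larger 128/256-element inputs, a variant with two independent accumulators (as suggested in the comments below) hides more of the addps latency. A sketch only, assuming the same mask[] lookup table as above:

/* Sketch: two independent accumulators to hide addps latency. */
float dot2acc(__m128 *a, uint64_t b) {
    __m128 s0 = _mm_setzero_ps();
    __m128 s1 = _mm_setzero_ps();
    for (int i = 0; i < 16; i += 2, b >>= 8) {
        s0 = _mm_add_ps(s0, _mm_and_ps(a[i],     mask[b & 0xf].xmm));
        s1 = _mm_add_ps(s1, _mm_and_ps(a[i + 1], mask[(b >> 4) & 0xf].xmm));
    }
    __m128 s = _mm_add_ps(s0, s1);
    return s[0] + s[1] + s[2] + s[3];
}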

Chris Dodd
  • 1
    Thanks! The benchmark results aren't striking, but the improvement is definitely there: runs ~86% time of the original. However, I'm still wondering if there's an even better answer though. – John Smith Apr 17 '13 at 07:38
  • GCC reports that it can't vectorize the loop. I skimmed through http://gcc.gnu.org/projects/tree-ssa/vectorization.html but couldn't find the reason so far. – John Smith Apr 17 '13 at 07:59
  • 1
    @JohnSmith: The loop is already vectorized (using explicit vector types and instructions), so there's no further vectorization to be done. With -O3, gcc completely unrolls the above loop, as well. – Chris Dodd Apr 17 '13 at 17:32
  • I see. I'm really not familiar with SSE :) – John Smith Apr 18 '13 at 01:11
  • @ChrisDodd: your code is a wonderful example for how different the assembler output for different compilers can become :-) The only variation I can see is to turn it into a map-reduce because then you won't have dependencies on subsequent `addps`. I doubt the effect is huge though. – FrankH. Apr 18 '13 at 12:40
  • You need `-1U`, not `-0`, for the all-ones mask entries. In C, `-0` is an integer zero. But yes, a LUT is good. A real version of this would need multiple accumulators to hide FP add latency, though. (Another factor of 3 to 8 speedup for large inputs. With the OP's 256-element arrays, at least 2 accumulators would be appropriate; the mask unpacking will take a bit of time so we maybe can't sustain 2-per-clock `addps` anyway.) – Peter Cordes Apr 03 '19 at 10:19
  • Ahh, apparently I was leaning back too far away from my monitor, and was sleepy and didn't think of that. – Peter Cordes Apr 03 '19 at 18:20
  • See also [is there an inverse instruction to the movemask instruction in intel avx2?](//stackoverflow.com/q/36488675) for alternatives to a LUT of masks. Although that's probably best for the 4-bit case. – Peter Cordes Nov 07 '19 at 18:19
7

To expand on Aki Suihkonen's answer, reshaping the bitstring is helpful for conditionally moving floats. In the solution below, a two-stage bit permutation using the SSE instructions PMOVMASKB and PSHUFB, plus the BLENDVPS instruction, is used to achieve 1.25 elements handled per cycle on a Core 2 Duo 2.26GHz, which is 20 times the speed of my reference C code.

[EDIT: An AVX2 implementation was added. Performance is unknown because I cannot test it myself, but is expected to be double the speed. ]

Here is my implementation and testbench, the explanation follows.

SSE4.1 (old, for AVX2 see below)

Code

/* Includes */
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#include <smmintrin.h>  /* SSE 4.1 */
#include <time.h>



/* Defines */
#define ALIGNTO(n) __attribute__((aligned(n)))
#define USE_PINSRW 1
#define NUM_ITERS  2260000



/**
 * Bit mask shuffle.
 * 
 * This version uses a loop to store eight u16 and reloads them as one __m128i.
 */

__m128 bitMaskShuffleStoreAndReload(__m128i mask){
    const __m128i perm    ALIGNTO(16) = _mm_set_epi8(15, 7, 14, 6, 13, 5, 12, 4,
                                                     11, 3, 10, 2, 9,  1, 8,  0);
    int i;
    uint16_t interMask[8] ALIGNTO(16);

    /* Shuffle bitmask */
        /* Stage 1 */
    for(i=7;i>=0;i--){
        interMask[i] = _mm_movemask_epi8(mask);
        mask = _mm_slli_epi32(mask, 1);
    }

        /* Stage 2 */
    return _mm_castsi128_ps(
                    _mm_shuffle_epi8(
                        _mm_load_si128((const __m128i*)interMask),
                        perm)
                );
}

/**
 * Bit mask shuffle.
 * 
 * This version uses the PINSRW instruction.
 */

__m128 bitMaskShufflePINSRW(__m128i mask){
    const __m128i perm    ALIGNTO(16) = _mm_set_epi8(15, 7, 14, 6, 13, 5, 12, 4,
                                                     11, 3, 10, 2, 9,  1, 8,  0);
    __m128i  imask ALIGNTO(16);

    /* Shuffle bitmask */
        /* Stage 1 */
    imask = _mm_insert_epi16(imask, _mm_movemask_epi8(mask), 7);
    mask = _mm_slli_epi16(mask, 1);
    imask = _mm_insert_epi16(imask, _mm_movemask_epi8(mask), 6);
    mask = _mm_slli_epi16(mask, 1);
    imask = _mm_insert_epi16(imask, _mm_movemask_epi8(mask), 5);
    mask = _mm_slli_epi16(mask, 1);
    imask = _mm_insert_epi16(imask, _mm_movemask_epi8(mask), 4);
    mask = _mm_slli_epi16(mask, 1);
    imask = _mm_insert_epi16(imask, _mm_movemask_epi8(mask), 3);
    mask = _mm_slli_epi16(mask, 1);
    imask = _mm_insert_epi16(imask, _mm_movemask_epi8(mask), 2);
    mask = _mm_slli_epi16(mask, 1);
    imask = _mm_insert_epi16(imask, _mm_movemask_epi8(mask), 1);
    mask = _mm_slli_epi16(mask, 1);
    imask = _mm_insert_epi16(imask, _mm_movemask_epi8(mask), 0);

        /* Stage 2 */
    return _mm_castsi128_ps(
                    _mm_shuffle_epi8(
                        imask,
                        perm)
                );
}

/**
 * SSE 4.1 implementation.
 */

float dotSSE41(__m128 f[32], unsigned char maskArg[16]){
    int i, j, k;
    __m128i  mask      ALIGNTO(16) = _mm_load_si128((const __m128i*)maskArg);
    __m128   shufdMask ALIGNTO(16);
    __m128   zblended  ALIGNTO(16);
    __m128   sums      ALIGNTO(16) = _mm_setzero_ps();
    float    sumsf[4]  ALIGNTO(16);

    /* Shuffle bitmask */
#if USE_PINSRW
    shufdMask = bitMaskShufflePINSRW(mask);
#else
    shufdMask = bitMaskShuffleStoreAndReload(mask);
#endif

    /* Dot product */
    for(i=1;i>=0;i--){
        for(j=1;j>=0;j--){
            for(k=7;k>=0;k--){
                zblended  = _mm_setzero_ps();
                zblended  = _mm_blendv_ps(zblended, f[i*16+j+k*2], shufdMask);
                sums      = _mm_add_ps(sums, zblended);
                shufdMask = _mm_castsi128_ps(_mm_slli_epi32(_mm_castps_si128(shufdMask), 1));
            }
        }
    }

    /* Final Summation */
    _mm_store_ps(sumsf, sums);
    return sumsf[0] + sumsf[1] + sumsf[2] + sumsf[3];
}

/**
 * Reference C implementation
 */


float dotRefC(float f[128], unsigned char mask[16]){
    float sum = 0.0;
    int i;

    for(i=0;i<128;i++){
        sum += ((mask[i>>3]>>(i&7))&1) ? f[i] : 0.0;
    }

    return sum;
}

/**
 * Main
 */

int main(void){
    /* Variables */
        /* Loop Counter */
    int           i;

        /* Data to process */
    float         data[128] ALIGNTO(16);
    unsigned char mask[16]  ALIGNTO(16);
    float         refCSum, sseSum;

        /* Time tracking */
    clock_t       t1, t2, t3;
    double        refCTime, sseTime;



    /* Initialize mask and float arrays with some random data. */
    for(i=0;i<128;i++){
        if(i<16)
            mask[i]=rand();

        data[i] = random();
    }


    /* RUN TESTS */
    t1 = clock();
    for(i=0;i<NUM_ITERS;i++){
        refCSum = dotRefC(data, mask);
    }
    t2 = clock();
    for(i=0;i<NUM_ITERS;i++){
        sseSum  = dotSSE41((__m128*)data, mask);
    }
    t3 = clock();


    /* Compute time taken */
    refCTime = (double)(t2-t1)/CLOCKS_PER_SEC;
    sseTime  = (double)(t3-t2)/CLOCKS_PER_SEC;


    /* Print out results */
    printf("Results:\n"
           "RefC: Time: %f    Value: %f\n"
           "SSE:  Time: %f    Value: %f\n",
           refCTime, refCSum,
           sseTime,  sseSum);

    return 0;
}

Explanation

BLENDVPS uses the top bit in all four 32-bit lanes of the 128-bit register XMM0 to determine whether to move or not to move the value in the corresponding lane of its source operand into its destination operand. When loading data with MOVAPS, one gets 4 consecutive floats: For instance, the 8th, 9th, 10th and 11th floats. Of course, their selection or deselection must be controlled by the corresponding set of bits: For instance, the 8th, 9th, 10th and 11th bits in the bit string.

The problem is that when the mask is first loaded, the bits of these sets sit right beside each other (in the 8th, 9th, 10th and 11th positions), when in fact they should be 32 bits apart; remember, at some point they will have to occupy the 31st bit position of each lane (the 31st, 63rd, 95th and 127th positions within the XMM0 register).

What would ideally happen is a bit transposition that brings bits 0, 4, 8, 12, ... into lane zero, bits 1, 5, 9, 13, ... into lane one, bits 2, 6, 10, 14, ... into lane two and bits 3, 7, 11, 15, ... into lane three. Thus all sets of 4 bits that were previously contiguous are now strided 32 bits apart, one in each of the four 32-bit lanes. Then all it takes is a loop that iterates 32 times, each time shifting a new set of 4 bits into the top bit position of each lane.

Unfortunately x86 is not gifted with good bit manipulation instructions, so for lack of a clean way of doing a perfect transposition, a reasonable compromise is the one here.

In the mask, the 128 bits

  0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
 16  17  18  19  20  21  22  23  24  25  26  27  28  29  30  31
 32  33  34  35  36  37  38  39  40  41  42  43  44  45  46  47
 48  49  50  51  52  53  54  55  56  57  58  59  60  61  62  63
 64  65  66  67  68  69  70  71  72  73  74  75  76  77  78  79
 80  81  82  83  84  85  86  87  88  89  90  91  92  93  94  95
 96  97  98  99 100 101 102 103 104 105 106 107 108 109 110 111
112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127

are permuted, by eight PMOVMASKB and eight PSLLW instructions, first to

  0   8  16  24  32  40  48  56  64  72  80  88  96 104 112 120
  1   9  17  25  33  41  49  57  65  73  81  89  97 105 113 121
  2  10  18  26  34  42  50  58  66  74  82  90  98 106 114 122
  3  11  19  27  35  43  51  59  67  75  83  91  99 107 115 123
  4  12  20  28  36  44  52  60  68  76  84  92 100 108 116 124
  5  13  21  29  37  45  53  61  69  77  85  93 101 109 117 125
  6  14  22  30  38  46  54  62  70  78  86  94 102 110 118 126
  7  15  23  31  39  47  55  63  71  79  87  95 103 111 119 127

and by a single PSHUFB instruction to

  0   8  16  24  32  40  48  56   4  12  20  28  36  44  52  60
 64  72  80  88  96 104 112 120  68  76  84  92 100 108 116 124
  1   9  17  25  33  41  49  57   5  13  21  29  37  45  53  61
 65  73  81  89  97 105 113 121  69  77  85  93 101 109 117 125
  2  10  18  26  34  42  50  58   6  14  22  30  38  46  54  62
 66  74  82  90  98 106 114 122  70  78  86  94 102 110 118 126
  3  11  19  27  35  43  51  59   7  15  23  31  39  47  55  63
 67  75  83  91  99 107 115 123  71  79  87  95 103 111 119 127

We now iterate on four "runs", each of which contains eight sets of four bits spread 32 bits apart (as desired), using these sets as the mask control for BLENDVPS. The inherent awkwardness of the bit shuffle leads to the awkward-looking triply-nested loop in dotSSE41(), but with

clang -Ofast -ftree-vectorize -finline-functions -funroll-loops -msse4.1 -mssse3 dot.c -o dottest

the loops are unrolled anyway. The inner loop consists of 16 repeats of:

blendvps    0x90(%rsi),%xmm1
addps       %xmm4,%xmm1
pslld       $0x1,%xmm2
movdqa      %xmm2,%xmm0
xorps       %xmm4,%xmm4


As an aside, I was unable to firmly pin down which of my two bit shuffle versions was fastest, so I gave both implementations in my answer.

AVX2 (new, but untested)

Code

/* Includes */
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#include <immintrin.h>  /* AVX2 */
#include <time.h>



/* Defines */
#define ALIGNTO(n) __attribute__((aligned(n)))
#define NUM_ITERS  2260000


/**
 * Bit mask shuffle.
 * 
 * This version uses the PINSRD instruction.
 */

__m256 bitMaskShufflePINSRW(__m256i mask){
    __m256i  imask ALIGNTO(32);

    /* Shuffle bitmask */
    imask = _mm256_insert_epi32(imask, _mm256_movemask_epi8(mask), 7);
    mask = _mm256_slli_epi32(mask, 1);
    imask = _mm256_insert_epi32(imask, _mm256_movemask_epi8(mask), 6);
    mask = _mm256_slli_epi32(mask, 1);
    imask = _mm256_insert_epi32(imask, _mm256_movemask_epi8(mask), 5);
    mask = _mm256_slli_epi32(mask, 1);
    imask = _mm256_insert_epi32(imask, _mm256_movemask_epi8(mask), 4);
    mask = _mm256_slli_epi32(mask, 1);
    imask = _mm256_insert_epi32(imask, _mm256_movemask_epi8(mask), 3);
    mask = _mm256_slli_epi32(mask, 1);
    imask = _mm256_insert_epi32(imask, _mm256_movemask_epi8(mask), 2);
    mask = _mm256_slli_epi32(mask, 1);
    imask = _mm256_insert_epi32(imask, _mm256_movemask_epi8(mask), 1);
    mask = _mm256_slli_epi32(mask, 1);
    imask = _mm256_insert_epi32(imask, _mm256_movemask_epi8(mask), 0);

    /* Return bitmask */
    return _mm256_castsi256_ps(imask);
}

/**
 * AVX2 implementation.
 */

float dotAVX2(__m256 f[16], unsigned char maskArg[16]){
    int i, j, k;
    /* Use _mm_loadu_si128 */
    __m256i  mask      ALIGNTO(32) = _mm256_castsi128_si256(_mm_load_si128((const __m128i*)maskArg));
    __m256   shufdMask ALIGNTO(32);
    __m256   zblended  ALIGNTO(32);
    __m256   sums      ALIGNTO(32) = _mm256_setzero_ps();
    float    sumsf[8]  ALIGNTO(32);

    /* Shuffle bitmask */
    shufdMask = bitMaskShufflePINSRW(mask);
    shufdMask = _mm256_castsi256_ps(_mm256_slli_epi32(_mm256_castps_si256(shufdMask), 16));

    /* Dot product */
    for(i=15;i>=0;i--){
        zblended  = _mm256_setzero_ps();
        /* Replace f[i] with _mm256_loadu_ps((float*)&f[i]) */
        zblended  = _mm256_blendv_ps(zblended, f[i], shufdMask);
        sums      = _mm256_add_ps(sums, zblended);
        shufdMask = _mm256_castsi256_ps(_mm256_slli_epi32(_mm256_castps_si256(shufdMask), 1));
    }

    /* Final Summation */
    _mm256_store_ps(sumsf, sums);
    return sumsf[0] + sumsf[1] + sumsf[2] + sumsf[3] + sumsf[4] + sumsf[5] + sumsf[6] + sumsf[7];
}

/**
 * Reference C implementation
 */

float dotRefC(float f[128], unsigned char mask[16]){
    float sum = 0.0;
    int i;

    for(i=0;i<128;i++){
        sum += ((mask[i>>3]>>(i&7))&1) ? f[i] : 0.0;
    }

    return sum;
}

/**
 * Main
 */

int main(void){
    /* Variables */
        /* Loop Counter */
    int           i;

        /* Data to process */
    float         data[128] ALIGNTO(32);
    unsigned char mask[16]  ALIGNTO(32);
    float         refCSum, sseSum;

        /* Time tracking */
    clock_t       t1, t2, t3;
    double        refCTime, sseTime;



    /* Initialize mask and float arrays with some random data. */
    for(i=0;i<128;i++){
        if(i<16)
            mask[i]=rand();

        data[i] = random();
    }


    /* RUN TESTS */
    t1 = clock();
    for(i=0;i<NUM_ITERS;i++){
        refCSum = dotRefC(data, mask);
    }
    t2 = clock();
    for(i=0;i<NUM_ITERS;i++){
        sseSum  = dotAVX2((__m256*)data, mask);
    }
    t3 = clock();


    /* Compute time taken */
    refCTime = (double)(t2-t1)/CLOCKS_PER_SEC;
    sseTime  = (double)(t3-t2)/CLOCKS_PER_SEC;


    /* Print out results */
    printf("Results:\n"
           "RefC: Time: %f    Value: %f\n"
           "SSE:  Time: %f    Value: %f\n",
           refCTime, refCSum,
           sseTime,  sseSum);

    return 0;
}

Explanation

The same concept as for SSE4.1 is used. The difference is that now we're processing 8 floats at a time and making use of AVX2's 256-bit registers and PMOVMASKB from ymm registers (which gather 256/8 = 32 bits). Because of this, we now have a simpler bitmask shuffle, and a much simpler loop.

In the mask, the 256 bits

  0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
 16  17  18  19  20  21  22  23  24  25  26  27  28  29  30  31
 32  33  34  35  36  37  38  39  40  41  42  43  44  45  46  47
 48  49  50  51  52  53  54  55  56  57  58  59  60  61  62  63
 64  65  66  67  68  69  70  71  72  73  74  75  76  77  78  79
 80  81  82  83  84  85  86  87  88  89  90  91  92  93  94  95
 96  97  98  99 100 101 102 103 104 105 106 107 108 109 110 111
112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127
128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143
144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159
160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175
176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191
192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207
208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223
224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239
240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255

are permuted, using 8 PMOVMASKB and 8 PSLLW instructions, to

  0   8  16  24  32  40  48  56  64  72  80  88  96 104 112 120
128 136 144 152 160 168 176 184 192 200 208 216 224 232 240 248
  1   9  17  25  33  41  49  57  65  73  81  89  97 105 113 121
129 137 145 153 161 169 177 185 193 201 209 217 225 233 241 249
  2  10  18  26  34  42  50  58  66  74  82  90  98 106 114 122
130 138 146 154 162 170 178 186 194 202 210 218 226 234 242 250
  3  11  19  27  35  43  51  59  67  75  83  91  99 107 115 123
131 139 147 155 163 171 179 187 195 203 211 219 227 235 243 251
  4  12  20  28  36  44  52  60  68  76  84  92 100 108 116 124
132 140 148 156 164 172 180 188 196 204 212 220 228 236 244 252
  5  13  21  29  37  45  53  61  69  77  85  93 101 109 117 125
133 141 149 157 165 173 181 189 197 205 213 221 229 237 245 253
  6  14  22  30  38  46  54  62  70  78  86  94 102 110 118 126
134 142 150 158 166 174 182 190 198 206 214 222 230 238 246 254
  7  15  23  31  39  47  55  63  71  79  87  95 103 111 119 127
135 143 151 159 167 175 183 191 199 207 215 223 231 239 247 255

For 128-element bit-with-float dot products, we then iterate in parallel on eight sets of 16 elements. This implementation can easily be extended to 256-element DPs by iterating on 32-element sets. Only one loop is required now.

Specifically, to change this to work for 256-element dot products (a sketch applying these changes follows the list), you would

  • Double the size of the function arguments. __m256 f[32], unsigned char maskArg[32].
  • Swap the mask load (= _mm256_castsi128_si256(_mm_load_si128((const __m128i*)maskArg));) with = _mm256_load_si256((const __m256i*)maskArg);.
  • Delete the compensating shift left by 16 just below the call to bitMaskShufflePINSRW.
  • Run the loop down from set 31 instead of 15: for(i=31;i>=0;i--)
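
Applying those four changes gives something like the following (untested, with the same caveats as the rest of the AVX2 code; it assumes the ALIGNTO macro and bitMaskShufflePINSRW from the listing above, and the function name is mine):

float dotAVX2_256(__m256 f[32], unsigned char maskArg[32]){
    int i;
    __m256i  mask      ALIGNTO(32) = _mm256_load_si256((const __m256i*)maskArg);
    __m256   shufdMask ALIGNTO(32);
    __m256   zblended  ALIGNTO(32);
    __m256   sums      ALIGNTO(32) = _mm256_setzero_ps();
    float    sumsf[8]  ALIGNTO(32);

    /* Shuffle bitmask; no compensating shift left by 16 this time. */
    shufdMask = bitMaskShufflePINSRW(mask);

    /* Dot product over 32 sets of 8 floats. */
    for(i=31;i>=0;i--){
        zblended  = _mm256_setzero_ps();
        zblended  = _mm256_blendv_ps(zblended, f[i], shufdMask);
        sums      = _mm256_add_ps(sums, zblended);
        shufdMask = _mm256_castsi256_ps(_mm256_slli_epi32(_mm256_castps_si256(shufdMask), 1));
    }

    /* Final Summation */
    _mm256_store_ps(sumsf, sums);
    return sumsf[0] + sumsf[1] + sumsf[2] + sumsf[3]
         + sumsf[4] + sumsf[5] + sumsf[6] + sumsf[7];
}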

I can neither test nor even run the code as my CPU is SSE4.1, but Clang with

clang -Ofast -ftree-vectorize -finline-functions -funroll-loops -mavx2 -msse4.1 -mssse3 dotavx2.c -o dottest

compiled cleanly (you may get faster code without unrolling), producing this:

(gdb) disas dotAVX2
Dump of assembler code for function dotAVX2:
0x0000000100001730 <+0>:    push   %rbp
0x0000000100001731 <+1>:    mov    %rsp,%rbp
0x0000000100001734 <+4>:    vmovdqa (%rsi),%xmm0
0x0000000100001738 <+8>:    vpslld $0x1,%ymm0,%ymm1
0x000000010000173d <+13>:   vpslld $0x1,%ymm1,%ymm2
0x0000000100001742 <+18>:   vpmovmskb %ymm2,%eax
0x0000000100001746 <+22>:   vpslld $0x1,%ymm2,%ymm2
0x000000010000174b <+27>:   vpmovmskb %ymm2,%ecx
0x000000010000174f <+31>:   vxorps %ymm3,%ymm3,%ymm3
0x0000000100001753 <+35>:   vmovd  %ecx,%xmm4
0x0000000100001757 <+39>:   vpinsrd $0x1,%eax,%xmm4,%xmm4
0x000000010000175d <+45>:   vpmovmskb %ymm1,%eax
0x0000000100001761 <+49>:   vpinsrd $0x2,%eax,%xmm4,%xmm1
0x0000000100001767 <+55>:   vpslld $0x1,%ymm2,%ymm2
0x000000010000176c <+60>:   vpslld $0x1,%ymm2,%ymm4
0x0000000100001771 <+65>:   vpslld $0x1,%ymm4,%ymm5
0x0000000100001776 <+70>:   vpmovmskb %ymm0,%eax
0x000000010000177a <+74>:   vpinsrd $0x3,%eax,%xmm1,%xmm0
0x0000000100001780 <+80>:   vpmovmskb %ymm5,%eax
0x0000000100001784 <+84>:   vpslld $0x1,%ymm5,%ymm1
0x0000000100001789 <+89>:   vpmovmskb %ymm1,%ecx
0x000000010000178d <+93>:   vmovd  %ecx,%xmm1
0x0000000100001791 <+97>:   vpinsrd $0x1,%eax,%xmm1,%xmm1
0x0000000100001797 <+103>:  vpmovmskb %ymm4,%eax
0x000000010000179b <+107>:  vpinsrd $0x2,%eax,%xmm1,%xmm1
0x00000001000017a1 <+113>:  vpmovmskb %ymm2,%eax
0x00000001000017a5 <+117>:  vpinsrd $0x3,%eax,%xmm1,%xmm1
0x00000001000017ab <+123>:  vinserti128 $0x1,%xmm0,%ymm1,%ymm0
0x00000001000017b1 <+129>:  vpslld $0x10,%ymm0,%ymm0
0x00000001000017b6 <+134>:  vblendvps %ymm0,0x1e0(%rdi),%ymm3,%ymm1
0x00000001000017c0 <+144>:  vpslld $0x1,%ymm0,%ymm0
0x00000001000017c5 <+149>:  vblendvps %ymm0,0x1c0(%rdi),%ymm3,%ymm2
0x00000001000017cf <+159>:  vaddps %ymm2,%ymm1,%ymm1
0x00000001000017d3 <+163>:  vpslld $0x1,%ymm0,%ymm0
0x00000001000017d8 <+168>:  vblendvps %ymm0,0x1a0(%rdi),%ymm3,%ymm2
0x00000001000017e2 <+178>:  vaddps %ymm2,%ymm1,%ymm1
0x00000001000017e6 <+182>:  vpslld $0x1,%ymm0,%ymm0
0x00000001000017eb <+187>:  vblendvps %ymm0,0x180(%rdi),%ymm3,%ymm2
0x00000001000017f5 <+197>:  vaddps %ymm2,%ymm1,%ymm1
0x00000001000017f9 <+201>:  vpslld $0x1,%ymm0,%ymm0
0x00000001000017fe <+206>:  vblendvps %ymm0,0x160(%rdi),%ymm3,%ymm2
0x0000000100001808 <+216>:  vaddps %ymm2,%ymm1,%ymm1
0x000000010000180c <+220>:  vpslld $0x1,%ymm0,%ymm0
0x0000000100001811 <+225>:  vblendvps %ymm0,0x140(%rdi),%ymm3,%ymm2
0x000000010000181b <+235>:  vaddps %ymm2,%ymm1,%ymm1
0x000000010000181f <+239>:  vpslld $0x1,%ymm0,%ymm0
0x0000000100001824 <+244>:  vblendvps %ymm0,0x120(%rdi),%ymm3,%ymm2
0x000000010000182e <+254>:  vaddps %ymm2,%ymm1,%ymm1
0x0000000100001832 <+258>:  vpslld $0x1,%ymm0,%ymm0
0x0000000100001837 <+263>:  vblendvps %ymm0,0x100(%rdi),%ymm3,%ymm2
0x0000000100001841 <+273>:  vaddps %ymm2,%ymm1,%ymm1
0x0000000100001845 <+277>:  vpslld $0x1,%ymm0,%ymm0
0x000000010000184a <+282>:  vblendvps %ymm0,0xe0(%rdi),%ymm3,%ymm2
0x0000000100001854 <+292>:  vaddps %ymm2,%ymm1,%ymm1
0x0000000100001858 <+296>:  vpslld $0x1,%ymm0,%ymm0
0x000000010000185d <+301>:  vblendvps %ymm0,0xc0(%rdi),%ymm3,%ymm2
0x0000000100001867 <+311>:  vaddps %ymm2,%ymm1,%ymm1
0x000000010000186b <+315>:  vpslld $0x1,%ymm0,%ymm0
0x0000000100001870 <+320>:  vblendvps %ymm0,0xa0(%rdi),%ymm3,%ymm2
0x000000010000187a <+330>:  vaddps %ymm2,%ymm1,%ymm1
0x000000010000187e <+334>:  vpslld $0x1,%ymm0,%ymm0
0x0000000100001883 <+339>:  vblendvps %ymm0,0x80(%rdi),%ymm3,%ymm2
0x000000010000188d <+349>:  vaddps %ymm2,%ymm1,%ymm1
0x0000000100001891 <+353>:  vpslld $0x1,%ymm0,%ymm0
0x0000000100001896 <+358>:  vblendvps %ymm0,0x60(%rdi),%ymm3,%ymm2
0x000000010000189d <+365>:  vaddps %ymm2,%ymm1,%ymm1
0x00000001000018a1 <+369>:  vpslld $0x1,%ymm0,%ymm0
0x00000001000018a6 <+374>:  vblendvps %ymm0,0x40(%rdi),%ymm3,%ymm2
0x00000001000018ad <+381>:  vaddps %ymm2,%ymm1,%ymm1
0x00000001000018b1 <+385>:  vpslld $0x1,%ymm0,%ymm0
0x00000001000018b6 <+390>:  vblendvps %ymm0,0x20(%rdi),%ymm3,%ymm2
0x00000001000018bd <+397>:  vaddps %ymm2,%ymm1,%ymm1
0x00000001000018c1 <+401>:  vpslld $0x1,%ymm0,%ymm0
0x00000001000018c6 <+406>:  vblendvps %ymm0,(%rdi),%ymm3,%ymm0
0x00000001000018cc <+412>:  vaddps %ymm0,%ymm1,%ymm0
0x00000001000018d0 <+416>:  vpshufd $0x1,%xmm0,%xmm1
0x00000001000018d5 <+421>:  vaddss %xmm1,%xmm0,%xmm1
0x00000001000018d9 <+425>:  vmovhlps %xmm0,%xmm0,%xmm2
0x00000001000018dd <+429>:  vaddss %xmm1,%xmm2,%xmm1
0x00000001000018e1 <+433>:  vpshufd $0x3,%xmm0,%xmm2
0x00000001000018e6 <+438>:  vaddss %xmm1,%xmm2,%xmm1
0x00000001000018ea <+442>:  vextracti128 $0x1,%ymm0,%xmm0
0x00000001000018f0 <+448>:  vaddss %xmm1,%xmm0,%xmm1
0x00000001000018f4 <+452>:  vpshufd $0x1,%xmm0,%xmm2
0x00000001000018f9 <+457>:  vaddss %xmm1,%xmm2,%xmm1
0x00000001000018fd <+461>:  vpshufd $0x3,%xmm0,%xmm2
0x0000000100001902 <+466>:  vmovhlps %xmm0,%xmm0,%xmm0
0x0000000100001906 <+470>:  vaddss %xmm1,%xmm0,%xmm0
0x000000010000190a <+474>:  vaddss %xmm0,%xmm2,%xmm0
0x000000010000190e <+478>:  pop    %rbp
0x000000010000190f <+479>:  vzeroupper 
0x0000000100001912 <+482>:  retq   
End of assembler dump.

As we can see, the kernel is 3 instructions (vblendvps, vaddps, vpslld) now.

Iwillnotexist Idonotexist
  • You are in the right path: how the mask is formed (i.e. the least amount of instruction, parallelised, no unpredictable memory accesses), makes the most difference in the algorithm. I'd hope to find a 4-instruction kernel... – Aki Suihkonen Oct 14 '13 at 20:25
  • Well, I think we may be already at the bare minimum number of instructions unfortunately. There's the conditional selection/blend of the source into a zero vector, there's the accumulation, there's the shift and there's the reformation of the zero vector. – Iwillnotexist Idonotexist Oct 14 '13 at 20:44
  • That said MOVDQA can be handled at the register-rename stage, and the result of that XORPS is known to the processor to be independent of the previous contents of XMM4 and is thus handled specially (and more speedily). I would expect better performance with the non-destructive operations in AVX (wouldn't have to remake the 0-vector). Of course, AVX also has 256-bit vectors holding 8 floats, and VBLENDVPS is not forced to use XMM0 as the source of its mask. I only wonder why is the bit mask not in XMM0. I see no particular reason why XMM0 couldn't be shifted in-place. – Iwillnotexist Idonotexist Oct 14 '13 at 21:00
  • This sounded too good to be true. So I tried adding volatile in refCTime, sseTime definition, and used addition (+=) in NUM_ITERS loops to prevent dead-code elimination. It's about 5% slower than Chris Dodd's SSE2 code. – John Smith Nov 08 '13 at 14:57
  • By the way, your reference C code is slow since it's operating on char array. – John Smith Nov 08 '13 at 15:04
  • gcc4.8 doesn't kill off the loop (using the same flags plus -std=gnu99) – John Smith Nov 08 '13 at 15:06
  • Sorry, I meant I made refCSum and sseSum volatile. – John Smith Nov 08 '13 at 15:07
  • With USE_PINSRW 1, it's 10% faster than the SSE2 code however. – John Smith Nov 08 '13 at 15:10
  • 1
    @John Smith, I have edited my solution to include (untested, because I don't have Haswell hardware) AVX2 code. Please try it. – Iwillnotexist Idonotexist Nov 11 '13 at 00:40
  • 1
    @Aki Suihkonen: It turns out in AVX2 it is possible to kill off _two_ more instructions in the kernel! My kernel is now 3 instructions only. – Iwillnotexist Idonotexist Nov 11 '13 at 00:43
  • Wow! Thanks! Can you wait a bit? It may take some time until I can start using the Haswell CPU. I will do the benchmarks against SSE2 code. I'm also planning to extend the test BTW: I will prepare a random 128-bit bitvector array and take average by looping over it instead; the current benchmark is sensitive to the input bitvector. – John Smith Nov 12 '13 at 06:47
  • Can I ask what's your usual number of elements in the dot-product? The AVX2 code would probably do best with 256 elements in a dot-product, but can do fewer if the `shufdMask` is shifted. That's why in this 128-element version I shift it left by 16. – Iwillnotexist Idonotexist Nov 12 '13 at 12:48
  • 1
Also I don't know about the wisdom of loop unrolling on Haswell; the loop kernel is so ridiculously tiny that it may be profitable to not unroll, or to limit the unroll factor, such that it fits entirely within the uop cache and loop cache. Cloning the loop body and shifting left by two in both copies could also help, but I'm unaware of a way to convince the compiler to do so. – Iwillnotexist Idonotexist Nov 12 '13 at 12:52
  • I see, thanks! It's 128 due to memory-related restrictions. With memory upgraded to 64GB someday, I hope I can go 256. – John Smith Nov 13 '13 at 04:53
  • I will try and measure the effect of GCC's -fno-tree-loop-optimize switch. – John Smith Nov 13 '13 at 04:54
  • I'm wondering though, since full 256-bits are not required, wouldn't AVX code suffice and have similar performance? If that is the case, I can test it right away. – John Smith Nov 13 '13 at 05:09
  • 1
    Mmmm... I think you meant to test -fno-unroll-loops? AVX2 is somewhat unavoidable because VPMOVMSKB and VPSLLD are 256-bit integer instructions, and Intel for some reason decided that only float ops would get 256-bit extensions in AVX. Integer ops only got that privilege with AVX2. – Iwillnotexist Idonotexist Nov 13 '13 at 11:46
  • 1
    Chris Dodd's code, because it uses only the AND bitwise operation (available in a floating-point version), is actually implementable in AVX. However, you're faced then with the choice of either extending his LUT to 256 elements of 8 floats of 4 bytes each (8KB), or to load the mask from your LUT in two 4-element halves, shuffle the upper half into the top 128 bits of a YMM register, AND it with your data and accumulate. – Iwillnotexist Idonotexist Nov 13 '13 at 11:56
  • Hi again! I'm sorry about the delay. I finally got the Haswell hardware. I'm trying to test it, but I keep getting segfaults in dotAVX2. I will try to find the reason.. – John Smith Jan 20 '14 at 14:49
  • However, the improvement is definitely there. It appears to be working ~20x faster than the sse4.1 version on i7 4770K! (sans the random segmentation faults). Thanks so much! Now I will try to fix the segfault... – John Smith Jan 20 '14 at 14:56
  • 1
    @JohnSmith Cool! The only place I could see a segfault coming out from in my benchmark is the fact that in `main` I declared the data as aligned to 16 bytes when it should be 32. I don't know where else the segfault would arise in my code, unless it be a misaligned access. If you can check that the first argument you pass in to `dotAVX2` is 32-bytes-aligned and the other is 16-bytes-aligned, it should work. Another thing you could try is Valgrinding my code; Optimized, it ought to make precisely one memory access to each of the __m256 array elements and to the mask. – Iwillnotexist Idonotexist Jan 20 '14 at 19:05
  • Whoa! I changed the alignment of mask to 32 and it worked like a charm! Huge thanks! I will test it on a real-world data set and see how it performs and post results here! – John Smith Jan 22 '14 at 08:36
I just realized that the main program cannot make any alignment guarantees. I'm wondering whether there is a way of making an unaligned version?... – John Smith Jan 22 '14 at 08:40
  • 1
    @JohnSmith Luckily that's not fundamentally a problem. First change the load of the mask from `_mm_load_si128` to `_mm_loadu_si128`, if the mask can't be aligned. That might be enough but I doubt it. If not, replace the `f[i]` argument of `_mm256_blendv_ps` with `_mm256_loadu_ps((float*)&f[i])`. Hopefully this will not inordinately slow down the code. – Iwillnotexist Idonotexist Jan 22 '14 at 14:25
  • I integrated the AVX2 code into the main code and performed benchmarks. I'm very confused, the SSE2 code takes 26.5ns whereas AVX2 code takes 29.8ns on average... On a i7 4770K Haswell, with aligned instructions. I'll investigate more. – John Smith Jan 24 '14 at 05:45
  • @JohnSmith Excellent! What now if you try `-fno-unroll-loops`? – Iwillnotexist Idonotexist Jan 24 '14 at 07:23
  • Oh, sorry. I made a mistake! I compiled with gcc -Ofast -mtune=native -march=native last time. I just tried with gcc and used the options you wrote, and it is reduced to 25.9ns! (clang gives 26.2) – John Smith Jan 24 '14 at 07:50
  • With -fno-unroll-loops, it's 26.4ns. Maybe the reason for SSE2 and AVX2 performing similar is because the total data I feed into the function during the loop does not fit into the cache? (I have a big data array and each time, I'm feeding a consecutive piece of it to the function, both for f and mask) – John Smith Jan 24 '14 at 07:55
  • By the way, I found that this is a more realistic test: http://pastebin.com/X6vVbE8Y Results: RefC: Time: 1.309480 Value: 155216157834280960.000000 SSE41: Time: 0.065533 Value: 155216155722465280.000000 AVX2: Time: 0.045633 Value: 155216155786321920.000000 SSE2: Time: 0.055111 Value: 155216155875123200.000000 – John Smith Jan 24 '14 at 08:48
  • @JohnSmith This is all quite bizarre. If you could also pastebin the disassemblies of this code I'll have a look. – Iwillnotexist Idonotexist Jan 24 '14 at 12:12
  • Sure, here is the clang assembly output: http://pastebin.com/ZS28SFGW I added srand(time(0)) at the beginning to randomize the data, and made a fresh run: RefC: Time: 1.317266 Value: 156455311676047360.000000 SSE41: Time: 0.092267 Value: 156455308830023680.000000 AVX2: Time: 0.054270 Value: 156455308236431360.000000 SSE2: Time: 0.055330 Value: 156455308361850880.000000 It appears that the performance slightly depends on the data – John Smith Jan 26 '14 at 06:32
  • I just found a bug: warning: variable 'imask' is uninitialized when used here [-Wuninitialized] imask = _mm256_insert_epi32(imask, _mm256_movemask_epi8(mask), 7); – John Smith Feb 24 '14 at 14:09
  • @JohnSmith Mmmm, that may be, but this is merely the first of 8 PINSRD instructions which, between themselves, cover the entire 8x32=256-bit mask. But the compiler may have been too smart for its own good and deleted the UB. While I don't have access to Haswell hardware, in less than a week I may; I will then try to micro-optimize some more. The assembler dump you gave doesn't have the best instruction scheduling, and Haswell's throughput on _addition_ can be increased if using `a = fma3(a, b, 1.0)`. Lastly I was thinking of doing two >> by 2 so as to keep the EUs busy and not bottleneck. – Iwillnotexist Idonotexist Feb 24 '14 at 18:17
  • As always, thanks so much! Can't wait for it :) BTW, I just updated the C code: http://pastebin.com/nS1mVjTE (added srand and a simple unit-test) – John Smith Feb 25 '14 at 01:07
  • I've updated the code to normalize the data points http://pastebin.com/H8npgSBi and make the loads unaligned. I also initialized imask to 0. Strangely, the results of the AVX2 code are usually off by 5-10% in comparison to others. Here's a sample run: Done init dotRefC: 33.503750 dotSSE41: 33.503754 AVX2: 35.673569 SSE2: 33.503754 refC done SSE41 done AVX2 done SSE2 done Results: RefC: Time: 1.300441 Value: 72763287.925549 SSE41: Time: 0.064334 Value: 72763287.874126 AVX2: Time: 0.042647 Value: 72757084.042282 SSE2: Time: 0.054811 Value: 72763287.952557 Is AVX2 less sensitive? – John Smith Feb 25 '14 at 10:19
  • @JohnSmith I do notice one oddity. In my post above, the AVX2 loop _descends_ from 15 downto 0, as it should since the high-order bits represent elements towards the end of the array. In the code you just posted up you _ascend_ 0 to 15. That would be wrong, but I don't know where that comes from. In any case by 15 d.t. 0 behaviour was fully intentional, to match with the shift lefts and VBLENDVPS instructions. – Iwillnotexist Idonotexist Feb 25 '14 at 12:31
  • Oh, indeed! My bad, I'm sorry, it is indeed giving the same result as SSE4.1 code after correction. Thanks so much! – John Smith Feb 25 '14 at 14:11
  • @JohnSmith I haven't yet gotten my Haswell, but in the meanwhile I duplicated the inner loop's body and am now shifting by two in each: http://pastebin.com/qWv3A9D0 Compile that with `clang -Ofast -mavx2 dotavx2.c -o dotavx2`, and try both with `-fno-unroll-loops` and `-funroll-loops`. I also added in comments the additions taken as FMA's with 1; Allegedly they may have higher throughput but also higher latency. That needs the extra flag `-mfma`. EDIT: I forgot to add a macro, #define _mm256_slli_ps(x, a) \ _mm256_castsi256_ps(_mm256_slli_epi32(_mm256_castps_si256((x)), (a))); – Iwillnotexist Idonotexist Mar 01 '14 at 21:32
  • Thanks again! The best version turned out to be with loops unrolled, without fma; 11% reduction in run-time in comparison with the previous AVX2 version (and runs 1.35x faster than SSE2 version): RefC: Time: 1.307929 Value: 72278155.868073 SSE41: Time: 0.068159 Value: 72278154.884682 AVX2: Time: 0.039961 Value: 72278155.037003 SSE2: Time: 0.053807 Value: 72278154.955769 http://pastebin.com/cN4xNHsz – John Smith Mar 11 '14 at 05:25
  • @JohnSmith Great news – I just got my Haswell i7-4700 hardware and once I can finally dual-boot with Linux it'll be much easier for me to micro-optimize for you. – Iwillnotexist Idonotexist Mar 11 '14 at 06:48
  • @JohnSmith System up and running; And I'm capable of reproducing your Clang results, modulo the fact that a 4770K is much faster than a 4700MQ. GCC 4.8 did a much poorer job. Let's go for the kill! – Iwillnotexist Idonotexist Mar 12 '14 at 03:03
  • Congrats! This is great news! Can't wait for further news, fingers crossed :) – John Smith Mar 12 '14 at 06:10
  • 1
    @JohnSmith: I suspect it would be better to interleave the mask-handling with the mask usage. e.g. start at the *end* of the float array, and use the top 8 bits of the mask. Start by broadcasting the top 32 bits of the mask to all elements of a YMM. Then `vpsllvd` to left shift by `[0,1,2,3,4,5,6,7]`, leaving the top bit of the mask in at the top of the high element, and the 8th-highest bit of the mask in the low element. Then do a masked load from the high end of the vector with `vblendvps`, or better `vmaskmovps` (fewer uops on Skylake, same as blendv on Haswell). – Peter Cordes Apr 03 '19 at 11:19
  • 1
    ... Then `vpslld` by 8 to bring the next 8 mask bits into place, and decrement your pointer by 32 bytes. You have to stop and broadcast + variable-shift new mask bits in every 4 vectors, but interleaving that work with the FP adds is probably very good for hiding latency of everything involved. (Speaking of which, you'd want at least 2 accumulator vectors to hide (V)ADDPS latency). And BTW, see also [is there an inverse instruction to the movemask instruction in intel avx2?](//stackoverflow.com/a/36491672) for more mask -> vector, especially for 32-bit element width. – Peter Cordes Apr 03 '19 at 11:24
4

If one allows a slightly different permutation of the data in float data[128], or does the corresponding permutation in the bitmask __m128 mask, one can slightly improve the algorithm suggested by Chris Dodd above. (Not counting the time required to permute the mask, this implementation (+ overhead) is about 25% faster.) This is of course just a quick draft of the idea I provided in the comments.

union {
    unsigned int i[4];
    float f[4];
    __m128   xmm;
} mask = { 0xFF00FF00, 0xF0F0F0F0, 0xCCCCCCCC, 0xAAAAAAAA };

float dot2(__m128 *a, __m128 mask);
// 20M times 1.161s

float dotref(__m128 *a, unsigned int *mask) // 20M times 8.174s
{
    float z=0.0f;
    int i;
    for (i=0;i<32;i++) {
       if (mask[0] & (0x80000000U >> i)) z+= a[i][0];
       if (mask[1] & (0x80000000U >> i)) z+= a[i][1];
       if (mask[2] & (0x80000000U >> i)) z+= a[i][2];
       if (mask[3] & (0x80000000U >> i)) z+= a[i][3];       
    }
   return z;
}

The corresponding assembler implementation would be:

dot2:
    // warm up stage: fill in initial data and
    // set up registers
    pxor %xmm1, %xmm1      ;; // clear partial sum1
    pxor %xmm2, %xmm2      ;; // clear partial sum2
    movaps (%rdi), %xmm3   ;; // register warm up stage1
    movaps 16(%rdi), %xmm4 ;; // next 4 values
    pxor %xmm5, %xmm5
    pxor %xmm6, %xmm6
    lea 32(%rdi), %rdi
    movl $16, %ecx            ;; // process 2x4 items per iteration (total=128)
a:  ;; // inner loop -- 2 independent data paths
    blendvps %xmm3, %xmm5
    pslld $1, %xmm0
    movaps (%rdi), %xmm3   
    blendvps %xmm4, %xmm6
    pslld $1, %xmm0
    movaps 16(%rdi), %xmm4
    addps %xmm5, %xmm1
    pxor  %xmm5, %xmm5
    addps %xmm6, %xmm2
    pxor  %xmm6, %xmm6
    lea 32(%rdi), %rdi
    loop a
 ;; // cool down stage: gather results (xmm0 = xmm1+xmm2)
 ;; // in beautiful world this stage is interleaved
 ;; // with the warm up stage of the next block
    addps %xmm2, %xmm1
    movaps  %xmm1, %xmm2
    movaps  %xmm1, %xmm0
    shufps  $85, %xmm1, %xmm2
    addss   %xmm2, %xmm0
    movaps  %xmm1, %xmm2
    unpckhps %xmm1, %xmm2
    shufps  $255, %xmm1, %xmm1
    addss   %xmm2, %xmm0
    addss   %xmm1, %xmm0
    ret
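
For reference, a rough intrinsics rendering of the same idea (a sketch only; it assumes SSE4.1 for blendvps, the function name is mine, and mask is the pre-permuted bitmask described above):

#include <smmintrin.h>   /* SSE4.1 for _mm_blendv_ps */

/* Two vectors per iteration into two independent partial sums; each lane's
 * current sign bit selects whether the corresponding float is kept or zeroed. */
float dot2_sketch(__m128 *a, __m128 mask) {
    __m128  sum1 = _mm_setzero_ps();
    __m128  sum2 = _mm_setzero_ps();
    __m128i m    = _mm_castps_si128(mask);
    int i;

    for (i = 0; i < 16; i++) {                  /* 2x4 floats per iteration, 128 total */
        __m128 sel1 = _mm_blendv_ps(_mm_setzero_ps(), a[2*i],     _mm_castsi128_ps(m));
        m = _mm_slli_epi32(m, 1);               /* next bit into the sign position */
        __m128 sel2 = _mm_blendv_ps(_mm_setzero_ps(), a[2*i + 1], _mm_castsi128_ps(m));
        m = _mm_slli_epi32(m, 1);
        sum1 = _mm_add_ps(sum1, sel1);
        sum2 = _mm_add_ps(sum2, sel2);
    }

    /* Gather results: horizontal sum of the four lanes */
    sum1 = _mm_add_ps(sum1, sum2);
    sum1 = _mm_add_ps(sum1, _mm_movehl_ps(sum1, sum1));
    sum1 = _mm_add_ss(sum1, _mm_shuffle_ps(sum1, sum1, 1));
    return _mm_cvtss_f32(sum1);
}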
Aki Suihkonen
  • One should also duplicate mask `xmm1 = xmm0 << 1`; and shift both by 2. This would remove the last dependency in the inner loop. One can also try to find a better interleaving offset. This uses offset of 16, but perhaps an offset of 64 or 256 is better. Also either option may lead to easier calculation for the bitmask permutation (in xmm0 and xmm1). – Aki Suihkonen Apr 19 '13 at 15:20
3

Here are some things to try.

Try to get the compiler to use CMOV instead of a branch. (Note that using a union this way is well-defined in C11 but undefined in C++11.)

union {
    int i;
    float f;
} u;
u.i = 0;
if (b & 1) {
    u.f = a[i];
}
sum += u.f;

Use a multiply instead of a branch.

sum += (b & 1) * a[i];

Keep several sums and add them at the end to reduce data flow dependencies. (You could combine either of the above suggestions with this one.)

float sum0 = 0, sum1 = 0, sum2 = 0, sum3 = 0;
for (int i = 0; i < 64; i += 4, b >>= 4) {
    if (b & 1) sum0 += a[i];
    if (b & 2) sum1 += a[i+1];
    if (b & 4) sum2 += a[i+2];
    if (b & 8) sum3 += a[i+3];
}
return sum0 + sum1 + sum2 + sum3;

Reduce the number of branches by processing several bits at a time:

for (int i = 0; i < 64; i += 4, b >>= 4) {
    switch (b & 0xf) {
        case 0:
            break;
        case 1:
            sum += a[i];
            break;
        case 2:
            sum += a[i + 1];
            break;
        case 3:
            sum += a[i] + a[i+1];
            break;
        case 4:
            sum += a[i+2];
            break;
        // etc. for cases up to and including 15
    }
}

You could keep several sums and for each sum process several bits at a time. In that case you'd probably want to use a macro or an inlined function and invoke it four times.
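
One possible shape for that combination (a sketch only; the helper name is mine, and an if-chain stands in for the 16-case switch):

#include <stdint.h>

/* Handle 4 bits into one running sum. */
static inline float sum4(const float *a, unsigned bits) {
    float s = 0;
    if (bits & 1) s += a[0];
    if (bits & 2) s += a[1];
    if (bits & 4) s += a[2];
    if (bits & 8) s += a[3];
    return s;
}

/* Four independent sums, each fed 4 bits per iteration. */
float dot_combined(const float *a, uint64_t b) {
    float sum0 = 0, sum1 = 0, sum2 = 0, sum3 = 0;
    for (int i = 0; i < 64; i += 16, b >>= 16) {
        sum0 += sum4(a + i,      b         & 0xf);
        sum1 += sum4(a + i + 4,  (b >> 4)  & 0xf);
        sum2 += sum4(a + i + 8,  (b >> 8)  & 0xf);
        sum3 += sum4(a + i + 12, (b >> 12) & 0xf);
    }
    return (sum0 + sum1) + (sum2 + sum3);
}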

rob mayoff
  • 1
    `!!(b & 1)` is the same as just `(b & 1)`, isn't it? – Carl Norum Apr 17 '13 at 04:45
  • @CarlNorum Ha ha, yes it is. Good call. I have edited my answer. – rob mayoff Apr 17 '13 at 04:47
  • No problem, just less to type. :-) Great suggestions. – Carl Norum Apr 17 '13 at 04:47
  • 1
    You would still need `!!(b & 2)`, or `(b & 2) >> 1`, if you want to keep multiple running sums. – rob mayoff Apr 17 '13 at 04:48
  • 1
Thanks for the input. They do look nice, but they perform worse in reality (using the same data set as before). With multiplication: cost 201% time. I actually benchmarked with different sums before I wrote this post, around 158% slower. Using switch: shifting two bits at a time gives 154% time of the original, shifting four bits at a time gives 130%. – John Smith Apr 17 '13 at 05:25
  • With the change "i < 64" to "v && i < 64", 4-bit switch-case runs at 105%. – John Smith Apr 17 '13 at 05:34
I tried processing 8 bits at a time (256 cases total); on my machine it's about twice as fast as the naive approach. I have only done limited testing with a constant bit-vector though. – n. m. could be an AI Apr 17 '13 at 07:37
  • I have a dataset with about a million entries, and the result I report is the average BTW. – John Smith Oct 11 '13 at 05:01
3

I've found that the generated assembly for Chris Dodd's code is very strongly compiler-dependent; clang turns it into a loop while gcc (4.6 and 4.7) and Intel icc (12.x and 13.x) unroll the loop instead. Still, one can reduce dependencies (the need to wait for the previous sum +=) by turning it into a map-reduce,

float dot(__m128 *a, uint64_t b) {
    __m128 sum[8];
    int i;

    for (i = 0; i < 8; i++) {
            sum[i] = _mm_add_ps(
                    _mm_and_ps(a[2*i], mask[b & 0xf].xmm),
                    _mm_and_ps(a[2*i+1], mask[(b & 0xf0) >> 4].xmm));
            b >>= 8;
    }
    for (i = 0; i < 4; i++) {
            sum[i] = _mm_add_ps(sum[2*i], sum[2*i+1]);
    }
    sum[0] = _mm_add_ps(sum[0], sum[1]);
    sum[2] = _mm_add_ps(sum[2], sum[3]);

    sum[0] = _mm_add_ps(sum[0], sum[2]);

    sum[0] = _mm_hadd_ps(sum[0], sum[0]);
    sum[0] = _mm_hadd_ps(sum[0], sum[0]);

    i = _mm_extract_ps(sum[0], 0);
    return *((float*)(&i));
}

This creates distinctly inferior assembly with clang (which stores the sum[] on the stack) but better code (no dependencies on subsequent addps) with gcc and icc. Interestingly enough, only gcc gets the idea at the end that the lower float in sum[0] can be returned in-place...

Nice exercise on how to tweak for specific compilers ...

FrankH.
8 accumulators is probably overkill for such a short vector, but yes at least 2 accumulators would be a much better idea. Especially if handling the mask with scalar integer stuff and a LUT, unrolling that to deal with whole bytes is good. If compilers are smart, they'd even notice that `mask[(b&0xf0)>>4]` can optimize away the right shift, because each element of `mask` is a 16-byte vector so just `b&0xf0` is already a byte index. – Peter Cordes Apr 03 '19 at 11:01
  • BTW, your scalar extract is evil. Never use `_mm_extract_ps` for that, use `_mm_cvtss_f32(sum[0]);`. And see [Fastest way to do horizontal float vector sum on x86](//stackoverflow.com/q/6996764) for a better hsum than 2x `haddps`. – Peter Cordes Apr 03 '19 at 11:02
1

If you have an i7, then you have SSE4.1 and you can use the _mm_dp_ps intrinsic, which computes a dot product. With your example code it would look like this:

#include <stdint.h>
#include <immintrin.h>

const float fltmask[][4] =
  {{0, 0, 0, 0}, {0, 0, 0, 1}, {0, 0, 1, 0}, {0, 0, 1, 1},
   {0, 1, 0, 0}, {0, 1, 0, 1}, {0, 1, 1, 0}, {0, 1, 1, 1},
   {1, 0, 0, 0}, {1, 0, 0, 1}, {1, 0, 1, 0}, {1, 0, 1, 1},
   {1, 1, 0, 0}, {1, 1, 0, 1}, {1, 1, 1, 0}, {1, 1, 1, 1}};

// a has 64 elements. b is a bitvector of 64 dimensions.
float dot(float * restrict a, uint64_t b) {
     int i;
    float sum = 0;
    for(i=0; b && i<64; i+=4,b>>=4) {
        __m128 t0 = _mm_load_ps (a);
        a += 4;
        __m128 t1 = _mm_load_ps (fltmask[b & 15]);
        sum += _mm_cvtss_f32 (_mm_dp_ps (t0, t1, 15));
    }
    return sum;
}

PS. The arrays a and fltmask had better be 16-byte aligned!

PPS. When compiled with gcc -std=c99 -msse4 -O2 the loop looks like:

.L3:
    movq    %rdx, %rax
    movaps  (%rcx), %xmm1
    shrq    $4, %rdx
    andl    $15, %eax
    addq    $16, %rcx
    addl    $4, %r8d
    salq    $4, %rax
    testq   %rdx, %rdx
    dpps    $15, (%r9,%rax), %xmm1
    addss   %xmm1, %xmm0
    jne .L13

and with -O3 it's unrolled, of course.

chill
  • Thanks for the input! Sorry, I have only i3 here (SSE3) so I can't do benchmarks right-now. I can access an i7 this weekend though. I'm wondering what are the benchmark numbers (for ~1M uint64_t numbers)? – John Smith Oct 11 '13 at 05:37
  • i5 also has SSE4.1. I think all desktop Intel CPUs since Core 2 (Penryn) have SSE4.1. No idea for benchmark numbers, it depends also on a lot of other factors (memory, caches). – chill Oct 11 '13 at 05:39
Sorry, I checked /proc/cpuinfo and it turned out to be an i3 + SSE3. In any case, I'll try it tomorrow. I have a wide data set, so such factors will be smoothed out. – John Smith Oct 11 '13 at 05:40
  • And just a minor question, what is the best way to generalize this to 128-bits? Just doubling the loop? Or replacing 64 with 128? – John Smith Oct 11 '13 at 05:40
  • Yeah, a second loop over a second bitmask. Or two calls to the function :) – chill Oct 11 '13 at 05:44
  • Overall, the program runs slightly slower than the SSE2 version that uses _mm_and_ps on i5. I'm wondering, however, is the mask you're passing to _mm_dp_ps correct? Because results also look strange (and wrong, in fact) with this function. – John Smith Oct 12 '13 at 03:42
I just checked the documentation for _mm_dp_ps, and it appears that the upper four bits and at least the lowest bit should be set, if I'm not making any mistake here. With the mask 15<<4 | 1, results now make sense. It runs ~5% slower than the SSE2 version, however. – John Smith Oct 12 '13 at 03:51
@JohnSmith, ah yes, the mask also controls which elements of the vectors take part in the dot product, so your `(15 << 4) | 1` is correct - the upper part selects the sources, the lower part selects the destinations. If you mean the version by Chris Dodd above, yes, it uses only `andps` and `addps` in the loop and they both have latency of one CPU cycle, whereas `dpps` is at 11 cycles. – chill Oct 12 '13 at 06:36
  • 3
    The dot product and other horizontal instructions are known to be inferior (speedwise) to longer and more complex sequences. – Aki Suihkonen Oct 15 '13 at 09:37
  • 1
    Indeed, this turned out to be ~2x slower than the SSE2 code on my i5 machine with SSE4.2 :( – John Smith Nov 08 '13 at 15:17
1

Another way to describe this is a masked reduction (sum).

AVX512 has masking built-in, e.g. with _mm256_mask_add_ps. (For such short arrays, you may want to only use 256-bit vectors on Skylake-AVX512, to avoid limiting max turbo. Unless your code spends a lot of time doing this or other loops that can benefit from 512-bit vectors.)

It's literally one asm instruction to vaddps from a memory source, leaving elements unchanged where the bit in the mask register was unset. e.g. your loop body can literally be as simple as

vaddps    zmm0{k1}, zmm0, [rdi]    ; merge-masking into ZMM0
kshiftrq  k1, k1, 16               ; bring the next 16 mask bits down to the bottom

(And starting with a zero-masked load to initialize zmm0. Actually you'll want multiple accumulators to hide FP-add latency, so multiple zero-masked loads using the low couple vectors of bits from your mask is ideal.)

So really you'd want a compiler to spit out asm something like this. (Should be straightforward to write with intrinsics, but more verbose so I just wrote asm instead of taking the time to look up the intrinsic names.)

   ;float dot(float *restrict a, uint64_t b)
dot:
    kmovq     k1, rsi                  ; __mmask64 m = b
    vmovups   zmm0{k1}{z}, [rdi]       ; zero-masking
    kshiftrq  k2, k1, 16
    vmovups   zmm1{k2}{z}, [rdi+ 16*4]   ; next vector of 16x 4-byte floats
    kshiftrq  k3, k1, 32
    kshiftrq  k4, k1, 48

    vaddps    zmm0{k3}, zmm0, [rdi + 16*4 * 2]   ; merge-masking into ZMM0
    vaddps    zmm1{k4}, zmm1, [rdi + 16*4 * 3]

    ;; if you have more mask data, use it and do some more adds, or maybe run 2 more dep chains of FP add before combining.

    vaddps    zmm0, zmm0, zmm1               ; reduce down to 1 accumulator

    ;then horizontal sum down to scalar
    VEXTRACTF64x4   ymm1, zmm0, 1
    vaddps          ymm0, ymm0, ymm1    ; narrow to 256
    vextractf128    xmm1, ymm0, 1
    vaddps          xmm0, xmm0, xmm1    ; narrow to 128
    vmovshdup / vaddps / vmovhlps / vaddss

    ret
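
For reference, a rough intrinsics rendering of the asm above (a sketch, assuming AVX-512F and a 64-element float array; the function name is mine):

#include <stdint.h>
#include <immintrin.h>

float dot_avx512_sketch(const float *a, uint64_t b) {
    __mmask16 k0 = (__mmask16)(b);
    __mmask16 k1 = (__mmask16)(b >> 16);
    __mmask16 k2 = (__mmask16)(b >> 32);
    __mmask16 k3 = (__mmask16)(b >> 48);

    /* Two accumulators, started with zero-masked loads. */
    __m512 acc0 = _mm512_maskz_loadu_ps(k0, a);
    __m512 acc1 = _mm512_maskz_loadu_ps(k1, a + 16);

    /* Merge-masked adds: elements stay unchanged where the mask bit is 0. */
    acc0 = _mm512_mask_add_ps(acc0, k2, acc0, _mm512_loadu_ps(a + 32));
    acc1 = _mm512_mask_add_ps(acc1, k3, acc1, _mm512_loadu_ps(a + 48));

    /* Combine and reduce to a scalar. */
    return _mm512_reduce_add_ps(_mm512_add_ps(acc0, acc1));
}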

See Fastest way to do horizontal float vector sum on x86 for how to write horizontal sums efficiently in C with intrinsics, so they compile to asm that doesn't suck. (Not haddps, and not a naive store to array and loop over it.)

If mask data is coming from memory, you could load it directly into mask registers. Unfortunately on current Skylake-AVX512, that costs 3 uops, so you want to do a 64-bit load and use mask-shifts. Or let the compiler do its thing.


With AVX2, @IWill's answer looks like it's on the right track, but I suspect it would be better to interleave the mask-handling with the mask usage.

e.g. start at the end of the float array, and use the top 8 bits of the mask. Start by broadcasting the top 32 bits of the mask to all elements of a YMM. Then vpsllvd to left shift by [0,1,2,3,4,5,6,7], leaving the top bit of the mask in at the top of the high element, and the 8th-highest bit of the mask in the low element.

Then do a masked load from the high end of the vector with vblendvps, or better vmaskmovps (fewer uops on Skylake, same as blendv on Haswell). vblendvps is always at least 2 uops. Fun fact: the non-VEX version with implicit XMM0 source is single-uop on Skylake.

Then vpslld by 8 to bring the next 8 mask bits into place, and decrement your pointer by 32 bytes.

You have to stop and broadcast + variable-shift new mask bits in every 4 vectors, but interleaving that work with the FP adds is probably very good for hiding latency of everything involved.
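
A rough intrinsics sketch of that interleaving (my own, untested; it assumes AVX2, a 64-float array and a 64-bit mask, uses vmaskmovps via _mm256_maskload_ps, and shows only two accumulators):

#include <stdint.h>
#include <immintrin.h>

/* Walk the array from the high end, 4 vectors (32 floats) per broadcast of
 * 32 mask bits. Lane j of the shifted mask carries, in its sign bit, the
 * bit for element j of the vector currently being loaded. */
float dot_avx2_sketch(const float *a, uint64_t b) {
    const __m256i shifts = _mm256_setr_epi32(7, 6, 5, 4, 3, 2, 1, 0);
    __m256 sum0 = _mm256_setzero_ps(), sum1 = _mm256_setzero_ps();

    for (int chunk = 1; chunk >= 0; chunk--) {
        uint32_t mask32 = (uint32_t)(b >> (32 * chunk));
        __m256i m = _mm256_set1_epi32((int)mask32);   /* broadcast 32 mask bits */
        m = _mm256_sllv_epi32(m, shifts);             /* bit (24+j) -> sign of lane j */
        const float *p = a + 32 * chunk + 24;         /* highest vector of this chunk */
        for (int v = 0; v < 4; v++, p -= 8) {
            __m256 masked = _mm256_maskload_ps(p, m); /* element j loaded iff its sign bit is set */
            if (v & 1) sum1 = _mm256_add_ps(sum1, masked);
            else       sum0 = _mm256_add_ps(sum0, masked);
            m = _mm256_slli_epi32(m, 8);              /* next 8 mask bits into place */
        }
    }

    /* Horizontal sum */
    __m256 s  = _mm256_add_ps(sum0, sum1);
    __m128 lo = _mm256_castps256_ps128(s);
    __m128 hi = _mm256_extractf128_ps(s, 1);
    __m128 x  = _mm_add_ps(lo, hi);
    x = _mm_add_ps(x, _mm_movehl_ps(x, x));
    x = _mm_add_ss(x, _mm_movehdup_ps(x));
    return _mm_cvtss_f32(x);
}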

See is there an inverse instruction to the movemask instruction in intel avx2? for more details about turning bitmasks into vector masks, including the broadcast + variable-shift strategy I described above. (Chris Dodd's LUT is good for 4 bits -> 4 dwords, because the table is small. Beyond that you want ALU.)

Peter Cordes
0

You can eliminate the branch like this:

for(int i=0; b && i<64; i++, b>>=1)
    sum += a[i] * (b & 1);

Although this will add an extra mult, at least it won't destroy your pipeline.

Another way to control the branch is to use it the way you are, but also give the compiler a hint. In gcc I guess that's the likely() macro (a wrapper around __builtin_expect). You still use the branch, but that way you are telling the compiler that the branch will usually be taken, and gcc will optimize accordingly.
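
For example, a sketch using the usual wrapper around gcc's __builtin_expect (whether the bit is usually set is an assumption about your data):

#include <stdint.h>

#define LIKELY(x) __builtin_expect(!!(x), 1)

/* Same loop as in the question, with a branch hint (gcc/clang extension). */
float dot_hinted(const float *a, uint64_t b) {
    float sum = 0;
    for (int i = 0; b && i < 64; i++, b >>= 1)
        if (LIKELY(b & 1))      /* tell the compiler the bit is usually set */
            sum += a[i];
    return sum;
}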

Another optimization that can be done is "caching" the dot product. Instead of having a function that calculates the dot product each time, you would keep a variable holding the product, initialized to 0 at the beginning. Every time you insert/remove/update an element of the vector, you would also update the variable holding the result.

cenouro
  • If you want to encourage branchless, write `sum += b&1 ? a[i] : 0;`. An actual FP multiply by 0.0 or 1.0 is normally a bad idea, and with strict FP that's what you'll get. Although ideally a compiler would contract it into an FMA, if available, but still not good. – Peter Cordes Apr 03 '19 at 10:55