4

How can I implement, without a loop, an operation on bitmasks which, for two bitmasks a and b of width n, gives a bitmask c of width 2 * n with the following property:

  • the i-th bit in c is set if and only if there are a j-th bit set in a and a k-th bit set in b with j + k == i (see the reference sketch below)
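
For reference, here is that property spelled out as a naive double loop (a minimal sketch, assuming zero-based bit indices; the multishift() below shifts by one extra position, i.e. it numbers bits starting from 1):

std::uint64_t reference(std::uint32_t a, std::uint32_t b)
{
    std::uint64_t c = 0;
    for (int j = 0; j < 32; ++j) {
        for (int k = 0; k < 32; ++k) {
            if (((a >> j) & 1) && ((b >> k) & 1)) {
                c |= std::uint64_t{1} << (j + k); // bit j of a and bit k of b set bit j + k of c
            }
        }
    }
    return c;
}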

C++ implementation:

#include <bitset>
#include <algorithm>
#include <iostream>

#include <cstdint>
#include <cassert>

#include <x86intrin.h>

std::uint64_t multishift(std::uint32_t a, std::uint32_t b)
{
    std::uint64_t c = 0;
    if (_popcnt32(b) < _popcnt32(a)) {
        std::swap(a, b); // make a the operand with fewer set bits, so the loop runs fewer iterations
    }
    assert(a != 0);
    do {
        c |= std::uint64_t{b} << (_bit_scan_forward(a) + 1); // shift b by the 1-based index of the lowest set bit of a
    } while ((a &= (a - 1)) != 0); // clear the lowest set bit
    return c;
}

int main()
{
    std::cout << std::bitset< 64 >(multishift(0b1001, 0b0101)) << std::endl; // ...0001011010
}

Can it be reimplemented without a loop, using some bit tricks or x86 instructions?

Tomilov Anatoliy
  • Have you checked what assembly a compiler *actually* generates when building with optimizations enabled? Are you *sure* you actually *need* to hand optimize this? Is it time critical? Is it called often? Why is this important? If it's not critically important and a bottleneck, then just write readable code instead of obfuscating it with micro optimizations. – Jesper Juhl Mar 01 '18 at 19:22
  • @JesperJuhl It is time critical. It is used to calculate the sumset for the subset sum problem. Background is [here](https://stackoverflow.com/questions/49051732/boolean-convolution-using-integer-multiplication/49053489#49053489). – Tomilov Anatoliy Mar 01 '18 at 19:23
  • The `std::cout` is probably taking orders of magnitude more time in that code than all the other lines combined. – Jesper Juhl Mar 01 '18 at 19:27
  • @JesperJuhl It seems you don't understand that this is an abstract question. – Tomilov Anatoliy Mar 01 '18 at 19:32
  • The abstract question would be better written as a function that takes two `uint32_t` args and returns a `uint64_t`, not a program that prints a compile-time constant result. Then you could look at the compiler-generated asm with optimization enabled (at least for the case where it doesn't inline into a caller with one of the operands being a constant). – Peter Cordes Mar 04 '18 at 21:18

2 Answers

6

This is like a multiplication that uses OR instead of ADD. As far as I know there is no truly amazing trick. But here's a trick that actually avoids intrinsics rather than using them:

while (a) {
    c |= b * (a & -a); // a & -a isolates the lowest set bit of a; multiplying by it shifts b left by that bit's position
    a &= a - 1;        // clear the lowest set bit
}

This is very similar to your algorithm, but it uses multiplication to shift b left by the trailing-zero count of a; a & -a is a trick that isolates just the lowest set bit as a mask. As a bonus, that expression is safe to evaluate when a == 0, so you can unroll (and/or turn the while into a do/while without a precondition) without nasty edge cases showing up (which is not the case with TZCNT and a shift).
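
If intrinsics are acceptable after all: as the comment below notes, BMI1 blsi computes exactly a & -a and blsr computes a & (a - 1). A sketch of the same loop written with the BMI1 intrinsics (assuming -mbmi; not part of the snippet above):

#include <cstdint>
#include <immintrin.h>

std::uint64_t or_mul_bmi(std::uint32_t a_32, std::uint32_t b_32)
{
    std::uint64_t a = a_32, b = b_32, c = 0;
    while (a) {
        c |= b * _blsi_u64(a); // isolate the lowest set bit of a and feed it straight to the multiply
        a = _blsr_u64(a);      // clear the lowest set bit of a
    }
    return c;
}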


pshufb could be used in parallel table-lookup mode: use a nibble of a to select a sub-table, then use that sub-table to multiply all nibbles of b by that nibble of a in one instruction. For the multiplication proper that is at most 8 pshufbs (or always 8, since there is little point in trying to early-exit with this). It does require some weird setup at the start and some unfortunate horizontal work to finish it off though, so it may not be so great.
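
A rough sketch of just the per-nibble lookup step (hypothetical helper or_mul_one_nibble, not from the answer): a scalar loop builds the 16-entry sub-table for one nibble of a (the "weird setup"), a single _mm_shuffle_epi8 then multiplies all eight nibbles of b by that nibble of a at once, and the recombination of the partial products (the "horizontal stuff") is kept scalar here purely for clarity:

#include <cstdint>
#include <immintrin.h>

// OR-multiply one 4-bit nibble of a with all 32 bits of b using one pshufb lookup.
static std::uint64_t or_mul_one_nibble(std::uint32_t a_nibble, std::uint32_t b)
{
    // Sub-table: table[k] = OR-product of a_nibble with the 4-bit value k (fits in a byte).
    alignas(16) std::uint8_t table[16];
    for (int k = 0; k < 16; ++k) {
        std::uint8_t p = 0;
        for (int j = 0; j < 4; ++j) {
            if (a_nibble & (1u << j)) {
                p |= std::uint8_t(k << j);
            }
        }
        table[k] = p;
    }
    // Put the eight nibbles of b into eight byte lanes (values 0..15) as shuffle indices.
    alignas(16) std::uint8_t idx[16] = {};
    for (int i = 0; i < 8; ++i) {
        idx[i] = (b >> (4 * i)) & 0xF;
    }
    __m128i lut  = _mm_load_si128(reinterpret_cast<const __m128i *>(table));
    __m128i nib  = _mm_load_si128(reinterpret_cast<const __m128i *>(idx));
    __m128i prod = _mm_shuffle_epi8(lut, nib); // eight parallel sub-table lookups

    // "Horizontal" combine, scalar for clarity: partial product i belongs at bit offset 4 * i.
    alignas(16) std::uint8_t out[16];
    _mm_store_si128(reinterpret_cast<__m128i *>(out), prod);
    std::uint64_t c = 0;
    for (int i = 0; i < 8; ++i) {
        c |= std::uint64_t{out[i]} << (4 * i);
    }
    return c;
}

A full product would repeat this for each of the eight nibbles of a, shift each result left by 4 times that nibble's position, and OR everything together.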

harold
  • @Orient: on Skylake, `bsf` has 3 cycle latency and only runs on port 1, i.e. the same performance as `imul`. So BMI1 `blsi` to feed `imul` should have about the same performance as `bsf` to feed `shlx`. And better performance than `bsf` to feed `shl`, because variable-count shifts without BMI2 are 3 uops on Intel CPUs. – Peter Cordes Mar 01 '18 at 19:57
  • @Orient: Your problem is very close to a carryless multiply ([`pclmuludq`](https://github.com/HJLebbink/asm-dude/wiki/PCLMULQDQ)), but you want `or` instead of `xor`. (A carryless multiply is one where `xor` replaces `add` in the standard shift-and-add; `xor` is add-without-carry.) IDK if it's possible to use `pclmuludq` as part of this; probably not :( – Peter Cordes Mar 01 '18 at 20:02
  • I was thinking about the `a + b = (a ^ b) + 2(a & b) = (a | b) + (a & b)` identities, trying to figure out if I could combine `pclmuludq` and normal multiplication in some way to get OR-ed partial products, but I got nowhere with that. – harold Mar 01 '18 at 20:05
3

Since this question is tagged with BMI (Intel Haswell and newer, AMD Excavator and newer), I assume that AVX2 is also of interest here; note that all processors with BMI2 support also have AVX2 support. On modern hardware, such as Intel Skylake, a brute-force AVX2 approach may perform quite well in comparison with a scalar (do/while) loop, unless a or b has only very few bits set. The code below computes all 32 multiplications (see Harold's multiplication idea) without searching for the non-zero bits in the input. After the multiplications everything is OR-ed together.

With random input values of a, the scalar code (see above: Orient's question or Harold's answer) may suffer from branch mispredictions: the number of instructions per cycle (IPC) is lower than the IPC of the branchless AVX2 solution. I did a few tests with the different versions, and it turns out that Harold's code benefits significantly from loop unrolling. In these tests throughput is measured, not latency. Two cases are considered: 1. a random a with on average 3.5 bits set, and 2. a random a with on average 16.0 bits set (50%). The CPU is an Intel Skylake i5-6500.

Time in sec.
                             3.5 bits set  16 bits set
mult_shft_Orient()             0.81          1.51                
mult_shft_Harold()             0.84          1.51                
mult_shft_Harold_unroll2()     0.64          1.58                
mult_shft_Harold_unroll4()     0.48          1.34                
mult_shft_AVX2()               0.44          0.40               

Note that these timings include the generation of the random input numbers a and b. The AVX2 code benefits from the fast vpmuludq instruction: it has a throughput of two per cycle on Skylake.


The code:

/*      gcc -Wall -m64 -O3 -march=broadwell mult_shft.c    */
#include <stdint.h>
#include <stdio.h>
#include <x86intrin.h>

uint64_t mult_shft_AVX2(uint32_t a_32, uint32_t b_32) {

    __m256i  a          = _mm256_broadcastq_epi64(_mm_cvtsi32_si128(a_32));
    __m256i  b          = _mm256_broadcastq_epi64(_mm_cvtsi32_si128(b_32));
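    /* a_32 and b_32 sit zero-extended in the low 32 bits of every 64-bit lane.          */
    /* Each mask below isolates four single bits of b (one bit per 64-bit lane); an      */
    /* isolated bit is a power of two, so the widening 32x32 -> 64 multiply vpmuludq     */
    /* (_mm256_mul_epu32) effectively shifts a left by that bit's position.              */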
                                                           //  0xFEDCBA9876543210  0xFEDCBA9876543210  0xFEDCBA9876543210  0xFEDCBA9876543210     
    __m256i  b_0        = _mm256_and_si256(b,_mm256_set_epi64x(0x0000000000000008, 0x0000000000000004, 0x0000000000000002, 0x0000000000000001));
    __m256i  b_1        = _mm256_and_si256(b,_mm256_set_epi64x(0x0000000000000080, 0x0000000000000040, 0x0000000000000020, 0x0000000000000010));
    __m256i  b_2        = _mm256_and_si256(b,_mm256_set_epi64x(0x0000000000000800, 0x0000000000000400, 0x0000000000000200, 0x0000000000000100));
    __m256i  b_3        = _mm256_and_si256(b,_mm256_set_epi64x(0x0000000000008000, 0x0000000000004000, 0x0000000000002000, 0x0000000000001000));
    __m256i  b_4        = _mm256_and_si256(b,_mm256_set_epi64x(0x0000000000080000, 0x0000000000040000, 0x0000000000020000, 0x0000000000010000));
    __m256i  b_5        = _mm256_and_si256(b,_mm256_set_epi64x(0x0000000000800000, 0x0000000000400000, 0x0000000000200000, 0x0000000000100000));
    __m256i  b_6        = _mm256_and_si256(b,_mm256_set_epi64x(0x0000000008000000, 0x0000000004000000, 0x0000000002000000, 0x0000000001000000));
    __m256i  b_7        = _mm256_and_si256(b,_mm256_set_epi64x(0x0000000080000000, 0x0000000040000000, 0x0000000020000000, 0x0000000010000000));

    __m256i  m_0        = _mm256_mul_epu32(a, b_0);
    __m256i  m_1        = _mm256_mul_epu32(a, b_1);
    __m256i  m_2        = _mm256_mul_epu32(a, b_2);
    __m256i  m_3        = _mm256_mul_epu32(a, b_3);
    __m256i  m_4        = _mm256_mul_epu32(a, b_4);
    __m256i  m_5        = _mm256_mul_epu32(a, b_5);
    __m256i  m_6        = _mm256_mul_epu32(a, b_6);
    __m256i  m_7        = _mm256_mul_epu32(a, b_7);

             m_0        = _mm256_or_si256(m_0, m_1);     
             m_2        = _mm256_or_si256(m_2, m_3);     
             m_4        = _mm256_or_si256(m_4, m_5);     
             m_6        = _mm256_or_si256(m_6, m_7);
             m_0        = _mm256_or_si256(m_0, m_2);     
             m_4        = _mm256_or_si256(m_4, m_6);     
             m_0        = _mm256_or_si256(m_0, m_4);     

    /* Horizontal OR: fold the four 64-bit partial results in m_0 down to a single scalar. */
    __m128i  m_0_lo     = _mm256_castsi256_si128(m_0);
    __m128i  m_0_hi     = _mm256_extracti128_si256(m_0, 1);
    __m128i  e          = _mm_or_si128(m_0_lo, m_0_hi);
    __m128i  e_hi       = _mm_unpackhi_epi64(e, e);
             e          = _mm_or_si128(e, e_hi);
    uint64_t c          = _mm_cvtsi128_si64x(e);
                          return c; 
}


uint64_t mult_shft_Orient(uint32_t a, uint32_t b) {
    uint64_t c = 0;
    do {
        c |= ((uint64_t)b) << _bit_scan_forward(a);   /* note: no +1 here, i.e. zero-based bit indices */
    } while ((a = a & (a - 1)) != 0);
    return c; 
}


uint64_t mult_shft_Harold(uint32_t a_32, uint32_t b_32) {
    uint64_t c = 0;
    uint64_t a = a_32;
    uint64_t b = b_32;
    while (a) {
       c |= b * (a & -a);
       a &= a - 1;
    }
    return c; 
}


uint64_t mult_shft_Harold_unroll2(uint32_t a_32, uint32_t b_32) {
    uint64_t c = 0;
    uint64_t a = a_32;
    uint64_t b = b_32;
    while (a) {
       c |= b * (a & -a);
       a &= a - 1;
       c |= b * (a & -a);
       a &= a - 1;
    }
    return c; 
}


uint64_t mult_shft_Harold_unroll4(uint32_t a_32, uint32_t b_32) {
    uint64_t c = 0;
    uint64_t a = a_32;
    uint64_t b = b_32;
    while (a) {
       c |= b * (a & -a);
       a &= a - 1;
       c |= b * (a & -a);
       a &= a - 1;
       c |= b * (a & -a);
       a &= a - 1;
       c |= b * (a & -a);
       a &= a - 1;
    }
    return c; 
}



int main(){

uint32_t a,b;

/*
uint64_t c0, c1, c2, c3, c4;
a = 0x10036011;
b = 0x31000107;

//a = 0x80000001;
//b = 0x80000001;

//a = 0xFFFFFFFF;
//b = 0xFFFFFFFF;

//a = 0x00000001;
//b = 0x00000001;

//a = 0b1001;
//b = 0b0101;

c0 = mult_shft_Orient(a, b);        
c1 = mult_shft_Harold(a, b);        
c2 = mult_shft_Harold_unroll2(a, b);
c3 = mult_shft_Harold_unroll4(a, b);
c4 = mult_shft_AVX2(a, b);          
printf("%016lX \n%016lX     \n%016lX     \n%016lX     \n%016lX \n\n", c0, c1, c2, c3, c4);
*/

uint32_t rnd = 0xA0036011;
uint32_t rnd_old;
uint64_t c;
uint64_t sum = 0;
double popcntsum =0.0;
int i;

for (i=0;i<100000000;i++){
   rnd_old = rnd;
   rnd = _mm_crc32_u32(rnd, i);      /* simple random generator                                   */
   b = rnd;                          /* the actual value of b has no influence on the performance */
   a = rnd;                          /* `a` has about 50% nonzero bits                            */

#if 1 == 1                           /* reduce number of set bits from about 16 to 3.5                 */
   a = rnd & rnd_old;                                   /* now `a` has about 25 % nonzero bits    */
          /*0bFEDCBA9876543210FEDCBA9876543210 */     
   a = (a & 0b00110000101001000011100010010000) | 1;    /* about 3.5 nonzero bits on average      */                  
#endif   
/*   printf("a = %08X \n", a);                */

//   popcntsum = popcntsum + _mm_popcnt_u32(a); 
                                               /*   3.5 nonz       50%   (time in sec.)  */
//   c = mult_shft_Orient(a, b );              /*      0.81          1.51                  */
//   c = mult_shft_Harold(a, b );              /*      0.84          1.51                */
//   c = mult_shft_Harold_unroll2(a, b );      /*      0.64          1.58                */
//   c = mult_shft_Harold_unroll4(a, b );      /*      0.48          1.34                */
   c = mult_shft_AVX2(a, b );                /*      0.44          0.40                */
   sum = sum + c;
}
printf("sum = %016lX \n\n", sum);

printf("average density = %f bits per uint32_t\n\n", popcntsum/100000000);

return 0;
}

Function mult_shft_AVX2() uses zero-based bit indices, just like Harold's answer. It looks like in your question you start counting bits at 1 instead of 0. You may want to multiply the answer by 2 (shift it left by one) to get the same results.

wim
  • Why not just `_mm256_set1_epi32`? Does explicitly writing `_mm256_broadcastq_epi64(_mm_cvtsi32_si128(a_32))` compile better on something? Although it should be fine, and with `-march=skylake-avx512`, the compiler should probably still be able to optimize it down to `vpbroadcastd ymm0, eax` instead of separate movd/broadcast. Or if the value started in memory (after inlining), hopefully the compiler could optimize away the `mov` and broadcast directly from memory (as cheap as a `movd` load). – Peter Cordes Mar 08 '18 at 07:32
  • BTW, `vpunpckhqdq` is the best choice for that step of a horizontal OR. Saves 1 byte of code size to leave out the `imm8` shift count of `vpsrldq`. Alternative: `movq` / `pextrq` and use integer `or`, but that's more total uops (`pextrq` is 2 uops). – Peter Cordes Mar 08 '18 at 07:42
  • @harold Indeed I didn't try the unrolled loop. This is fixed now. Unrolling helps a lot! – wim Mar 08 '18 at 14:31
  • @PeterCordes What I actually wanted is not `_mm256_set1_epi32` but `_mm256_set_epi32(0, a_32, 0, a_32, 0, a_32, 0, a_32)`. Another possibility would be `_mm256_set1_epi64x((uint64_t)a_32)`, but the `_mm256_broadcastq_epi64(_mm_cvtsi32_si128(a_32))` leads to the best results, [see godbolt link](https://godbolt.org/g/krCK55). Although there might be no advantage at all after inlining, I admit. – wim Mar 08 '18 at 14:33
  • @PeterCordes Fixed the `vpunpckhqdq` in the code, thanks. – wim Mar 08 '18 at 14:51
  • @wim: Oh, that makes sense. I missed the 64-bit vs. 32-bit issue, which should have been obvious with the widening multiply. And wow, all 3 compilers make really braindead code (integer load / `vmovq` / `vbroadcastq`) with a memory operand https://godbolt.org/g/Tx34Ue for the "more natural" `_mm256_set1_epi64x((uint64_t)a_32)` way of writing it. I also tried adding `a_32++;` to simulate the already-zero-extended case, and yes, unfortunately `_mm256_broadcastq_epi64(_mm_cvtsi32_si128(a_32));` does defeat gcc/clang `-march=skylake-avx512`. They fail to use `vpbroadcastq ymm0, rdi`, unlike ICC. – Peter Cordes Mar 08 '18 at 15:05