2

I am new to writing AVX-intrinsics-based code, so I need some help understanding whether my observations are expected. I have two methods implementing distance computations; both take two float arrays plus their dimension and return a float distance. The first method computes a squared Euclidean (L2) distance:

   static float
    compute_l2Square(const void *pVect1v, const void *pVect2v, const void *qty_ptr) {
        float *pVect1 = (float *) pVect1v;
        float *pVect2 = (float *) pVect2v;
        size_t qty = *((size_t *) qty_ptr);
        float __attribute__((aligned(32))) TmpRes[8];
        size_t qty16 = qty >> 4;

        const float *pEnd1 = pVect1 + (qty16 << 4);

        __m256 diff, v1, v2;
        __m256 sum = _mm256_set1_ps(0);

        while (pVect1 < pEnd1) {
            v1 = _mm256_loadu_ps(pVect1);
            pVect1 += 8;
            v2 = _mm256_loadu_ps(pVect2);
            pVect2 += 8;
            diff = _mm256_sub_ps(v1, v2);
            sum = _mm256_add_ps(sum, _mm256_mul_ps(diff, diff));

            v1 = _mm256_loadu_ps(pVect1);
            pVect1 += 8;
            v2 = _mm256_loadu_ps(pVect2);
            pVect2 += 8;
            diff = _mm256_sub_ps(v1, v2);
            sum = _mm256_add_ps(sum, _mm256_mul_ps(diff, diff));
        }

        _mm256_store_ps(TmpRes, sum);
        return TmpRes[0] + TmpRes[1] + TmpRes[2] + TmpRes[3] + TmpRes[4] + TmpRes[5] + TmpRes[6] + TmpRes[7];
    }

The second method computes a bitwise XOR and then counts the number of set bits, i.e. a Hamming distance:

static float compute_hamming(const void* __restrict pVect1v,
                             const void* __restrict pVect2v,
                             const void* __restrict qty_ptr) {
    float *pVect1 = (float *) pVect1v;
    float *pVect2 = (float *) pVect2v;
    size_t qty = *((size_t *) qty_ptr);
    uint64_t __attribute__((aligned(32))) TmpRes[4];
    size_t qty16 = qty >> 4;

    const float *pEnd1 = pVect1 + (qty16 << 4);
    int res = 0;
    __m256 diff, v1, v2;
    while (pVect1 < pEnd1) {
        v1 = _mm256_loadu_ps(pVect1);
        pVect1 += 8;
        v2 = _mm256_loadu_ps(pVect2);
        pVect2 += 8;
        diff = _mm256_xor_ps(v1, v2);
        _mm256_store_si256((__m256i *) TmpRes, _mm256_castps_si256(diff));
        res += __builtin_popcountll(TmpRes[0]) + __builtin_popcountll(TmpRes[1])
             + __builtin_popcountll(TmpRes[2]) + __builtin_popcountll(TmpRes[3]);

        v1 = _mm256_loadu_ps(pVect1);
        pVect1 += 8;
        v2 = _mm256_loadu_ps(pVect2);
        pVect2 += 8;
        diff = _mm256_xor_ps(v1, v2);
        _mm256_store_si256((__m256i *) TmpRes, _mm256_castps_si256(diff));
        res += __builtin_popcountll(TmpRes[0]) + __builtin_popcountll(TmpRes[1])
             + __builtin_popcountll(TmpRes[2]) + __builtin_popcountll(TmpRes[3]);
    }
    return res;
}

For the same number of bits, the L2-squared distance computation is much faster than the Hamming computation, almost 2x-4x (i.e. computing the L2 distance over 512 bits, which is 16 floats, is faster than computing Hamming over the same 16 floats). I am not sure if this is expected. It seems to me that the popcount and the store to the temporary array are causing the slowness, because when I modify the L2 distance computation to do an XOR instead of a subtraction (i.e. replace _mm256_sub_ps with _mm256_xor_ps), the computation gets even faster.

I am benchmarking on macOS, on a CPU with AVX support. Another observation: a non-AVX implementation of Hamming distance using just a loop, sum += popcount(vec_a[i] ^ vec_b[i]), gives numbers similar to the AVX implementation. As a sanity check, I also verified that the AVX instructions and methods are actually being invoked.

The non-vectorized implementation:

static float compute_hamming(const void* __restrict pVect1,
                     const void* __restrict pVect2,
                     const void* __restrict qty_ptr) {
  size_t qty = *((size_t *)qty_ptr);
  int res = 0;


  const float *pVect1LL = (const float *)pVect1;
  const float *pVect2LL = (const float *)pVect2;
  for (unsigned i = 0; i < qty; i = i + 2) {
    if (i + 1 == qty) {
      unsigned int v1;
      unsigned int v2;
      memcpy(&v1, &pVect1LL[i], sizeof(float));
      memcpy(&v2, &pVect2LL[i], sizeof(float));
      res += __builtin_popcount(v1 ^ v2);
      break;
    }
    uint64_t v1;
    uint64_t v2;
    memcpy(&v1, &pVect1LL[i], sizeof(float) * 2);
    memcpy(&v2, &pVect2LL[i], sizeof(float) * 2);

    res += __builtin_popcountll(v1 ^ v2);
  }

  return res;
}

I need some help and recommendations on improving the performance, since the distance computation method is the bottleneck.

user179156
  • IIUC, for 1024 bits or less the "cpu" running time seems to be the best? Am I misinterpreting something? – user179156 Feb 22 '22 at 07:46

2 Answers

4

You could speed up your l2Square version more by using _mm256_fmadd_ps, if you're targeting Haswell and newer. (And Piledriver, except you're on a Mac and you probably don't care about AMD Hackintosh machines.)

Equally or more importantly, by using two separate __m256 sum0, sum1 accumulators to hide FP latency, adding them together at the end before reducing. (With an efficient hsum, not just store and then scalar add of each element in turn.)
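A minimal sketch of what both of those changes could look like together (untested; assumes qty is still a multiple of 16 floats, an FMA-enabled build such as -mavx2 -mfma, and the function name is just illustrative):

#include <immintrin.h>
#include <stddef.h>

// Sketch: FMA + two independent accumulators + a proper hsum, no store to a temp array.
static float compute_l2Square_fma(const void *pVect1v, const void *pVect2v, const void *qty_ptr) {
    const float *pVect1 = (const float *) pVect1v;
    const float *pVect2 = (const float *) pVect2v;
    size_t qty = *((const size_t *) qty_ptr);
    const float *pEnd1 = pVect1 + (qty & ~(size_t)15);   // multiple of 16 floats

    __m256 sum0 = _mm256_setzero_ps();   // two accumulators
    __m256 sum1 = _mm256_setzero_ps();   // to hide FMA latency

    while (pVect1 < pEnd1) {
        __m256 d0 = _mm256_sub_ps(_mm256_loadu_ps(pVect1),     _mm256_loadu_ps(pVect2));
        __m256 d1 = _mm256_sub_ps(_mm256_loadu_ps(pVect1 + 8), _mm256_loadu_ps(pVect2 + 8));
        sum0 = _mm256_fmadd_ps(d0, d0, sum0);   // sum0 += d0*d0
        sum1 = _mm256_fmadd_ps(d1, d1, sum1);   // sum1 += d1*d1
        pVect1 += 16;
        pVect2 += 16;
    }

    __m256 sum = _mm256_add_ps(sum0, sum1);
    // hsum: 256 -> 128 -> 64 -> 32 bits
    __m128 lo = _mm_add_ps(_mm256_castps256_ps128(sum), _mm256_extractf128_ps(sum, 1));
    lo = _mm_add_ps(lo, _mm_movehl_ps(lo, lo));
    lo = _mm_add_ss(lo, _mm_movehdup_ps(lo));
    return _mm_cvtss_f32(lo);
}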


Without hardware SIMD popcount (AVX512 VPOPCOUNTDQ), yes of course it's going to be slower, especially if the compiler doesn't vectorize those per-element __builtin_popcountll(vec[0]) + ... into SIMD popcount using a nibble LUT or something (vpshufb).
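For comparison, on a CPU that does have AVX512 VPOPCNTDQ, the whole XOR + popcount loop collapses to something like this minimal sketch (untested; assumes the byte count is a multiple of 64 and an -mavx512vpopcntdq build):

#include <immintrin.h>
#include <stdint.h>
#include <stddef.h>

static uint64_t xor_popcount_avx512(const void *a, const void *b, size_t n) {
    const char *pa = (const char *)a, *pb = (const char *)b;
    __m512i acc = _mm512_setzero_si512();
    for (size_t i = 0; i < n; i += 64) {
        __m512i x = _mm512_xor_si512(_mm512_loadu_si512(pa + i),
                                     _mm512_loadu_si512(pb + i));
        acc = _mm512_add_epi64(acc, _mm512_popcnt_epi64(x));  // per-qword popcount
    }
    return (uint64_t)_mm512_reduce_add_epi64(acc);            // hsum of 8 qwords
}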

The way you're doing it is actually making things worse for clang, by getting it to do SIMD XOR but then actually extract to scalar instead of just using scalar XOR and popcnt in the first place; note the vpextrq instructions in the asm. Clang can auto-vectorize __builtin_popcountll in a loop (in a not-terrible but not great way), but not like this. (Actually, SIMD XOR and then scalar extract for popcnt is not nearly as bad as I thought, but only if you use 128-bit vectors; see the "sse-cpu" results from Wojciech Mula's git repo linked below where even SSE for pure loads doesn't slow it down much.)

For example, clang auto-vectorizes this with YMM vectors inside the loop. (Godbolt showing this and your code) Unfortunately it does a bad job with char* arrays, and with unsigned instead of unsigned long it only uses XMM vectors.

float compute_hamming_autovec(const unsigned long* __restrict a, 
                              const unsigned long* __restrict b,
                              size_t qty)    // by value to keep it simpler, IDK why you'd want to pass this by reference with a void*
{
    //const unsigned char *__restrict a = pVect1v, *__restrict b = pVect2v;
    unsigned long sum = 0;
    for (size_t i=0 ; i<qty*4 ; i++){
        unsigned long tmp1=a[i], tmp2=b[i];
        //memcpy(&tmp1, a+i, 4);
        //memcpy(&tmp2, b+i, 4);
        sum += __builtin_popcountll(tmp1 ^ tmp2);
    }
    return sum;   // why are we returning this integer count as a float?  IDK
}

Using memcpy for unaligned aliasing-safe loads from char* also seemed to defeat vectorization, or some variation on this used scalar load and xor; you may need typedef uint64_t aliasing_unaligned_u64 __attribute__((aligned(4), may_alias)). (I used aligned(4) on the assumption you're pointing it at aligned floats.)
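A hedged sketch of how such a typedef might be used over float data (untested; the function name is illustrative and it assumes the float count is even):

#include <stdint.h>
#include <stddef.h>

// GNU C: a 4-byte-aligned, may_alias uint64_t, so pairs of floats can be read
// as raw 64-bit chunks without violating strict aliasing.
typedef uint64_t aliasing_unaligned_u64 __attribute__((aligned(4), may_alias));

static unsigned long hamming_scalar(const float *a, const float *b, size_t qty) {
    unsigned long sum = 0;
    for (size_t i = 0; i < qty; i += 2) {      // 2 floats = 64 bits per step
        uint64_t x = *(const aliasing_unaligned_u64 *)(a + i)
                   ^ *(const aliasing_unaligned_u64 *)(b + i);
        sum += __builtin_popcountll(x);
    }
    return sum;
}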

However, your best bet is to manually vectorize the SIMD popcount. See https://github.com/WojciechMula/sse-popcount/. That also avoids any futzing with types to make strict-aliasing-safe code that will auto-vectorize nicely over arrays of float data.

For large counts, it's possible to go even faster than a good implementation of using just vpshufb ymm / vertical sum inner loop / vpsadbw to hsum to qwords before it can overflow. For example, the Harley Seal SIMD popcount code in that repo is about 20% faster on Skylake than the best "avx-lookup" implementation from the same repo, for arrays of size 4096 bytes. (And twice as fast as "avx2-lookup-original"; I forget what the difference was.) See results for clang on Skylake

Changing popcnt_AVX2_lookup to take two pointers and _mm256_xor_si256 is trivial, just replace the __m256i vec = _mm256_loadu with those couple statements. Or do the same with Harley-Seal if your arrays are large enough to warrant it; it shouldn't cause any extra register pressure since it can compile to a load / memory-source-vpxor.

Also tweak its unroll factor to be good with your typical problem sizes.
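For reference, the overall shape of that two-pointer XOR change looks something like this sketch (not the repo's code, untested; it does the vpsadbw reduction every iteration for simplicity instead of the repo's unrolled, overflow-aware inner loop, and it assumes the byte count is a multiple of 32):

#include <immintrin.h>
#include <stdint.h>
#include <stddef.h>

static uint64_t xor_popcount_avx2(const uint8_t *a, const uint8_t *b, size_t n) {
    // Nibble LUT: popcount of 0..15, duplicated across both 128-bit lanes.
    const __m256i lookup = _mm256_setr_epi8(
        0,1,1,2, 1,2,2,3, 1,2,2,3, 2,3,3,4,
        0,1,1,2, 1,2,2,3, 1,2,2,3, 2,3,3,4);
    const __m256i low_mask = _mm256_set1_epi8(0x0f);
    __m256i acc = _mm256_setzero_si256();

    for (size_t i = 0; i < n; i += 32) {
        __m256i vec = _mm256_xor_si256(_mm256_loadu_si256((const __m256i *)(a + i)),
                                       _mm256_loadu_si256((const __m256i *)(b + i)));
        __m256i lo  = _mm256_and_si256(vec, low_mask);
        __m256i hi  = _mm256_and_si256(_mm256_srli_epi32(vec, 4), low_mask);
        __m256i cnt = _mm256_add_epi8(_mm256_shuffle_epi8(lookup, lo),
                                      _mm256_shuffle_epi8(lookup, hi));
        // Horizontal-add bytes to qwords each iteration (simple, can't overflow).
        acc = _mm256_add_epi64(acc, _mm256_sad_epu8(cnt, _mm256_setzero_si256()));
    }

    // Reduce the four qword lanes to one scalar.
    __m128i s = _mm_add_epi64(_mm256_castsi256_si128(acc),
                              _mm256_extracti128_si256(acc, 1));
    return (uint64_t)_mm_cvtsi128_si64(s) + (uint64_t)_mm_extract_epi64(s, 1);
}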


Since small size is apparently common for your use-case (which I didn't realize originally):

Another thing to consider with your real use case is how frequently you'll have odd sizes. If AVX2-lookup is only good with a multiple of the unroll factor, and needs unrolling to keep up, you might end up with a lot of your inputs spending a lot of time in its fallback path. So it would either be important to make that efficient, or be a good reason to drop it and just use SSE2 XOR + scalar popcnt which can easily do 16-byte granularity without a problem.
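That fallback can be as simple as the following sketch (untested; _mm_extract_epi64 is SSE4.1, which any AVX-capable CPU has, and it assumes the byte count is a multiple of 16):

#include <immintrin.h>
#include <stdint.h>
#include <stddef.h>

static uint64_t xor_popcount_sse(const uint8_t *a, const uint8_t *b, size_t n) {
    uint64_t sum = 0;
    for (size_t i = 0; i < n; i += 16) {
        __m128i x = _mm_xor_si128(_mm_loadu_si128((const __m128i *)(a + i)),
                                  _mm_loadu_si128((const __m128i *)(b + i)));
        sum += __builtin_popcountll((uint64_t)_mm_cvtsi128_si64(x));      // low qword  (movq)
        sum += __builtin_popcountll((uint64_t)_mm_extract_epi64(x, 1));   // high qword (pextrq)
    }
    return sum;
}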

Peter Cordes
  • IIUC, for 1024 bits or less the "cpu" running time seems to be the best? Am I misinterpreting something? Most of my inputs will be less than 2048 bits, so does it still make sense to try SIMD implementations? – user179156 Feb 22 '22 at 08:04
  • A few noob questions: could you please elaborate on why SIMD XOR and then extract to scalar is worse than using scalar XOR and popcnt? Is this only for clang, or gcc too? Why is vpextrq bad? The autovec implementation you added is very similar to what I had previously; I tried adding an AVX implementation hoping to squeeze out some perf gains for my inputs, which are usually < 2048 bits. I added the original implementation as well. – user179156 Feb 22 '22 at 08:05
  • @user179156: Yes, in those results "cpu" (scalar `popcntq`) is faster than "avx2-lookup" for array sizes up to 128 bytes (1024 bits). If your array sizes are normally small, you could tweak the unroll factor in that implementation. With the extra benefit of doing SIMD XOR for two input arrays, there's more benefit to SIMD, although the "builtin-popcnt-movdq" and "sse-cpu" versions show that loading (or load+xor) with SIMD and then scalar extract is less of a disaster than I expected, as long as it's extracting in 64-bit chunks. – Peter Cordes Feb 22 '22 at 08:07
  • The "builtin-popcnt" numbers mirror the "avx2-lookup" numbers for clang because it auto-vectorizes in a very similar way for larger counts. – Peter Cordes Feb 22 '22 at 08:08
  • @user179156: The same repo has GCC5.3 results: https://github.com/WojciechMula/sse-popcount/blob/master/results/skylake/skylake-i7-6700-gcc5.3.0-avx2.rst for Skylake, quite similar in a lot of ways, except not auto-vectorizing "builtin-popcnt". And not unrolling for you, so "builtin-popcnt-unrolled " wins from 64 to 256B. – Peter Cordes Feb 22 '22 at 08:14
  • @user179156: Your loop unrolls by 2 vectors, 64 bytes, without cleanup for odd sizes, so I assumed large sizes were normal, hence my answer talking about Harley-Seal. Anyway, for small sizes especially, not having to check for scalar cleanup helps SIMD relatively more, less code outside the loop. **You can definitely tune the unroll factor in "avx2-lookup" to make sure smaller inputs can still use the main loop**, at the cost of more work for larger inputs. – Peter Cordes Feb 22 '22 at 08:15
  • @user179156: Anyway, if you want to try out SIMD + scalar unpack, it looks like SSE2 is much better than AVX2: it takes extra shuffles to get data out of the high half of a 256-bit YMM register. That's what the "avx2-cpu" version is doing, and you can see it's never optimal, losing quite badly at some sizes. The presence of `vpextrq` + scalar `popcntq` indicates that, highly non-optimal for large array sizes (which is what I was assuming you had). But also pretty crap for small sizes vs. 128-bit vectors where `vmovq` + `vpextrq` will get the job done. – Peter Cordes Feb 22 '22 at 08:21
  • `vpextrq` costs 2 uops, one of them a shuffle (port 5 only on Intel CPUs). https://uops.info/ and https://agner.org/optimize/ – Peter Cordes Feb 22 '22 at 08:21
  • Thanks a lot for the prompt response. For my AVX implementation, yes, I didn't add the odd-size handling yet because I was trying to gauge how much perf gain I can get with just favourable input sizes of 256/512/1024. Will be adding that later. – user179156 Feb 22 '22 at 08:26
  • To clarify one of the points in the comment above, regarding the "builtin-popcnt-movdq" and "sse-cpu" versions: the former just uses scalar XOR and popcnt on int32/64, and the latter uses SIMD XOR and scalar popcnt? – user179156 Feb 22 '22 at 08:31
  • @user179156: None of those linked benchmarks include any xor; they're loading from one array. I think the description (scroll up on the results page) is saying that both of those use 128-bit SSE loads, and extract to 64-bit scalar for either `_popcnt64` or `__builtin_popcntll`. The point of testing that is to see how compilers do, not just CPUs, because compilers are part of the picture. Of course the repo has source code, so just check that if you want to be sure, or even compile it with your compiler to see what happens in the case that actually matters for your current usage. – Peter Cordes Feb 22 '22 at 08:35
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/242258/discussion-between-user179156-and-peter-cordes). – user179156 Feb 22 '22 at 08:41
  • @user179156: Another thing to consider with your real use case is how frequently you'll have odd sizes. If AVX2-lookup is only good with a multiple of the unroll factor, and needs unrolling to keep up, you might end up with a lot of your inputs spending a lot of time in its fallback path. So it would either be important to make that efficient, or be a good reason to drop it and just use SSE2 XOR + scalar popcnt which can easily do 16-byte granularity without a problem. – Peter Cordes Feb 22 '22 at 08:42
  • I tried the two-accumulator optimization (and the efficient hsum from the links you shared) for l2square and didn't really see any gains at all. I was computing on 256-dimensional floats on a Cascade Lake, which has AVX-512 as well, but I was calling the AVX2 implementation. I also looked at the godbolt output for the current implementation vs. the one you suggested: https://godbolt.org/z/qT5GzGW59 . I don't see any difference in the assembly; also, I wonder why the compiler doesn't automatically do this multiple-accumulator trick for us. By the same logic, should I also have used different variables for "diff"? – user179156 Feb 24 '22 at 07:58
  • 1
    @user179156: If your arrays are small enough for OoO exec to overlap execution across separate calls, that can give the CPU enough independent work. Also, since you used clang without `-ffp-contract=fast`, it didn't combine mul/add intrinsics into FMA for you, so there's more work for each pair of vectors, less of a gap between throughput vs. latency of the critical path. Or if you're getting cache misses, that would slow things down enough to let the CPU catch up on the dep chain. – Peter Cordes Feb 24 '22 at 08:05
  • 1
    @user179156: An optimizing compiler *can* optimize with multiple accumulators if you tell it to pretend that FP math is associative, with `-ffast-math`. Clang does when auto-vectorizing scalar code; I forget if it will with intrinsics. Otherwise it's not allowed to group the FP additions in a different order. If this was integer math, clang could use multiple accumulators even without `-ffast-math` because integer addition *is* associative. – Peter Cordes Feb 24 '22 at 08:08
  • @user179156: Why not different variables for diff? Read the top of my answer on [Why does mulss take only 3 cycles on Haswell, different from Agner's instruction tables? (Unrolling FP loops with multiple accumulators)](https://stackoverflow.com/q/45113527) again, where I explain that write-only access to a register isn't a true dependency, and register renaming lets the hardware treat it as independent of previous usage. In your code, `vmovups ymm2, mem` / `vsubps ymm2, ymm2, mem` starts a new dep chain for `diff` every time you do `diff = _mm256_sub_ps(v1, v2);` - no use of the old value. – Peter Cordes Feb 24 '22 at 08:11
  • Thanks a ton for the valuable information and clear explanations; I really appreciate all the detailed and in-depth answers. What about gcc: do I need to set -ffp-contract=fast and -ffast-math? – user179156 Feb 24 '22 at 08:47
  • https://godbolt.org/z/3bYTMjb3s : it seems like with gcc, even when not using -ffp-contract=fast, it is still able to generate an FMA instruction. Actually it fuses the mul/add and is able to do it with just 1 FMA instruction. If I use 2 accumulators it needs 2 FMAs and a final add of the 2 accumulators, whereas the non-multiple-accumulator version just uses 1 mul and 1 FMA instead; does that mean the non-multiple-accumulator generated code is more optimized and faster? – user179156 Feb 24 '22 at 10:16
  • 1
    @user179156: Right, GCC defaults to -ffp-contract=fast (violating the ISO C standard by contracting across expressions), clang doesn't. You used `-ffast-math` which already implies `-ffp-contract=fast`, though! Anyway, yeah multiple accumulators has slightly more code size, with an extra add or three after the loop. But static code-size isn't performance! Out-of-order exec is limited by dependency chain latency, so for non-tiny arrays having 2 accumulators lets it run twice as fast. (If it's hot in L1d or L2 cache). – Peter Cordes Feb 24 '22 at 10:35
  • 1
    @user179156: Also, it seems GCC actually does a *worse* job of optimizing the version you linked: there are five FMA-port instructions in the loop: 2x `vsubps`, 1x `vmulps`, 1x `vaddps`, and 1x `vfmadd...ps`. GCC failed to contract one of the mul/add pairs. But with separate accumulators, https://godbolt.org/z/jT47doa89 (I used `#define` macros to hack up a version that could compile either way), we get 2x `vsubps` and 2x `vfmadd...ps`, only 4 total uops for ports 0 and 1 on Skylake. (GCC still failed to fold one of the loads into a memory source operand for `vsubps` though, wasting a uop) – Peter Cordes Feb 24 '22 at 10:37
  • 1
    @user179156: an extra `vxorps` zeroing instruction ahead of the loop, and an extra `vaddps` after the loop, is a small price to pay for a smaller loop with twice the instruction-level parallelism for the FP work. – Peter Cordes Feb 24 '22 at 10:38
  • Current GCC11.2 has the same missed optimizations. But clang (https://godbolt.org/z/sY479WTPT) doesn't miss any FP contractions, and makes smaller code but uses indexed addressing modes, costing more front-end uops on Haswell and Skylake (and probably also Ice Lake). ([Micro fusion and addressing modes](https://stackoverflow.com/q/26046634)) – Peter Cordes Feb 24 '22 at 10:42
  • Regarding `GCC failed to contract one of the mul/add pairs`: which one? Also, what is this `vaddps ymm2, ymm2, ymm0` for? I don't see why this instruction is needed in the non-multiple-accumulator version; why can't the instruction before it, `vfmadd132ps ymm0, ymm1, ymm0`, directly use ymm2 as the destination? – user179156 Feb 24 '22 at 10:59
  • @user179156: Oh, I see what GCC is doing. With `-ffast-math`, it's re-arranging the single-accumulator version into `sum += diff1*diff1 + diff2*diff2` so the critical path latency actually *is* still the same, one add or FMA (which have the same latency on SKL and Cascade Lake). That's why it can't do 2 contractions, because one of the two multiplies has to be done separately, before the first FMA. So its `sum += fma(diff2, diff2, diff1*diff1)`. The `vaddps ymm2, ymm2, ymm0` is obviously adding into `sum`; you can see that's the same register it zeroed ahead of the loop, and hsums after. – Peter Cordes Feb 24 '22 at 11:09
  • It's still a missed optimization for GCC to do `vmovups ymm3, YMMWORD PTR [rsi+32]` separately from `vsubps`, though. Anyway, for larger arrays with more unrolling, multiple accumulators saves instructions since it allows more contraction; more FP work done by the same number of uops. If you're not unrolling by 8 (latency of 4 / recip-throughput of 0.5 = latency*bandwidth product of 8 FMAs in flight in the pipeline), then GCC spending more uops to shorten the critical path latency is a good trick. – Peter Cordes Feb 24 '22 at 11:12
  • Without `-ffast-math`, GCC can't do that, so it does just contract both into FMAs, both on the critical path for YMM2. But still the missed optimization of not folding one of the loads into a memory operand for `vsubps` – Peter Cordes Feb 24 '22 at 11:14
  • https://godbolt.org/z/6Yzaqzsh7 : this is the implementation of Hamming using AVX-512 with the SIMD popcount (AVX512 VPOPCNTDQ). Performance is still not close to l2square when operating on a similar number of bits (gcc 9.3). Wondering why l2square is faster now that we have SIMD popcnt as well. – user179156 Feb 25 '22 at 22:36
  • @user179156: You're now using 512-bit vectors for that; is the max turbo frequency penalty significant? Or a warm-up effect where it lowers throughput without switching frequency? [SIMD instructions lowering CPU frequency](https://stackoverflow.com/q/56852812) Does using a less terrible horizontal sum help for small input array sizes? https://godbolt.org/z/q6Mbzs49v is a lot fewer shuffles and adds than your asm compiled to, see [Fastest method to calculate sum of all packed 32-bit integers using AVX512 or AVX2](https://stackoverflow.com/q/60108658) – Peter Cordes Feb 25 '22 at 22:54
  • @user179156 Why are you returning the integer count as a float, anyway? That conversion seems like useless work. Anyway, https://uops.info/ says `VPOPCNTQ (ZMM, ZMM)` is 1 uop with 3-cycle latency for port 5 on ICL/ICX, so it should be about as efficient as your FP l2square. If it's not, I'd suspect problems with your benchmarking methodology, especially warm-up effects for 512-bit vectors, or possibly anything else in [Idiomatic way of performance evaluation?](https://stackoverflow.com/q/60291987) – Peter Cordes Feb 25 '22 at 22:54
  • The int-to-float conversion is for now just an easy hack to conform to the distance-function return type the library implements. In the future we can modify that if it turns out to be a major perf issue. – user179156 Feb 25 '22 at 22:58
  • Is there an equivalent version of `_mm512_reduce_add_epi64` for 256 bits? – user179156 Feb 25 '22 at 23:26
  • 1
    Apparently no, but I already linked [Fastest method to calculate sum of all packed 32-bit integers using AVX512 or AVX2](https://stackoverflow.com/q/60108658) in my previous comment. It's not like these functions are hard to write, just shuffle high -> low and add until you're down to 1 element. – Peter Cordes Feb 25 '22 at 23:36
0

Yeah, your observations are expected. Your code for Euclidean is more or less OK, but your code for Hamming is very inefficient.

Since you mentioned AVX1 but not AVX2, I assume you don’t have AVX2. In that case, I would do it like this (untested):

// Count set bits in every byte,
// add slices of 8 bytes together into a vector of two int64 lanes.
inline __m128i popcntSse( __m128i vec )
{
    // Isolate low and high 4 bit pieces from each byte
    const __m128i lowMask = _mm_set1_epi8( 0xF );
    __m128i a = _mm_and_si128( lowMask, vec );
    __m128i b = _mm_andnot_si128( lowMask, vec );
    b = _mm_srli_epi32( b, 4 );

    // Apply the lookup table
    const __m128i lookup = _mm_setr_epi8(
    //  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
        0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4
    );
    a = _mm_shuffle_epi8( lookup, a );
    b = _mm_shuffle_epi8( lookup, b );

    // Add two pieces together; adding 2 numbers [ 0 .. 4 ] each, not gonna overflow
    __m128i res = _mm_add_epi8( a, b );

    // Return horizontal sum of bytes
    return _mm_sad_epu8( res, _mm_setzero_si128() );
}

static float computeHammingDistance( const float* p1, const float* p2, size_t count )
{
    const float* const p1EndAligned = p1 + ( count / 4 ) * 4;
    const size_t remainder = ( count % 4 );

    __m128i acc = _mm_setzero_si128();
    // Process most of these values doing 4 of them per iteration
    while( p1 < p1EndAligned )
    {
        __m128i v1 = _mm_loadu_si128( ( const __m128i* )p1 );
        __m128i v2 = _mm_loadu_si128( ( const __m128i* )p2 );
        p1 += 4;
        p2 += 4;
        __m128i xx = _mm_xor_si128( v1, v2 );
        acc = _mm_add_epi64( acc, popcntSse( xx ) );
    }

    if( remainder != 0 )
    {
        __m128i v1, v2;
        // Load 1 .. 3 32-bit values into vectors
        switch( remainder )
        {
        case 1:
            v1 = _mm_cvtsi32_si128( *(const int*)( p1 ) );
            v2 = _mm_cvtsi32_si128( *(const int*)( p2 ) );
            break;
        case 2:
            v1 = _mm_cvtsi64_si128( *(const int64_t*)( p1 ) );
            v2 = _mm_cvtsi64_si128( *(const int64_t*)( p2 ) );
            break;
        case 3:
            v1 = _mm_cvtsi64_si128( *(const int64_t*)( p1 ) );
            v2 = _mm_cvtsi64_si128( *(const int64_t*)( p2 ) );
            v1 = _mm_insert_epi32( v1, *(const int*)( p1 + 2 ), 2 );
            v2 = _mm_insert_epi32( v2, *(const int*)( p2 + 2 ), 2 );
            break;
        }

        __m128i xx = _mm_xor_si128( v1, v2 );
        acc = _mm_add_epi64( acc, popcntSse( xx ) );
    }

    // Horizontally add both lanes in the accumulator
    uint64_t result = (uint64_t)_mm_cvtsi128_si64( acc ) + (uint64_t)_mm_extract_epi64( acc, 1 );

    // Convert to float;
    // note you will lose precision after about 16.7 million set bits due to FP32 on output.
    return (float)result;
}
Soonts
  • Thanks for the suggestion. I checked on my local Mac using sysctl -a | grep machdep.cpu.features and I don't see it: machdep.cpu.features: FPU VME DE PSE TSC MSR PAE MCE CX8 APIC SEP MTRR PGE MCA CMOV PAT PSE36 CLFSH DS ACPI MMX FXSR SSE SSE2 SS HTT TM PBE SSE3 PCLMULQDQ DTES64 MON DSCPL VMX EST TM2 SSSE3 FMA CX16 TPR PDCM SSE4.1 SSE4.2 x2APIC MOVBE POPCNT AES PCID XSAVE OSXSAVE SEGLIM64 TSCTMR AVX1.0 RDRAND F16C. I am assuming no AVX2. – user179156 Feb 22 '22 at 18:59
  • Could you please elaborate a bit on the Hamming code being inefficient? Trying to learn more on this. Also, I would be running the actual code on cloud Linux machines with the latest hardware, so if you have other suggestions that may be much more optimized when AVX2 and AVX-512 are available, I'm happy to try them out as well. – user179156 Feb 22 '22 at 19:01
  • In my original code I do use __m256i, so why does it work if my machine doesn't support these? I thought AVX2 added the integer support and AVX1 just supported floating point, but my original inefficient code does compute correct results. – user179156 Feb 22 '22 at 19:04
  • @user179156 “elaborate a bit on hamming being inefficient” I think the main issue is too many loads and stores. You’re using 2x the bandwidth: you load from source vectors, store the XOR, then load back with scalar instructions. What’s worse, the specific pattern for the last two is load-after-store. Note that XOR and other bitwise ops are borderline free, even AVX2 versions of them which handle 32 bytes at once: all CPUs have a single cycle of latency, and most CPUs can run several of them every clock. – Soonts Feb 22 '22 at 19:30
  • @user179156 “if you do have other suggestions that may be much more optimized if we have avx2” in that case you can do twice as much work with every instruction: use `_mm256_loadu_si256`, `_mm256_and[not]_si256`, `_mm256_shuffle_epi8`, `_mm256_sad_epu8`, etc. The remainder and final reduction become slightly more complicated, though: the remainder has [ 1 .. 7 ] scalars, and the accumulator has 4 lanes instead of 2. – Soonts Feb 22 '22 at 19:31
  • “I thought avx2 added the integer support and avx1 just supported floating” that’s generally true but some integer instructions, like `_mm256_store_si256` or `_mm256_testz_si256` are in AVX1 subset. No idea why, only Intel knows. – Soonts Feb 22 '22 at 19:31
  • Regarding loads and stores: shouldn't the compiler optimize them away by just using registers? Does it really load and store to memory? Also, regarding loads from the original vectors into AVX registers: those are pretty fast, because the Euclidean version does the same, except for the store to the temp int array and the popcnt, which I thought would not actually result in a load and store and would use registers only. – user179156 Feb 22 '22 at 20:27
  • @user179156 “should not the compiler optimize them away by just using registers only” maybe they should, but I don’t think I ever saw them doing that. “does it really loads and stores to memory” yes but that’s not too important, the CPU doesn’t normally wait for RAM transaction to complete. Still, it really loads and stores to L1D cache, and waits for both latencies before being able to proceed with `popcnt` instructions. – Soonts Feb 22 '22 at 20:45
  • 1
    I tried the implementation above but didn't observe any gains. – user179156 Feb 22 '22 at 21:10
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/242284/discussion-between-user179156-and-soonts). – user179156 Feb 22 '22 at 21:22
  • @Soonts: VPSHUFB LUTs are barely break-even for XMM registers, vs. 1/clock scalar 64-bit `popcnt`. Good scalar code using mov / xor reg,mem / popcnt / add can xor+count 64 bits per clock cycle on Sandybridge (which the OP apparently has), although that leaves no room for loop overhead (4 uops / clock for the front-end), and it requires a non-indexed addressing mode for the XOR to keep the memory-source XOR micro-fused through issue/rename (since this is pre-Haswell). So 2 pointer increments of loop overhead + cmp/jcc; compilers don't know the trick of indexing one array relative to the other. – Peter Cordes Feb 22 '22 at 22:49
  • Your way is 10 front-end uops for 2 qwords with AVX1: vmovdqu / vpxor mem / vpand / vpandn / vpsrld / 2x vpshufb / vpaddb / vpsadbw / vpaddq. So I'm not surprised @user179156 didn't see any speedup from this vs. pure scalar 64-bit. This is like "sse-lookup" or "sse-lookup-original" (no unroll) in WojciechMula's results for Sandybridge https://github.com/WojciechMula/sse-popcount/blob/master/results/sandybridge-e/sandybridgeE-i7-3930k-g%2B%2B5.3-avx.rst where "cpu" (`_popcnt64`) and `__builtin_popcountll` are always faster. – Peter Cordes Feb 22 '22 at 22:55
  • Of course, SIMD for the XOR work helps some, but not enough to make it worth doing XMM `vpshufb` popcount I think, especially without unrolling. Wojciech's results for "builtin-popcnt-movdq" / "sse-cpu" show that SIMD load + scalar extract to feed popcount might be ok, unless the compiler is optimizing away the SIMD in those source versions. – Peter Cordes Feb 22 '22 at 22:57
  • Oh, and 9 of those 10 uops need SIMD ALU execution units, and there are only 3 of those. With good scheduling, that should keep all 3 of them busy, though, since most can run on enough different ports. Still that's 3 cycles for 2 xor+count ops, vs. 2 for 2 with scalar, on an Intel CPU with AVX1 but not AVX2 (i.e. Sandybridge or Ivybridge). It's a Mac so we know it's Intel. With unrolling for scalar, loop overhead should be the same 2 or 3 extra uops (that all compete for the same ALU ports; the port 6 scalar-only ALU port was new in Haswell which has AVX2). – Peter Cordes Feb 22 '22 at 23:10
  • Strict-aliasing UB in your cleanup code: `_mm_cvtsi32_si128( *(const int*)( p1 ) )` is type-punning / dereferencing in pure C, only passing an `int` object to the intrinsic. Same for the others. Use an intrinsic that takes a pointer, like `_mm_load_ss`, or `_mm_loadu_si32` if supported. And for 64-bit, `_mm_loadl_epi64` is a movq load even though it strangely takes a `__m128i*` pointer. – Peter Cordes Feb 22 '22 at 23:15
  • Of course, if you only have 32 or 64 bits left, scalar `popcnt` is definitely better, so use `memcpy` or a GNU C `typedef uint64_t aliasing_u64 __attribute__((aligned(4), may_alias));` – Peter Cordes Feb 22 '22 at 23:16