
I have a deceptively innocent-looking function f which is called in a tight loop and is causing a speed bottleneck. Any insights on how I can improve it?

#include <stdint.h>

#define N 48
// N = 47 is also relevant
int f(
const uint16_t * __restrict A,
const uint16_t * __restrict B,
const uint16_t * __restrict C) {
    int E = 0;
    for (int r = 0; r < N; ++r) {
        for (int s = r; s < N; ++s) {
            int pa = __builtin_popcount(A[r] & A[s]);
            int pb = __builtin_popcount(B[r] & B[s]);
            int pc = __builtin_popcount(C[r] & C[s]);
            E += pa*pb*pc;
        }
    }
    return E;
}

What I've tried:

(a) preprocessing the A, B, C arrays in linear time and entering the double loop only with those triplets for which pa*pb*pc would be non-zero. However, since the distribution of bits in A, B, C is uniform, almost nothing is filtered out.

(b) blocking to minimize cache misses

(c) repacking A, B and C into uint64_t values and reworking the popcount so that it processes four 16-bit inputs at once (roughly the idea sketched below)
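
To illustrate (c): the idea is a per-lane SWAR popcount over a packed 64-bit word. This is only a sketch, and the helper name is mine:

#include <stdint.h>

// Popcount of each 16-bit lane of a uint64_t holding four packed words.
// Standard SWAR bit counting; each 16-bit lane ends up holding its own count.
static inline uint64_t popcnt_per_u16(uint64_t v) {
    v = v - ((v >> 1) & 0x5555555555555555ULL);
    v = (v & 0x3333333333333333ULL) + ((v >> 2) & 0x3333333333333333ULL);
    v = (v + (v >> 4)) & 0x0F0F0F0F0F0F0F0FULL;     // per-byte counts
    return (v + (v >> 8)) & 0x00FF00FF00FF00FFULL;  // per-16-bit-lane counts
}

The per-lane counts still have to be extracted and multiplied for every (r, s) pair, which is presumably why this didn't buy anything.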

None of these helped. It seems that the sheer number of iterations (N(N+1)/2 = 1176 for N = 48) is the main issue. Is there something I'm missing here?

EDIT: I can assume that AVX2 is available on the target processor. Relevant compiler options currently include -O3, -mpopcnt, -funroll-loops, -mtune=native and -march=native.

gekefe
  • Do the answers here help https://stackoverflow.com/questions/52161596/why-is-builtin-popcount-slower-than-my-own-bit-counting-function ? Compiling with `-march=native` (if possible)? – EdmCoff Jun 19 '23 at 14:28
  • Your question may be simplified to the following: given three `u16` variables `a`, `b`, `c`, what is the fastest way to calculate `popcnt(a)*popcnt(b)*popcnt(c)`. – wojas Jun 19 '23 at 16:11
  • What processor are you using? The popcount instruction may be slow or fast, depending on the processor. – Mark Adler Jun 19 '23 at 16:20
  • Vectorize the loop. For example, if you have AVX-512 then it can count 512 bits at once. See [Counting 1 bits (population count) on large data using AVX-512 or AVX-2](https://stackoverflow.com/q/50081465/995714). Even AVX2 helps: [Population count in AVX512](https://stackoverflow.com/a/61880898/995714) – phuclv Jun 19 '23 at 16:56
  • Because your N is constant you can unroll your loops trivially. – O. Jones Jun 19 '23 at 22:52
  • `-march=native` on what CPU? Are you tuning specifically for Zen 3 or Ice Lake for example, or does it have to work ok on a range of CPUs? – Peter Cordes Jun 20 '23 at 08:02
  • Is the whole 16-bit width of A, B, and C used? I'm asking as there might be optimisations for SSE/AVX popcount look-up if not. – Simon Goater Jun 20 '23 at 08:14
  • @SimonGoater: yes, all 16 bits are used – gekefe Jun 20 '23 at 13:35
  • What about N = 48. Is that fixed in stone? – Simon Goater Jun 20 '23 at 13:44
  • @SimonGoater: there is another possible value for N actually, 47 – gekefe Jun 20 '23 at 14:03
  • Cache blocking is irrelevant; your total data size is `48 * sizeof(uint16_t) * 3` = 288 bytes. That's four and a half cache lines, totally trivial. – Peter Cordes Jun 20 '23 at 17:49
  • It appears that to get a material performance gain you'll need to rely on specific hardware. AVX512 looks like it has some instructions from which this could benefit though manually vectorizing the function isn't a trivial exercise. The Apple M1 might be another option but Mark didn't provide any benchmarks, so I don't know how it would compare. The M1 is ARM based, so you could have issues if you need x86 like instructions for other stuff. Any SSE or AVX2 solution has a huge handicap compared to a hardware popcount instruction so I'm doubtful there's much if any gain there to be had. – Simon Goater Jun 21 '23 at 14:35

2 Answers


Iterating over the lower triangle instead of the upper triangle, and splitting off the diagonal, sped it up by a factor of 2.7 for me (Clang 14.0.3 with -O3, Apple M1): the reshaped loops let the compiler use vector (NEON) instructions and do some loop unrolling:

int f2(const uint16_t * __restrict a,
       const uint16_t * __restrict b,
       const uint16_t * __restrict c) {
    int e = 0;
    for (int r = 1; r < N; ++r)
        for (int s = 0; s < r; ++s) {   // off-diagonal pairs only (s < r)
            int pa = __builtin_popcount(a[r] & a[s]);
            int pb = __builtin_popcount(b[r] & b[s]);
            int pc = __builtin_popcount(c[r] & c[s]);
            e += pa * pb * pc;
        }
    for (int r = 0; r < N; ++r) {       // diagonal terms: x & x == x, so no AND needed
        int pa = __builtin_popcount(a[r]);
        int pb = __builtin_popcount(b[r]);
        int pc = __builtin_popcount(c[r]);
        e += pa * pb * pc;
    }
    return e;
}

I also tried using a table lookup instead of the popcount instruction (along the lines of the sketch below), but that was slower on the M1 as well as on an Intel i7 (with -march=native). Clang 11 on the i7 did not do significantly better with that version than with the original (only about a 10% improvement).
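
For reference, the lookup-table variant is along these lines (the table and init function here are illustrative, not necessarily exactly what was benchmarked):

#include <stdint.h>

// 64 KiB table: popcount of every possible 16-bit value, filled once at startup.
static uint8_t pop16[1 << 16];

static void init_pop16(void) {
    for (unsigned i = 0; i < (1u << 16); i++)
        pop16[i] = (uint8_t)__builtin_popcount(i);
}

// Each popcount in the loop then becomes a table load, e.g.
//     int pa = pop16[a[r] & a[s]];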

Mark Adler
  • A set of ready-to-use popcount implementations with benchmark is available at http://dalkescientific.com/writings/diary/popcnt.cpp – wojas Jun 19 '23 at 17:01
  • I have a Nehalem Core i5. I tried your function with march=native and its performance is very similar to the OP's - sometimes a bit faster, sometimes a bit slower at about 9k rdtsc 'cycles'. – Simon Goater Jun 19 '23 at 23:20
  • @SimonGoater: Without AVX2, clang auto-vectorizing popcount doesn't have nearly as much to gain. (Also, it probably doesn't pick a very good strategy for horizontal widening multiply-accumulate; https://godbolt.org/z/YGdYTW67j shows it not using any `pmaddubsw` or `pmaddwd`, only `pmullw` + `pmovzxwd` + `paddd`.) Also, `rdtsc` counts reference cycles, not core clock cycles, so unless you disabled turbo, that 9k TSC cycles could be off by a factor of 2 maybe if it's a low-power laptop. – Peter Cordes Jun 20 '23 at 08:13
  • @Mark: There have only been a few x86 CPUs with hardware `popcnt` supported but slow, and none of them were Intel. Intel CPUs with HW support have 3 cycle latency, 1 / clock throughput. Slow popcnt CPUs include AMD Bobcat (first-gen Jaguar) being quite bad (9 uops with 12c latency, 5c throughput), and Bulldozer/Piledriver/Steamroller having 2 cycle throughput for 16 and 32-bit (4c for 64-bit) but still single uop. – Peter Cordes Jun 20 '23 at 08:19
  • Note that `__builtin_popcount` is *not* an intrinsic for the instruction the way `_popcnt64` is from immintrin.h: without `-mpopcnt` or a relevant `-march=`, `__builtin_popcount` still compiles, but to a call to a libgcc helper (SWAR bithack ending with a multiply in the last few years of GCC versions, previously a byte LUT IIRC.) – Peter Cordes Jun 20 '23 at 08:21

The OP didn't originally specify what hardware support could be assumed, and in this case the hardware makes a ton of difference. To support older hardware, here is a function that does the job with SSE2 only, no hardware popcount required. It uses the popcnt8() function I found at Fast counting the number of set bits in __m128i register.

#include <immintrin.h>
#include <stdint.h>

// Returns the popcount of each byte of x (each byte of the result is 0..8).
static inline __m128i popcnt8(__m128i x) {
    const __m128i popcount_mask1 = _mm_set1_epi8(0x77);
    const __m128i popcount_mask2 = _mm_set1_epi8(0x0F);
    __m128i n;
    // Three shift/mask/subtract steps leave each 4-bit nibble holding its own bit count
    // (the 0x77 mask clears bits shifted in across nibble boundaries).
    n = _mm_srli_epi64(x, 1);
    n = _mm_and_si128(popcount_mask1, n);
    x = _mm_sub_epi8(x, n);
    n = _mm_srli_epi64(n, 1);
    n = _mm_and_si128(popcount_mask1, n);
    x = _mm_sub_epi8(x, n);
    n = _mm_srli_epi64(n, 1);
    n = _mm_and_si128(popcount_mask1, n);
    x = _mm_sub_epi8(x, n);
    // Sum the two nibble counts of each byte to get per-byte popcounts.
    x = _mm_add_epi8(x, _mm_srli_epi16(x, 4));
    x = _mm_and_si128(popcount_mask2, x);
    return x;
}

The following function is much slower than the OP's function when hardware popcount is available, but much faster when it isn't. On my Nehalem i5 CPU I measured the OP's function at around 8000-9000 rdtsc 'cycles' with hardware popcount enabled and around 40000 without; this SSE2 (gcc -msse2) function measured about 19000 cycles. It also works unchanged for different values of N.

int f6(
const uint16_t * __restrict A,
const uint16_t * __restrict B,
const uint16_t * __restrict C) {
    int r, s, E = 0;    
    __m128i ABCr, ABCs, ABC[N];
    union {
      __m128i v;
      uint8_t u[16];
    } popcounts;
    // Pack A[r], B[r], C[r] into the low three 16-bit lanes of one vector each.
    for (r = 0; r < N; ++r) {
      ABC[r] = _mm_setr_epi16(A[r], B[r], C[r], 0, 0, 0, 0, 0);
    }
    for (r = 0; r < N; ++r) {
        ABCr = ABC[r];
        // Diagonal term: x & x == x, so popcount the packed vector directly.
        ABCs = popcnt8(ABCr);
        // Add each byte count to its neighbour so that u[1], u[3], u[5] hold
        // the 16-bit popcounts for A, B and C respectively.
        popcounts.v = _mm_bslli_si128(ABCs, 1);
        popcounts.v = _mm_add_epi8(popcounts.v, ABCs);
        E += (popcounts.u[1])*(popcounts.u[3])*(popcounts.u[5]);
        for (s = r+1; s < N; s++) {
            ABCs = ABC[s];
            ABCs = _mm_and_si128(ABCs, ABCr);
            ABCs = popcnt8(ABCs);
            popcounts.v = _mm_bslli_si128(ABCs, 1);
            popcounts.v = _mm_add_epi8(popcounts.v, ABCs);
            E += (popcounts.u[1])*(popcounts.u[3])*(popcounts.u[5]);
        }
    }
    return E;
}

While this function's performance won't impress anyone, I found writing and benchmarking it interesting and thought others might too. Since SSE2 is commonly assumed as a baseline and coding for multiple architectures can be complex, I thought there was value in sharing it. If the OP had asked for widely compatible code assuming no more than SSE2, this could have been a worthwhile improvement.

EDIT:

I made a faster SSE2 version of the function by reordering the calculation. It runs slightly faster than the hardware-popcount-enabled functions, at about 5900 cycles. I know the OP wanted AVX2, but I think this approach is of interest if someone wants to create a manually vectorised version in AVX2 or AVX512.

SSE2 + AVX512:

_mm_popcnt_epi16 is an AVX512 intrinsic (AVX512-BITALG with AVX512VL) which, if substituted in place of X_mm_popcnt_epi16, should give a decent performance gain I think, but I don't have any AVX512-supporting hardware to provide benchmarks (a guarded substitution is sketched after the helper below).

static inline __m128i X_mm_popcnt_epi16(__m128i v) {
// Taken from https://stackoverflow.com/questions/6431692/tweaking-mits-bitcount-algorithm-to-count-words-in-parallel
    v = _mm_add_epi16(_mm_and_si128(v, _mm_set1_epi16(0x5555)), _mm_and_si128(_mm_srli_epi16(v, 1), _mm_set1_epi16(0x5555)));
    v = _mm_add_epi16(_mm_and_si128(v, _mm_set1_epi16(0x3333)), _mm_and_si128(_mm_srli_epi16(v, 2), _mm_set1_epi16(0x3333)));
    v = _mm_add_epi16(_mm_and_si128(v, _mm_set1_epi16(0x0f0f)), _mm_and_si128(_mm_srli_epi16(v, 4), _mm_set1_epi16(0x0f0f)));
    v = _mm_add_epi16(_mm_and_si128(v, _mm_set1_epi16(0x00ff)), _mm_srli_epi16(v, 8));
    return v;
}
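
If the compiler exposes the usual feature-test macros, the substitution could be guarded like this (untested, since I have no AVX512 hardware; the POPCNT16 macro name is only illustrative):

#if defined(__AVX512BITALG__) && defined(__AVX512VL__)
  // Hardware per-16-bit-lane popcount (vpopcntw).
  #define POPCNT16(v) _mm_popcnt_epi16(v)
#else
  // Fall back to the SSE2 emulation above.
  #define POPCNT16(v) X_mm_popcnt_epi16(v)
#endif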

static inline __m128i getBrs(
const uint16_t * __restrict A,
const uint16_t * __restrict B,
const uint16_t * __restrict C, uint32_t r, uint32_t s) {
  uint32_t i;  
  __m128i Evec, popA, popB, popC, temp, temp2, Ar, Br, Cr, As, Bs, Cs; 
  Evec = _mm_setzero_si128();
  /* getBrs computes the same thing as this scalar code for one 8x8 block:
  uint32_t EB = 0;
  uint32_t j;
  for (i = r; i < r+8; i++) {
    for (j = s; j < s+8; j++) {
      EB += __builtin_popcount(A[i] & A[j])*__builtin_popcount(B[i] & B[j])*__builtin_popcount(C[i] & C[j]);
    }
  }
  */
  Ar = _mm_loadu_si128( (const __m128i*)&A[r]);
  Br = _mm_loadu_si128( (const __m128i*)&B[r]);
  Cr = _mm_loadu_si128( (const __m128i*)&C[r]);
  As = _mm_loadu_si128( (const __m128i*)&A[s]);
  Bs = _mm_loadu_si128( (const __m128i*)&B[s]);
  Cs = _mm_loadu_si128( (const __m128i*)&C[s]);
  for (i=0; i<8; i++) {
    // Rotate the s-block vectors by one 16-bit element each iteration, so that over
    // 8 iterations every element of the r block meets every element of the s block.
    As = _mm_bsrli_si128(As, 2);
    As = _mm_insert_epi16(As, A[s], 7);
    Bs = _mm_bsrli_si128(Bs, 2);
    Bs = _mm_insert_epi16(Bs, B[s], 7);
    Cs = _mm_bsrli_si128(Cs, 2);
    Cs = _mm_insert_epi16(Cs, C[s], 7);
    temp = _mm_and_si128(Ar, As);
    popA = X_mm_popcnt_epi16(temp); 
    temp = _mm_and_si128(Br, Bs);
    popB = X_mm_popcnt_epi16(temp); 
    temp = _mm_and_si128(Cr, Cs);
    popC = X_mm_popcnt_epi16(temp); 
    temp = _mm_mullo_epi16(popA, popB);
    temp2 = _mm_mullo_epi16(temp, popC);
    Evec = _mm_add_epi16(Evec, temp2);
    s++;
  }
  return _mm_madd_epi16(Evec, _mm_set1_epi16(1)); // pairwise-add the eight 16-bit sums into four 32-bit sums
}

int f8(
const uint16_t * __restrict A,
const uint16_t * __restrict B,
const uint16_t * __restrict C) {
  uint32_t r, i,j, Ediag = 0, E = 0;
  __m128i Evec, popA, popB, popC, temp, temp2, Ar, Br, Cr; 
  Evec = _mm_setzero_si128();
  union {
    __m128i v;
    uint32_t u32[4];
  } popcounts;
  /*
  for (i = 0; i<N; i++) {
    Ediag += __builtin_popcount(A[i] & A[i])*__builtin_popcount(B[i] & B[i])*__builtin_popcount(C[i] & C[i]);
  }
  */
  // The 8-element strides assume N is a multiple of 8 (true for N = 48; N = 47 would need padding).
  for (r = 0; r < 48; r+=8) {
      Ar = _mm_loadu_si128( (const __m128i*)&A[r]);
      Br = _mm_loadu_si128( (const __m128i*)&B[r]);
      Cr = _mm_loadu_si128( (const __m128i*)&C[r]);
      popA = X_mm_popcnt_epi16(Ar); 
      popB = X_mm_popcnt_epi16(Br); 
      popC = X_mm_popcnt_epi16(Cr); 
      temp = _mm_mullo_epi16(popA, popB);
      temp2 = _mm_mullo_epi16(temp, popC);
      Evec = _mm_add_epi16(Evec, temp2);
  }
  popcounts.v = _mm_madd_epi16(Evec, _mm_set1_epi16(1));
  Ediag = popcounts.u32[0] + popcounts.u32[1] + popcounts.u32[2] + popcounts.u32[3];
  Evec = _mm_setzero_si128();
  for (i = 0; i<N; i+=8) {
    // The diagonal 8x8 block is counted once; each off-diagonal block is counted
    // twice, standing in for both the (i,j) and (j,i) blocks.
    Evec = _mm_add_epi32(Evec, getBrs(A,B,C,i,i));
    for (j = i+8; j<N; j+=8) {
      temp = getBrs(A,B,C,i,j);
      temp = _mm_add_epi32(temp, temp);
      Evec = _mm_add_epi32(Evec, temp);
    }
  }
  popcounts.v = Evec;
  E = popcounts.u32[0] + popcounts.u32[1] + popcounts.u32[2] + popcounts.u32[3];
  return (Ediag + E)/2; // E holds each r != s pair twice and the diagonal once; adding Ediag and halving gives the r <= s sum
}
Simon Goater
  • They did specify AVX2 (implying x86 which hadn't previously been mentioned!), in an edit a couple hours before you posted this. – Peter Cordes Jun 20 '23 at 12:03
  • @PeterCordes Should've refreshed, oh well. I think there's some interesting and relevant information here anyway. I think there's such a big gap in performance between the SSE and hardware popcount functions that I'm not optimistic that AVX2 will beat the scalar function with hardware popcount instructions available. My guess is that AVX512 is the last hope of getting a materially faster function for x86 which seems plausible as it does have _mm512_popcnt_epi16 (vpopcntw). – Simon Goater Jun 20 '23 at 13:30
  • This vectorization strategy with 3 scalar extracts from each vector looks inefficient. Can't you keep things more "vertical", like 6 vectors, one each of A[r] and A[s], and same for B[] and C[]? And use `_mm_mullo_epi16` and `_mm_madd_epi16`, using 3 different vectors? – Peter Cordes Jun 20 '23 at 13:31
  • Also, if you're going to avoid SSSE3 `pshufb` and use an SWAR bithack within SIMD elements, you don't want to break it up into nibbles first. (Unless that SSE2 version is assuming that? It looks less efficient than the usual SWAR bithack ([Count the number of set bits in a 32-bit integer](https://stackoverflow.com/q/109023)) within SIMD elements. Should be shift/sub/AND, AND/shift+AND/add, shift/add/AND with a 3rd mask to get 8-bit groups, so shift-or-shuffle then `paddb`. 64-bit mode gives you enough vector registers to 3 constants, 6 data vectors, and some temps. – Peter Cordes Jun 20 '23 at 13:33
  • Yeah, AVX-512VBMI2 for `_mm256_popcnt_epi16` would be *blazing* fast at this on Zen4 or Ice Lake. The OP's N=48 shorts means we only need 3x 256-bit vectors to load the full array, then I guess loop across them doing 16-bit broadcasts. Since `pa*pb` definitely doesn't overflow 16 bits, we can keep things packed with `vpmullw` (the max value of 0xf * 0xf can't even overflow uint8_t 0xff). Then the final product with a vector of `pc` values using `vpmaddwd`, widening to 32-bit already. Then just `vpaddd` into an accumulator, and after the loop reduce to scalar. – Peter Cordes Jun 20 '23 at 13:40
  • @PeterCordes I originally wrote the function for SSSE3 popcount lookup and used the popcount8 function as a drop-in replacement for SSE2 only. You're right it was inefficient, so I wrote a different function making better use of the 8-bit popcount function and now it's quite a bit faster than the SSSE3 one. Thanks for the feedback. This is why I post stuff like this. – Simon Goater Jun 20 '23 at 15:53
  • You're still only getting 3/8ths of work per instruction that would be possible separate vectors, 8 A[] values, 8 B[] values, and 8 C[] values. And you need 3 byte extracts inside the inner loop since you used a union with `uint8_t u[16]`. 3x `pextrw` or `pextrb` would be bad enough, but without SSE4.1 you only have `pextrw` for efficient word extracts, so it probably has to do `pextrw` and an extra right shift. – Peter Cordes Jun 20 '23 at 17:25
  • I think it would be a lot faster to do unaligned loads to get all-against-all vs. broadcasted `A/B/C[s]`, or 16-bit broadcast loads of `set1_epi16(A[r])`. Or maybe two different A/B/C[s] broadcasts for each aligned A[r], so each loaded vector can get used multiple times. – Peter Cordes Jun 20 '23 at 17:26
  • Your new `X_mm_popcnt_epi16` could be optimized some: note that once the SWAR accumulators are wide enough, you can mask *after* adding. And the first step can use a `sub` instead of `add` to also only mask afterwards. Also, `_mm_srli_epi16(v, 8)` leaves the upper half zeroed, so the `and` is redundant. (Clang will spot at least that last optimization for you.) – Peter Cordes Jun 21 '23 at 19:11
  • `memcpy(&Ar, &A[r], 16);` would more normally be written as `__m128i Ar = _mm_loadu_si128( (const __m128i*)&A[r] );`. The final `__m128i temp2 = _mm_mullo_epi16(temp, popC);` `_mm_add_epi16` in the loop can be `_mm_madd_epi16` / `_mm_add_epi32` so the reduction after the loop is only 4 `uint32_t` instead of 8 `uint16_t`, and there's no risk of overflow of a u16 element. See [Fastest way to do horizontal SSE vector sum (or other reduction)](https://stackoverflow.com/q/6996764) for doing hsums efficiently. – Peter Cordes Jun 21 '23 at 19:14
  • Your code already does a lot of vector ALU work; probably best to use unaligned loads inside the loop instead of `_mm_bsrli_si128` / `_mm_insert_epi16`, especially on Nehalem and later where unaligned loads are fast if they don't span a cache line. Even then, as long as it doesn't split a page, OoO exec can hide the latency. But yes, this final version is basically the strategy I had in mind and should adapt easily to AVX2 by widening everything. You could have `getBrs` return a `__m128i` of 32-bit elements so you don't need to horizontal-sum until the end. – Peter Cordes Jun 21 '23 at 19:19
  • @PeterCordes Thanks again for taking the trouble to look at my code. I took out the superfluous 'and' from X_mm_popcnt_epi16 which I just copied as is. It does make a bit of a difference. I'm now down to about 5900 cycles. The shifts and inserts are needed because it rotates the elements. I can't just copy them over from the original array. There's no possible overflow here 8x15^3 = 30^3 = 27000 but it would be an issue when widening for AVX512. Thanks again. – Simon Goater Jun 21 '23 at 21:43
  • Right, there isn't a problem in your current algorithm, when you inefficiently reduce all the way to scalar many times. You want to do vertical SIMD adds of 32-bit elements (not 16 because that could overflow), with only one reduction from a vector of 32-bit to a single scalar. So `pmaddwd` as part of that does a 16 to 32-bit reduction for free, with no mask constant, and you actually *want* the multiply so it's pure win. In cases where you're reducing 16-bit numbers that aren't products, `_mm_madd_epi16(v, _mm_set1_epi16(1))` is an efficient first step for throughput vs. shuffle/add. – Peter Cordes Jun 22 '23 at 01:45
  • `popcounts.v = _mm_madd_epi16(Evec, _mm_set1_epi16(1))` is a waste of an instruction. Like I've said repeatedly, `temp2 = _mm_madd_epi16(temp, popC)` / `Evec = _mm_add_epi32(Evec, temp2);` would do that step for free, for the same cost as `_mm_mullo_epi16` / `add_epi16`. And you could do `Evec = Ediag` instead of zeroing a separate counter, so you only ever need one horizontal sum. (At least now you've reduced it to 2 hsums, none inside inner loops.) – Peter Cordes Jun 22 '23 at 18:41
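
For anyone who wants to try the AVX2 route discussed in these comments, here is a rough, untested sketch of the broadcast-and-multiply strategy outlined above, using an emulated 16-bit popcount since AVX2 has no vpopcntw. The function names are illustrative, and it computes the full N x N sum and then corrects with the diagonal rather than iterating only the triangle:

#include <immintrin.h>
#include <stdint.h>

// Per-16-bit-lane popcount: SWAR bithack widened to 256 bits (no vpopcntw in AVX2).
static inline __m256i popcnt16_avx2(__m256i v) {
    const __m256i m1 = _mm256_set1_epi16(0x5555);
    const __m256i m2 = _mm256_set1_epi16(0x3333);
    const __m256i m4 = _mm256_set1_epi16(0x0f0f);
    v = _mm256_sub_epi16(v, _mm256_and_si256(_mm256_srli_epi16(v, 1), m1));
    v = _mm256_add_epi16(_mm256_and_si256(v, m2),
                         _mm256_and_si256(_mm256_srli_epi16(v, 2), m2));
    v = _mm256_and_si256(_mm256_add_epi16(v, _mm256_srli_epi16(v, 4)), m4);
    return _mm256_add_epi16(_mm256_and_si256(v, _mm256_set1_epi16(0x00ff)),
                            _mm256_srli_epi16(v, 8));
}

int f_avx2_sketch(const uint16_t * __restrict A,
                  const uint16_t * __restrict B,
                  const uint16_t * __restrict C) {
    // Zero-padded copies so 48 (or 47) elements fill exactly three 16-lane vectors.
    uint16_t a[48] = {0}, b[48] = {0}, c[48] = {0};
    int diag = 0;
    for (int i = 0; i < N; i++) {
        a[i] = A[i]; b[i] = B[i]; c[i] = C[i];
        diag += __builtin_popcount(A[i]) * __builtin_popcount(B[i])
              * __builtin_popcount(C[i]);
    }
    __m256i acc = _mm256_setzero_si256();   // eight 32-bit accumulators
    for (int r = 0; r < N; r++) {
        __m256i ar = _mm256_set1_epi16((int16_t)A[r]);
        __m256i br = _mm256_set1_epi16((int16_t)B[r]);
        __m256i cr = _mm256_set1_epi16((int16_t)C[r]);
        for (int s = 0; s < 48; s += 16) {  // 48 = padded buffer size
            __m256i pa = popcnt16_avx2(_mm256_and_si256(ar,
                             _mm256_loadu_si256((const __m256i *)&a[s])));
            __m256i pb = popcnt16_avx2(_mm256_and_si256(br,
                             _mm256_loadu_si256((const __m256i *)&b[s])));
            __m256i pc = popcnt16_avx2(_mm256_and_si256(cr,
                             _mm256_loadu_si256((const __m256i *)&c[s])));
            // pa*pb <= 256 fits in 16 bits; vpmaddwd multiplies by pc and sums
            // adjacent pairs into 32-bit lanes.
            acc = _mm256_add_epi32(acc,
                      _mm256_madd_epi16(_mm256_mullo_epi16(pa, pb), pc));
        }
    }
    // Horizontal sum of the eight 32-bit lanes.
    __m128i s4 = _mm_add_epi32(_mm256_castsi256_si128(acc),
                               _mm256_extracti128_si256(acc, 1));
    s4 = _mm_add_epi32(s4, _mm_shuffle_epi32(s4, _MM_SHUFFLE(1, 0, 3, 2)));
    s4 = _mm_add_epi32(s4, _mm_shuffle_epi32(s4, _MM_SHUFFLE(2, 3, 0, 1)));
    int full = _mm_cvtsi128_si32(s4);
    // full counts every (r, s) pair in both orders plus the diagonal once.
    return (full + diag) / 2;
}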