

I have recently discovered that AVX2 doesn't have a popcount for __m256i, and the only way I found to do something similar is to follow Wojciech Muła's algorithm:

__m256i count(__m256i v) {
    __m256i lookup = _mm256_setr_epi8(0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4,
                                      0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4);
    __m256i low_mask = _mm256_set1_epi8(0x0f);
    __m256i lo = _mm256_and_si256(v, low_mask);
    __m256i hi = _mm256_and_si256(_mm256_srli_epi32(v, 4), low_mask);
    __m256i popcnt1 = _mm256_shuffle_epi8(lookup, lo);   // popcount of each low nibble
    __m256i popcnt2 = _mm256_shuffle_epi8(lookup, hi);   // popcount of each high nibble
    __m256i total = _mm256_add_epi8(popcnt1, popcnt2);   // per-byte popcounts

    // sums the bytes within each 64-bit chunk
    return _mm256_sad_epu8(total, _mm256_setzero_si256());
}

Wojciech Muła, Nathan Kurz, Daniel Lemire, Faster Population Counts Using AVX2 Instructions, Computer Journal 61 (1), 2018

The problem is that it returns four 64-bit sums (one per 64-bit chunk) instead of eight 32-bit sums (one per 32-bit element).

What's currently happening:
I have an __m256i x which contains these eight 32-bit integers:

  1. 01101011111000011100000000000000
  2. 01110101011010010111100000000000
  3. 10100100011011000101010000000000
  4. 11101010100001001111000000000000
  5. 10010011111111001001010000000000
  6. 00011110101100101000000000000000
  7. 00011101011000111011000000000000
  8. 10011011100010100000110000000000

__m256i res = count(x);

res contains:

  1. 24
  2. 21
  3. 22
  4. 21

The result is four 64-bit integers (one sum per 64-bit chunk).

Expectation:

I have an __m256i x which contains these eight 32-bit integers:

  1. 01101011111000011100000000000000
  2. 01110101011010010111100000000000
  3. 10100100011011000101010000000000
  4. 11101010100001001111000000000000
  5. 10010011111111001001010000000000
  6. 00011110101100101000000000000000
  7. 00011101011000111011000000000000
  8. 10011011100010100000110000000000

__m256i res = count(x);

res contains:

  1. 11
  2. 13
  3. 10
  4. 11
  5. 13
  6. 9
  7. 11
  8. 10

The result should be eight 32-bit integers (one popcount per element).
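
For reference, here is a plain scalar version of what I'm after; a sketch using GCC/Clang's __builtin_popcount, just to pin down the expected behavior:

#include <stdint.h>

void popcount32_scalar(const uint32_t in[8], uint32_t out[8]) {
    for (int i = 0; i < 8; i++)
        out[i] = (uint32_t)__builtin_popcount(in[i]);  // one count per 32-bit element
}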

I hope this is clear; don't hesitate to ask me for more details.

Thanks.


1 Answer


AVX-512 VPOPCNTDQ has _mm256_popcnt_epi32 to popcount in 32-bit chunks, as well as a 64-bit chunk size version. Outside of Xeon Phi, it's new in Ice Lake, which also introduced AVX512BITALG with byte and word (16-bit) chunk sizes of vpopcnt.
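
If you can target those extensions, the whole thing is a single intrinsic. A minimal sketch, assuming a compiler with AVX512VPOPCNTDQ+AVX512VL enabled (e.g. gcc/clang with -mavx512vpopcntdq -mavx512vl):

#include <immintrin.h>

__m256i popcount_avx512(__m256i v) {
    return _mm256_popcnt_epi32(v);  // vpopcntd: per-32-bit-element popcount
}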


With AVX2

The original code you are quoting relies on the _mm256_sad_epu8 intrinsic, which specifically sums bytes within 64-bit chunks.

To get the same result, but with sums within 32-bit chunks, you need to do something slightly different. The following should work:

__m256i popcount_pshufb32(__m256i v) {
  __m256i lookup = _mm256_setr_epi8(0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4,
                                    0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4);
  __m256i low_mask = _mm256_set1_epi8(0x0f);
  __m256i lo = _mm256_and_si256(v, low_mask);
  __m256i hi = _mm256_and_si256(_mm256_srli_epi16(v, 4), low_mask);
  __m256i popcnt1 = _mm256_shuffle_epi8(lookup, lo);
  __m256i popcnt2 = _mm256_shuffle_epi8(lookup, hi);
  __m256i sum8 = _mm256_add_epi8(popcnt1, popcnt2);
  // multiplying by 0x01010101 sums the 4 bytes of each dword into its top byte
  return _mm256_srli_epi32(
      _mm256_mullo_epi32(sum8, _mm256_set1_epi32(0x01010101)), 24);
      // vpmulld is slowish (2 uops) on most recent Intel CPUs
      // but still single-uop on AMD
}

So we replaced _mm256_sad_epu8 with a multiplication and a shift, which should be reasonable. In my tests, it is slightly slower than the original 64-bit version, but the difference is relatively small.
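
To see why the multiply works, here is a scalar sketch of the same trick (an illustration with made-up input, not part of the vector code): multiplying by 0x01010101 adds the four bytes of a dword into its top byte, which is safe here because each per-dword popcount is at most 32 and can't carry.

#include <stdint.h>
#include <stdio.h>

int main(void) {
    uint32_t w = 0x04030201;                 // bytes 1, 2, 3, 4
    // w * 0x01010101 = w + (w<<8) + (w<<16) + (w<<24),
    // so the top byte accumulates b0+b1+b2+b3 (no carries for small bytes)
    uint32_t sum = (w * 0x01010101u) >> 24;
    printf("%u\n", sum);                     // prints 10
    return 0;
}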

You can get slightly better performance on Intel, at the cost of one more vector constant, by using two different instructions to accumulate from bytes to 32-bit chunks. AMD Zen 1/2/3 is at least as efficient with the version above as with the one below.

32-bit SIMD-integer multiply is 2 uops on recent Intel CPUs (both needing the SIMD-integer-multiply units), but the pairwise multiply-accumulate instructions (8->16 and 16->32) are a single uop each (https://uops.info/). This requires one more constant, but the same number of instructions, for fewer uops, especially if the compiler can reuse the constants in a loop.

// Same idea, but accumulating bytes -> 32-bit chunks with pmaddubsw + pmaddwd
__m256i popcount_pshufb32_madd(__m256i v) {
  __m256i lookup = _mm256_setr_epi8(0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4,
                                    0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4);
  __m256i low_mask = _mm256_set1_epi8(0x0f);
  __m256i lo = _mm256_and_si256(v, low_mask);
  __m256i hi = _mm256_and_si256(_mm256_srli_epi16(v, 4), low_mask);
  __m256i popcnt1 = _mm256_shuffle_epi8(lookup, lo);
  __m256i popcnt2 = _mm256_shuffle_epi8(lookup, hi);
  __m256i sum8 = _mm256_add_epi8(popcnt1, popcnt2);
  // vpmaddubsw sums byte pairs into 16-bit chunks, vpmaddwd sums those into 32-bit chunks
  return _mm256_madd_epi16(_mm256_maddubs_epi16(sum8, _mm256_set1_epi8(1)),
                           _mm256_set1_epi16(1));
}
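
For completeness, a hypothetical usage sketch (the driver code is mine, not part of either version above) that feeds in the question's eight values and reads back one count per element:

#include <immintrin.h>
#include <stdint.h>
#include <stdio.h>

int main(void) {
    uint32_t in[8] = {0x6BE1C000u, 0x75697800u, 0xA46C5400u, 0xEA84F000u,
                      0x93FC9400u, 0x1EB28000u, 0x1D63B000u, 0x9B8A0C00u};
    uint32_t out[8];
    __m256i v = _mm256_loadu_si256((const __m256i *)in);
    __m256i counts = popcount_pshufb32(v);        // or the madd variant above
    _mm256_storeu_si256((__m256i *)out, counts);
    for (int i = 0; i < 8; i++)
        printf("%u ", out[i]);                    // 11 13 10 11 13 9 11 10
    return 0;
}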
  • Are you sure that `_mm256_mul_epi32` is going to work? It neglects bits 32-63. I would expect a `_mm256_mullo_epi32` instead of `_mm256_mul_epi32`. Note that `_mm256_mullo_epi32` is quite slow on Intel. A horizontal sum with shift-add-shift-add-mask_0xFF might be as fast. – wim Jun 29 '18 at 17:50
  • After reading Agner Fog's instruction tables more carefully, I think the multiplication idea is probably faster than repeated shifting and adding, at least for Intel, maybe not for AMD Ryzen. – wim Jun 29 '18 at 18:07
  • @wim: `vpmulld` is 2 dependent uops on Intel AVX2 CPUs, 10c latency. Better to go from 8->16->32 with `pmaddubsw` and `pmaddwd`, with multipliers of `_mm256_set1_epi8(1)` and `_mm256_set1_epi16(1)`. Same number of multiply uops, but you don't need the `srli` – Peter Cordes Jun 29 '18 at 18:40
  • @PeterCordes: That is a nice solution for the horizontal sum. I wasn't aware of this particular series of instructions. – wim Jun 29 '18 at 19:51
  • @wim: credit to @PaulR, I think I picked that up from one of his answers. (But normally `psadbw` is better if you want to go all the way to a 64-bit horizontal sum, unless you're doing something like [How to implement atoi using SIMD?](https://stackoverflow.com/q/35127060) where you scale different elements by different amounts.) – Peter Cordes Jun 29 '18 at 20:07
  • @DanielLemire @wim Thanks a lot, both of you! It works well with `_mm256_mullo_epi32`! – yatsukino Jul 02 '18 at 07:52
  • I wonder if there's anything to gain from masking the pshufb results (or `vpblendd` with zeros) and doing `psadbw` twice, once for the odd dwords, once for the even dwords. But then you'd need to shuffle those results together, like `vpsllq`/`vpblendd`, unless a single shuffle can do it? `vshufps` can grab the dwords you want, but not put them in the desired order. – Peter Cordes Apr 29 '22 at 02:03