I recently discovered that AVX2 has no popcount instruction for __m256i, and the only way I have found to do something similar is to follow Wojciech Mula's algorithm:
#include <immintrin.h>

__m256i count(__m256i v) {
    // Popcount of every 4-bit value, repeated for both 128-bit lanes.
    __m256i lookup = _mm256_setr_epi8(0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4,
                                      0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4);
    __m256i low_mask = _mm256_set1_epi8(0x0f);
    __m256i lo = _mm256_and_si256(v, low_mask);
    __m256i hi = _mm256_and_si256(_mm256_srli_epi32(v, 4), low_mask);
    __m256i popcnt1 = _mm256_shuffle_epi8(lookup, lo);
    __m256i popcnt2 = _mm256_shuffle_epi8(lookup, hi);
    __m256i total = _mm256_add_epi8(popcnt1, popcnt2);  // per-byte popcount
    // Sums each group of 8 bytes into one 64-bit element.
    return _mm256_sad_epu8(total, _mm256_setzero_si256());
}
The problem is that it returns the sum of 8 per-byte counts in each 64-bit element, instead of the sum of 4 per-byte counts in each 32-bit element.
What's currently happening:
I have a __m256i x which contains these eight 32-bit ints:
- 01101011111000011100000000000000
- 01110101011010010111100000000000
- 10100100011011000101010000000000
- 11101010100001001111000000000000
- 10010011111111001001010000000000
- 00011110101100101000000000000000
- 00011101011000111011000000000000
- 10011011100010100000110000000000
__m256i res = count(x);
res contains:
- 24
- 21
- 22
- 21
The result is four 64-bit integers.
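To make the grouping explicit, here is a small scalar model in plain C (no intrinsics). It is only an illustration: the names popcount8 and grouped_popcount are made up for this sketch. With group = 8 it models what _mm256_sad_epu8 against zero computes and should reproduce the 24, 21, 22, 21 above; with group = 4 it gives the per-32-bit grouping that is wanted instead.

```c
#include <assert.h>
#include <stdint.h>

/* Portable per-byte popcount; stands in for the nibble-lookup step. */
static unsigned popcount8(uint8_t b) {
    unsigned n = 0;
    while (b) {
        n += b & 1u;
        b >>= 1;
    }
    return n;
}

/* Sum the per-byte popcounts of a 256-bit value over groups of `group` bytes.
   group = 8: one sum per 64-bit element (the behaviour of the SAD step).
   group = 4: one sum per 32-bit element (the desired behaviour).
   `out` must have room for 32 / group results. */
static void grouped_popcount(const uint32_t in[8], unsigned group, unsigned out[]) {
    assert(group == 4 || group == 8);
    const uint8_t *bytes = (const uint8_t *)in;
    for (unsigned i = 0; i < 32 / group; ++i) {
        out[i] = 0;
        for (unsigned j = 0; j < group; ++j)
            out[i] += popcount8(bytes[i * group + j]);
    }
}
```

Because every group is aligned to a whole 32-bit or 64-bit element, the sums do not depend on byte order within an element.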
Expectation:
I have a __m256i x which contains these eight 32-bit ints:
- 01101011111000011100000000000000
- 01110101011010010111100000000000
- 10100100011011000101010000000000
- 11101010100001001111000000000000
- 10010011111111001001010000000000
- 00011110101100101000000000000000
- 00011101011000111011000000000000
- 10011011100010100000110000000000
__m256i res = count(x);
res contains:
- 11
- 13
- 10
- 11
- 13
- 9
- 11
- 10
The result is eight 32-bit integers.
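One possible direction, offered as a sketch rather than a verified answer: keep the per-byte counts and replace _mm256_sad_epu8 with two multiply-add steps against vectors of ones, _mm256_maddubs_epi16 (adjacent bytes into 16-bit sums) followed by _mm256_madd_epi16 (adjacent 16-bit sums into 32-bit sums). The names AVX2_TARGET, popcount_epi32 and popcount_epi32_array are hypothetical, and the target attribute assumes GCC or Clang; with other compilers AVX2 has to be enabled through the compiler's own options.

```c
#include <assert.h>
#include <immintrin.h>
#include <stdint.h>

/* Assumption: GCC/Clang, where this attribute enables AVX2 per function. */
#if defined(__GNUC__)
#define AVX2_TARGET __attribute__((target("avx2")))
#else
#define AVX2_TARGET
#endif

/* Per-32-bit popcount built on the same nibble lookup as count(). */
AVX2_TARGET static inline __m256i popcount_epi32(__m256i v) {
    const __m256i lookup = _mm256_setr_epi8(
        0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4,
        0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4);
    const __m256i low_mask = _mm256_set1_epi8(0x0f);
    __m256i lo = _mm256_and_si256(v, low_mask);
    __m256i hi = _mm256_and_si256(_mm256_srli_epi32(v, 4), low_mask);
    __m256i per_byte = _mm256_add_epi8(_mm256_shuffle_epi8(lookup, lo),
                                       _mm256_shuffle_epi8(lookup, hi));
    /* Adjacent bytes -> 16-bit sums. Each byte is at most 8, so the
       saturating add inside maddubs can never saturate here. */
    __m256i per_word = _mm256_maddubs_epi16(per_byte, _mm256_set1_epi8(1));
    /* Adjacent 16-bit sums -> 32-bit sums. */
    return _mm256_madd_epi16(per_word, _mm256_set1_epi16(1));
}

/* Convenience wrapper: 8 input ints in, 8 bit counts out (unaligned access). */
AVX2_TARGET void popcount_epi32_array(const uint32_t in[8], uint32_t out[8]) {
    assert(in && out);
    __m256i v = _mm256_loadu_si256((const __m256i *)in);
    _mm256_storeu_si256((__m256i *)out, popcount_epi32(v));
}
```

Calling popcount_epi32_array on the eight inputs above should fill out with one bit count per 32-bit int, in the same order as the inputs. The code needs a CPU with AVX2 at run time.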
I hope that is clear; don't hesitate to ask me for more details.
Thanks.