5

I would like to take the result of an 8-bit vertical SIMD comparison between 256-bit vectors and pack the bits into the lowest byte of each 32-bit element for a vpshufb lookup on the lowest bytes. This isn't terribly difficult with AVX-512 (replace the & with a masked move if using 512-bit vectors):

__m256i cmp_8_into_32(__m256i a, __m256i b) {
    /* each masked byte contributes a distinct popcount (1, 2, 4, 8),
       so vpopcntd leaves a 0..15 index in every 32-bit element */
    return _mm256_popcnt_epi32(_mm256_cmpeq_epi8(a, b)
        & _mm256_set1_epi32(0xff0f0301 /* can be any order */));
}

That's three uops and, assuming perfect scheduling, a throughput of 1 according to uops.info, which is not bad. Alas, vpopcntd isn't in AVX2. What's the optimal way to do this operation there? The best I can think of is to mask the pairs of bits at indices 7,8 and 23,24, then perform two vpsrld with constant shift counts and a vpor. So that's 6 uops with a throughput of roughly 2.5. Not bad, but I wonder if there's something better.
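
Spelled out, the mask/shift/OR fallback I have in mind is roughly the following (an untested sketch; the function name and the exact mask placement are just one choice):

__m256i cmp_8_into_32_shifts(__m256i a, __m256i b) {
    __m256i cmp = _mm256_cmpeq_epi8(a, b);
    /* keep the bit pairs straddling the byte0/byte1 and byte2/byte3 boundaries */
    __m256i lo = _mm256_and_si256(cmp, _mm256_set1_epi32(0x00000180)); /* bits 7,8   */
    __m256i hi = _mm256_and_si256(cmp, _mm256_set1_epi32(0x01800000)); /* bits 23,24 */
    /* move them to bits 0,1 and 2,3 of each element, then combine */
    return _mm256_or_si256(_mm256_srli_epi32(lo, 7), _mm256_srli_epi32(hi, 21));
}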

Ovinus Real
  • I don't fully understand what you want to do here. Is it equivalent to doing a 4-byte horizontal sum on `_mm256_cmpeq_epi8(a, b) & _mm256_set1_epi32(0x08040201 /* can be any order */)`? – chtz Oct 20 '22 at 06:33
  • @chtz Yes, exactly. – Ovinus Real Oct 20 '22 at 07:39
  • `pmaddubsw` + `pmaddwd` can be helpful for horizontal sums within chunks, trading latency for fewer uops (as long as you don't bottleneck on that port). (With 8,4,2,1 directly since you're not feeding popcnt). Or possibly unpack odd/even dwords within qwords for `psadbw` against zero, then recombine with shift/blend. – Peter Cordes Oct 20 '22 at 07:58 (the `psadbw` variant is sketched below, after these comments)
  • @PeterCordes Yes, I realized just now. Thanks! Do you know the original purpose of those instructions, btw? – Ovinus Real Oct 20 '22 at 07:59
  • The original purpose? Probably as part of integer dot-product type of things for `pmaddwd`, summing into 32-bit accumulators (which you'd then add vertically with `paddd`). I assume that comes up in various DSP problems. As for `pmaddubsw`, it was only introduced later in SSSE3, and I assume there was a specific use-case for it, given the special combination of signed and unsigned operands. – Peter Cordes Oct 21 '22 at 00:09
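
For reference, the psadbw idea from Peter Cordes's comment above could be arranged something like this (a speculative sketch with my own choice of blend/shift to separate even and odd dwords, not code from the thread; it comes to 8 instructions, so the answers below are cheaper):

__m256i cmp_8_into_32_psadbw(__m256i a, __m256i b) {
    const __m256i zero = _mm256_setzero_si256();
    __m256i masked = _mm256_and_si256(_mm256_cmpeq_epi8(a, b),
                                      _mm256_set1_epi32(0x08040201));
    /* split even and odd dwords so each vpsadbw sums only one element's bytes */
    __m256i even = _mm256_blend_epi32(masked, zero, 0xAA); /* keep dwords 0,2,4,6 */
    __m256i odd  = _mm256_srli_epi64(masked, 32);          /* dwords 1,3,5,7 moved down */
    __m256i sum_even = _mm256_sad_epu8(even, zero); /* 0..15 in the low dword of each qword */
    __m256i sum_odd  = _mm256_sad_epu8(odd,  zero);
    /* put the odd sums back into the odd dword positions and recombine */
    return _mm256_or_si256(sum_even, _mm256_slli_epi64(sum_odd, 32));
}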

2 Answers

5

Following chtz's comment (thanks!), I realize it's actually fairly easy:

__m256i cmp_8_into_32_1(__m256i a, __m256i b) {
    const __m256i weights = _mm256_set1_epi32(0x08040201);
    const __m256i all_n1 = _mm256_set1_epi16(-0x1);

    __m256i cmp = _mm256_cmpeq_epi8(a, b);   /* 0 or -1 per byte */
    /* vpmaddubsw: unsigned weights (1,2,4,8) times signed 0/-1, so each 16-bit
       half holds minus the weighted sum of its two bytes (no saturation possible) */
    __m256i hsum16 = _mm256_maddubs_epi16(weights, cmp);

    /* vpmaddwd by -1 negates and adds the two halves: 0..15 per 32-bit element */
    return _mm256_madd_epi16(hsum16, all_n1);
}

Peter Cordes's suggestion saved an additional vpand. The two multiply–add instructions both run on either port 0 or 1, so this has the same throughput as the original popcount-based solution, although with a latency of about 11 instead of 5.
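
As a sanity check, each 32-bit element of the result should match this scalar reference (illustration only, not part of the SIMD code; cmp_8_into_32_ref is a made-up helper name):

/* scalar model of one 32-bit element: bit i of the result is set
   iff byte i of a equals byte i of b, giving a value in 0..15 */
static unsigned cmp_8_into_32_ref(unsigned a, unsigned b) {
    unsigned r = 0;
    for (int i = 0; i < 4; ++i)
        if (((a >> (8 * i)) & 0xff) == ((b >> (8 * i)) & 0xff))
            r |= 1u << i;
    return r;
}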

Ovinus Real
  • You might be able to save the `_mm256_and_si256` (which you wrote as `&`, so it won't be portable to MSVC). The compare result is signed integer `-1` or `0`, so use it as the signed operand to `pmaddwd` with `-8, -4, -2, -1`, i.e. `_mm256_set1_epi32(0xf8fcfeff)` if I have that right. – Peter Cordes Oct 20 '22 at 08:03
  • You don't need to cram everything into one `return` statement; use some temporaries so you can keep the constants closer together that work more directly with each other. This would be pretty unreasonable if I hadn't happened to comment the same idea (within seconds of you posting this answer :P) Assembly language is one line per instruction, and I find that's good style for intrinsics, other than stuff like `_mm_set1` constants as operands, or sometimes a load or cast. (So it's one intrinsic per asm instruction you expect the compiler to emit.) – Peter Cordes Oct 20 '22 at 08:05
  • Sure. I kind of typed this up in a rush (actually hadn't tested it, but it does appear to work). Thanks for the tips and the idea for saving the `vpand`. – Ovinus Real Oct 20 '22 at 08:12
  • Oh, good idea to use `set1(-1)` as the `pmaddwd` multiplier, instead of inverting the weights. I think that still works mathematically and without saturating. set1(-1) is also a cheaper constant, the compiler can make it with `vpcmpeqd same,same` instead of a load. – Peter Cordes Oct 20 '22 at 08:15
3

Uses 1 multiply:

__m256i cmp_8_into_32(__m256i a, __m256i b) {
    __m256i cmp = _mm256_cmpeq_epi8(a, b);
    /* bytes become their weight (1, 2, 4, 8) where the compare matched, else 0 */
    __m256i msk = _mm256_and_si256(cmp, _mm256_set1_epi32(0x08040201));
    /* vpmaddwd by 0x0101 multiplies each 16-bit half by 257 (x + (x << 8)) and adds
       the pair, which collects the element's byte sum (0..15) into bits 8..15 */
    __m256i hsum = _mm256_madd_epi16(msk, _mm256_set1_epi8(1));
    /* per-word shift brings that sum down to the low byte of each 32-bit element */
    return _mm256_srli_epi16(hsum, 8);
}

A 32-bit multiply (_mm256_mullo_epi32) is avoided because it is slow (vpmulld is 2 uops on most CPUs).
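
To see why one madd plus a shift is enough, here is the arithmetic for a single 32-bit element written out in scalar form (my own illustration; lane_model and the m0..m3 parameters are made up for this sketch):

/* m0..m3 are the masked bytes of one element: each is its weight (1,2,4,8) or 0 */
static unsigned lane_model(unsigned m0, unsigned m1, unsigned m2, unsigned m3) {
    unsigned w0 = m0 + 256u * m1;              /* low 16-bit word of the element  */
    unsigned w1 = m2 + 256u * m3;              /* high 16-bit word of the element */
    /* vpmaddwd against set1_epi8(1) multiplies each word by 0x0101 (= x + (x << 8))
       and adds the pair: hsum = (m0+m2) + 256*(m0+m1+m2+m3) + 65536*(m1+m3) */
    unsigned hsum = 257u * w0 + 257u * w1;
    /* m0+m2 <= 5 and the byte sum is at most 15, so nothing carries between byte
       positions; the per-word shift right by 8 leaves the byte sum in the low byte */
    return (hsum & 0xffffu) >> 8;              /* = m0 + m1 + m2 + m3, i.e. 0..15 */
}

The high 16-bit word of hsum holds only m1+m3, which the same shift clears, so the whole element ends up equal to the 0..15 index.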


If the results are not needed "in-lane", then one could use a _mm256_packs_epi16 immediately after the comparison to process twice as much data at once. If you don't need all of the possible states (say we don't care about lowest-byte matches), then you could do 4x as much per instruction. If the results from the vpshufb lookup are getting gathered together, then there may be other possible optimizations...

aqrit
  • This is clever! And slightly better, so I shall accept it. – Ovinus Real Oct 20 '22 at 21:22
  • Oh nice, I'd been thinking about how to maybe use a 32-bit multiply + shift to sum up the bytes (like the `* 0x01010101` in [Count the number of set bits in a 32-bit integer](https://stackoverflow.com/q/109023), or the bit-movement magic multipliers in [How to create a byte out of 8 bool values (and vice versa)?](https://stackoverflow.com/q/8461126)), but `vpmulld` alone is 2 uops on most CPUs so that wouldn't have been a win. Well spotted that with these constants we can do it in 16-bit halves that get added with `vpmaddwd`, even avoiding carry-out from the low 16. – Peter Cordes Oct 21 '22 at 00:05