3

So first I'll just describe the task:

I need to:

  1. Compare two __m128i.
  2. Somehow do a bitwise AND of the result with a certain uint16_t value (probably using _mm_movemask_epi8 first and then just `&`).
  3. Blend the initial values based on the result of that.

So the problem is, as you might've guessed, that blend accepts a __m128i as a mask, while I will be having a uint16_t. So either I need some sort of inverse instruction for _mm_movemask_epi8, or I need to do something else entirely.

Some points: I probably cannot change that uint16_t value to some other type, it's complicated; I'm doing this on SSE4.2, so no AVX; there's a similar question here, How to perform the inverse of _mm256_movemask_epi8 (VPMOVMSKB)?, but it's about AVX, and I'm very inexperienced with this, so I cannot adapt the solution.

PS: I might need to do this for ARM as well; I'd appreciate any suggestions there too.

Andrew S.
  • The shorter critical path would be to convert the `uint16_t` to a mask for `_mm_and_si128`, since that can happen in parallel with comparing the __m128i inputs, and avoids a round-trip. See also [is there an inverse instruction to the movemask instruction in intel avx2?](https://stackoverflow.com/q/36488675) for a list of links, including [Convert 16 bits mask to 16 bytes mask](https://stackoverflow.com/a/67203617) which has an SSSE3 version. – Peter Cordes Jul 07 '22 at 20:29

1 Answer

5

When you apply _mm_movemask_epi8 to the result of a vector comparison, which produces -1 for true and 0 for false, you get a 16-bit integer (assuming SSE's 128-bit vectors) whose nth bit is set exactly when the nth byte of the vector is -1.

The following is the reverse (inverse?) operation.

static inline __m128i bitMaskToByteMask16(int m) {
  // Byte n of sel holds the single bit 1 << (n % 8).
  __m128i sel = _mm_set1_epi64x(0x8040201008040201);
  return _mm_cmpeq_epi8(
    _mm_and_si128(
      // Broadcast the low byte of m to bytes 0-7 and the high byte to bytes 8-15.
      _mm_shuffle_epi8(_mm_cvtsi32_si128(m),
        _mm_set_epi64x(0x0101010101010101, 0)),
      sel),
    sel);  // byte n becomes 0xFF iff bit n of m was set
}

Note that once the integer mask is converted to a vector mask, you might want to do the bitwise operation directly with vector instructions (e.g. _mm_and_si128), without going back and forth between integer ops and vector ops.

xiver77
  • 2,162
  • 1
  • 2
  • 12
  • Thank you. Can you elaborate a bit on the details? For example I am unsure as to why this takes int instead of uint16_t? Also, what does it do exactly and what are those magic values? – Andrew S. Jul 07 '22 at 14:45
  • 1
@AndrewS. `int` is `int32_t` on platforms that support Intel intrinsics. `_mm_cvtsi32_si128` (`movd`) takes an `int`, so an unnecessary zero extension might happen if you pass a `uint16_t`. Also, the high 16 bits of the `int` argument (`m`) are ignored. – xiver77 Jul 07 '22 at 14:53
  • @AndrewS. The Intel intrinsics guide website ([link](https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#)) explains in detail what exactly each of those intrinsics do with certain magic-value inputs. – xiver77 Jul 07 '22 at 14:54
  • The part I don't really understand is that -- since the flag I need to do bitwise and with is `uint16_t` -- can I pass it into this function? Or will the result be really wrong after that? – Andrew S. Jul 07 '22 at 15:11
  • @AndrewS. I'd say yes, but if you're not sure what exactly `bitMaskToByteMask16` does, you can put arbitrary values as input and see what the output is (with a custom print function, for example). – xiver77 Jul 07 '22 at 15:26
  • Okay, I'm halfway through incorporating that in my algo, but the catch I found is -- is there a way to compare unsigned packed integers with SSE instructions? It seems like there isn't and I'm wondering why? – Andrew S. Jul 07 '22 at 16:41
  • @AndrewS. You can compare the *equality* of unsigned integers interpreting them as signed numbers, but if you want, for example, `0xff` (`-1`) to be greater than `0x7f` (`127`), you unfortunately need AVX512. It's not even possible with plain AVX. If you need this kind of operation in your algorithm, I think that's worth a separate question regarding how to do it efficiently with SSE. – xiver77 Jul 07 '22 at 16:52
  • 1
@xiver77 There are fast SSE2 instructions to compute the minimum or maximum of unsigned bytes. The expression `min( a, b ) == a` is equivalent to `a <= b`, which is the inverse of `a > b` – Soonts Jul 10 '22 at 17:23
@AndrewS. Maybe you found this solution yourself, but have a look at Soonts' comment above. I totally missed that part. In intrinsics, there is `_mm_{min|max}_epu{8|16|32}`, which operate on *unsigned* integers. – xiver77 Jul 11 '22 at 11:46
But how do I then invert the result? – Andrew S. Jul 11 '22 at 12:25
  • Also, @xiver77, what do I change/add, if I need this function also for __m128 and for __m128d ? If you could extend your answer -- I'd really appreciate it. – Andrew S. Jul 11 '22 at 13:03
@AndrewS. Sometimes you don’t need to: use `_mm_andnot_si128` instead of `_mm_and_si128`, or flip the order of the arguments of `_mm_blendv_epi8`. Or if you’re feeding these vectors into `_mm_movemask_epi8`, flip the bits afterwards with `^ 0xFFFF`. Or if you really need the flipped vector, use `_mm_xor_si128` with `_mm_set1_epi32( -1 )` as the second argument. – Soonts Jul 11 '22 at 13:13
@AndrewS. To answer your second question, `__m128`, `__m128i`, and `__m128d` are all the same `xmm` registers on hardware. Just use a `cast` intrinsic to match the appropriate type. Those casts don't emit an instruction, but they can consume a few cycles to move data between integer units and float units. – xiver77 Jul 11 '22 at 13:37
That said, if you want to do a bitwise AND, for example, use `pand` (`_mm_and_si128`) for integers and `andps` (`_mm_and_ps`) for floats, so that you don't need a cast, but sometimes a cast is inevitable. – xiver77 Jul 11 '22 at 13:38
@AndrewS. Also, the Intel intrinsics are mostly thin wrappers that each match a certain hardware instruction. It's different from usual libraries, and it assumes you have a basic understanding of x86 hardware. It's not much different from writing directly in assembly. – xiver77 Jul 11 '22 at 13:49