#include <immintrin.h>

static const unsigned char LUT[16] = { 0xE4, 0x24, 0x34, 0x04, 
                                       0x38, 0x08, 0x0C, 0x00, 
                                       0x39, 0x09, 0x0D, 0x01, 
                                       0x0E, 0x02, 0x03, 0x00 };

int main( ) {
    float input[4] = { -1.0f, 2.0f, 3.0f, -4.0f };
    float output[4] = {0};

    __m128 data = _mm_loadu_ps( input );
    __m128 mmask = _mm_cmpge_ps( data, _mm_setzero_ps( ) );
    int shufctr = _mm_movemask_ps( mmask );

    __m128 res = _mm_shuffle_ps( data, data, LUT[shufctr] );
    _mm_storeu_ps( output, res );
}

I want to use code like the above to left-pack the floats that pass the comparison into another array, but it fails to compile with the error 'the last argument must be an 8-bit immediate'. How can I achieve this?

Peter Cordes
Timothy s
  • See [AVX2 what is the most efficient way to pack left based on a mask?](https://stackoverflow.com/q/36932240) (which mentions the SSSE3 way to start with, which is this with `_mm_shuffle_epi8`, which doesn't efficiently extend to more elements per vector). – Peter Cordes Aug 27 '21 at 16:22
  • "immediate" means a constant embedded in the machine code. The only way you could literally make a lookup table for it would be with self-modifying machine code, which would cause a "pipeline nuke" (`machine_clears.smc`) every time the function ran, making it slower than scalar. – Peter Cordes Aug 27 '21 at 16:24
  • Thanks for the clarification, Peter. I'll look into doing it the SSSE3 way. – Timothy s Aug 27 '21 at 17:26

1 Answer


The function _mm_shuffle_ps() requires an unsigned 8-bit immediate as its third argument, which means the third argument must be an integer constant known at compile time:

__m128 res = _mm_shuffle_ps(data, data, LUT[shufctr]); // WRONG
__m128 res = _mm_shuffle_ps(data, data, foo()); // WRONG
__m128 res = _mm_shuffle_ps(data, data, bar); // WRONG
__m128 res = _mm_shuffle_ps(data, data, 250); // CORRECT

One possible (but not great) workaround is to branch on the mask so that every call passes a literal constant:

...
int shufctr = _mm_movemask_ps(mmask);
__m128 res;

if (shufctr == 0) {
  res = _mm_shuffle_ps(data, data, 0xE4); // LUT[0] == 0xE4
}
else if (...) {
  ...
}
...
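
For reference, here is a minimal sketch of what the complete dispatch might look like, written as a `switch` so that every `_mm_shuffle_ps` call receives a literal constant. The function name `left_pack_switch` is hypothetical. The shuffle constants are derived for this sketch under the assumption that bit i of the mask being set means lane i is kept; they are not copied from the question's LUT. Lanes past the kept count are left unspecified.

```c
#include <assert.h>
#include <immintrin.h>

/* Hypothetical helper: left-packs the lanes of `data` selected by `mask`
 * (bit i set = keep lane i, as returned by _mm_movemask_ps).
 * Every shuffle below uses a compile-time constant, so it compiles.
 * Lanes after the packed ones hold unspecified leftovers. */
static __m128 left_pack_switch(__m128 data, int mask)
{
    assert(mask >= 0 && mask < 16);
    switch (mask) {
    /* _MM_SHUFFLE(d, c, b, a): result lane 0 = a, lane 1 = b, ... */
    case 0x2: return _mm_shuffle_ps(data, data, _MM_SHUFFLE(3, 2, 1, 1));
    case 0x4: return _mm_shuffle_ps(data, data, _MM_SHUFFLE(3, 2, 1, 2));
    case 0x5: return _mm_shuffle_ps(data, data, _MM_SHUFFLE(3, 2, 2, 0));
    case 0x6: return _mm_shuffle_ps(data, data, _MM_SHUFFLE(3, 2, 2, 1));
    case 0x8: return _mm_shuffle_ps(data, data, _MM_SHUFFLE(3, 2, 1, 3));
    case 0x9: return _mm_shuffle_ps(data, data, _MM_SHUFFLE(3, 2, 3, 0));
    case 0xA: return _mm_shuffle_ps(data, data, _MM_SHUFFLE(3, 2, 3, 1));
    case 0xB: return _mm_shuffle_ps(data, data, _MM_SHUFFLE(3, 3, 1, 0));
    case 0xC: return _mm_shuffle_ps(data, data, _MM_SHUFFLE(3, 2, 3, 2));
    case 0xD: return _mm_shuffle_ps(data, data, _MM_SHUFFLE(3, 3, 2, 0));
    case 0xE: return _mm_shuffle_ps(data, data, _MM_SHUFFLE(3, 3, 2, 1));
    default:
        /* Masks 0x0, 0x1, 0x3, 0x7 and 0xF are already left-packed. */
        return data;
    }
}
```

With the question's input, `_mm_movemask_ps` yields 6 (lanes 1 and 2 pass), so the first two result lanes would be 2.0f and 3.0f. The branch depends on the data and may mispredict often, which is part of why this approach is not great.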

EDIT (adding information given by user Peter Cordes in a comment):

You may also take a look at SSSE3 pshufb or AVX1 vpermilps. Both of these instructions use a shuffle-control vector (runtime variable) rather than an immediate constant that must be embedded in the instruction stream. So you can use the movemask result to look up from a table of shuffle control vectors. SSE2 doesn't have any variable-control shuffles, only variable-count bit-shifts.
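
As a rough sketch of that table-lookup idea, assuming SSSE3 is available: each of the 16 possible movemask values indexes a 16-byte `pshufb` control vector, so the shuffle control is a runtime value rather than an immediate. The names `SHUF_LUT`, `LANE`, `ZERO4`, `TARGET_SSSE3` and `left_pack_ge0` are made up for this example. The `target("ssse3")` attribute is a GCC/Clang extension; with other compilers, enable SSSE3 through the build options instead.

```c
#include <assert.h>
#include <immintrin.h>
#include <stdint.h>

#if defined(__GNUC__) || defined(__clang__)
#define TARGET_SSSE3 __attribute__((target("ssse3")))
#else
#define TARGET_SSSE3 /* assumption: SSSE3 is enabled via compiler flags */
#endif

/* The four byte indices that select 32-bit lane n. */
#define LANE(n) 4 * (n), 4 * (n) + 1, 4 * (n) + 2, 4 * (n) + 3
/* A control byte with the high bit set makes pshufb write zero. */
#define ZERO4 0x80, 0x80, 0x80, 0x80

/* Indexed by the movemask result: bit i set = lane i is kept. */
static const uint8_t SHUF_LUT[16][16] = {
    { ZERO4,   ZERO4,   ZERO4,   ZERO4   },  /* 0000 */
    { LANE(0), ZERO4,   ZERO4,   ZERO4   },  /* 0001 */
    { LANE(1), ZERO4,   ZERO4,   ZERO4   },  /* 0010 */
    { LANE(0), LANE(1), ZERO4,   ZERO4   },  /* 0011 */
    { LANE(2), ZERO4,   ZERO4,   ZERO4   },  /* 0100 */
    { LANE(0), LANE(2), ZERO4,   ZERO4   },  /* 0101 */
    { LANE(1), LANE(2), ZERO4,   ZERO4   },  /* 0110 */
    { LANE(0), LANE(1), LANE(2), ZERO4   },  /* 0111 */
    { LANE(3), ZERO4,   ZERO4,   ZERO4   },  /* 1000 */
    { LANE(0), LANE(3), ZERO4,   ZERO4   },  /* 1001 */
    { LANE(1), LANE(3), ZERO4,   ZERO4   },  /* 1010 */
    { LANE(0), LANE(1), LANE(3), ZERO4   },  /* 1011 */
    { LANE(2), LANE(3), ZERO4,   ZERO4   },  /* 1100 */
    { LANE(0), LANE(2), LANE(3), ZERO4   },  /* 1101 */
    { LANE(1), LANE(2), LANE(3), ZERO4   },  /* 1110 */
    { LANE(0), LANE(1), LANE(2), LANE(3) },  /* 1111 */
};

/* Hypothetical helper: writes the elements of in[0..3] that are >= 0 to
 * the front of out[0..3], zero-fills the rest, and returns how many
 * elements were kept. */
TARGET_SSSE3
static int left_pack_ge0(const float *in, float *out)
{
    __m128 data = _mm_loadu_ps(in);
    int mask = _mm_movemask_ps(_mm_cmpge_ps(data, _mm_setzero_ps()));
    assert(mask >= 0 && mask < 16);

    /* Runtime lookup of the shuffle-control vector, then pshufb. */
    __m128i ctrl = _mm_loadu_si128((const __m128i *)SHUF_LUT[mask]);
    __m128i packed = _mm_shuffle_epi8(_mm_castps_si128(data), ctrl);
    _mm_storeu_ps(out, _mm_castsi128_ps(packed));

    /* Number of set bits in the 4-bit mask. */
    return (mask & 1) + ((mask >> 1) & 1) + ((mask >> 2) & 1) + (mask >> 3);
}
```

Calling `left_pack_ge0(input, output)` with the question's input would keep lanes 1 and 2, giving `{2.0f, 3.0f, 0.0f, 0.0f}` and a return value of 2. The returned count can be used to advance the output pointer when packing a longer array four floats at a time.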

Luca Polito
  • Is there no way of defining a compile-time lookup table/array of sorts that will let me do what I'm trying to do without endless if/else chains? – Timothy s Aug 27 '21 at 16:18
  • @Timothys: Yes, with SSSE3 `pshufb` or AVX1 `vpermilps`. Both of those use a shuffle-control *vector* (runtime variable) rather than an immediate constant that must be embedded in the instruction stream. So you can use the movemask result to look up from a table of shuffle control vectors. SSE2 doesn't have any variable-control shuffles, only variable-count bit-shifts. – Peter Cordes Aug 27 '21 at 16:21
  • @PeterCordes Thanks for the additional info, I've added your comment in my answer. – Luca Polito Aug 27 '21 at 16:33