
GOAL: identify intrinsics to convert 4 boolean "uint8_t" values using a minimum number of arithmetic operations, i.e., each {mask1 AND mask2}.

UPDATE: In order to optimize the code, I'm using SIMD in C++. In contrast to "Loading 8 chars from memory into an __m256 variable as packed single precision floats", the goal is to handle/support masks for massive arrays. The latter is exemplified using 'internal' mask properties ("https://software.intel.com/sites/landingpage/IntrinsicsGuide/#expand=10,13"):

  uint8_t mask1[4] = {0, 1, 1, 0};
  uint8_t mask2[4] = {1, 1, 0, 0};
  float data[4] = {5, 4, 2, 1};
  { //! Naive code which works:
    float sum = 0;
    for(int i = 0; i < 4; i++) {
      if(mask1[i] && mask2[i]) {sum += data[i];}
    }
  }

From the above we observe the use of masks combined with simple arithmetic. Though this set of operations is supported by optimized arithmetic, the 'internals' have several weaknesses: (a) they constrain the number of operations, and (b) they place requirements on updated compilers (which are not always available).

CONTEXT: The challenge concerns the conversion from the "char" data type to the "float" data type. In order to demonstrate the error in my code, here's a short extract:

//! Setup: wrong, as mask1 and mask2 are chars and not floats.
#include <emmintrin.h>
#include <x86intrin.h>                                                               

char mask1[4] = {0, 1, 0, 1};
char mask2[4] = {1, 0, 0, 1};
const int j = 0;

//! The logic, which is expected to work correctly for floats, ie, not chars.
const __m128 vec_empty_empty = _mm_set1_ps(0);              
const __m128 vec_empty_ones = _mm_set1_ps(1);
const __m128 term1  = _mm_load_ps((const float*)&mask1[j]);  // bug: reinterprets chars as floats
const __m128 term2  = _mm_load_ps((const float*)&mask2[j]);
__m128 vec_cmp_1 = _mm_cmplt_ps(term1, vec_empty_empty); 
__m128 vec_cmp_2 = _mm_cmplt_ps(term2, vec_empty_empty); 

//! Intersect the values: included to allow other 'empty values' than '1'.
vec_cmp_1 =  _mm_and_ps(vec_cmp_1, vec_empty_ones);
vec_cmp_2 = _mm_and_ps(vec_cmp_2, vec_empty_ones);

//! Separately for each 'cell' find the '1's which are in both:
__m128 mask = _mm_and_ps(vec_cmp_1, vec_cmp_2); 

The result of the above is to be used to intersect (i.e., multiply) a float vector float arr[4]. Therefore, if someone has any suggestions for how to convert a SIMD char vector into a float SIMD vector, I'd be more than thankful! ;)

  • Could you provide a non-simd [mcve](http://stackoverflow.com/help/mcve), including in- and expected outputs, of what you're trying to achieve? – Pixelchemist May 23 '16 at 23:45
  • Possible duplicate of [Loading 8 chars from memory into an \_\_m256 variable as packed single precision floats](http://stackoverflow.com/questions/34279513/loading-8-chars-from-memory-into-an-m256-variable-as-packed-single-precision-f) – Peter Cordes May 24 '16 at 01:03
  • Thanks for the answers: wrt. @Pixelchemist I have now made the question more detailed. – Ole Kristian Ekseth May 24 '16 at 13:37
  • Wrt. the suggestion by @PeterCordes: the referred-to answer only describes scalar operations, i.e., it does not cover the use of vector-based optimizations; in brief, that suggestion results in a more than 2x performance delay (when compared to the non-masked alternative). – Ole Kristian Ekseth May 24 '16 at 13:38
  • Ok, that's a different question from what I thought. Updated my answer. – Peter Cordes May 24 '16 at 14:05

1 Answer


Use SSE4.1 pmovsxbd or pmovzxbd to sign or zero extend a block of 4 bytes to a 16B vector of 32bit integer elements.

Note that using pmovzxbd (_mm_cvtepu8_epi32) as a load seems to be impossible to write both safely and efficiently, because there isn't an intrinsic with a narrower memory operand. (Update: Some modern compilers are able to fold a narrow load like _mm_loadu_si32 into a memory source operand for pmovzx, e.g. clang but not GCC: https://godbolt.org/z/KPxboPecr)

To do the comparison part, use pcmpeqd to generate a mask of all-zero or all-one bits in elements (i.e. -1). Use that to mask the vector of FP data. (all-zeros is the bit representation of 0.0 in IEEE floats, and 0.0 is the additive identity.)


If your elements are always just 0 or 1, you could use a uint32_t to hold all four bytes and use a scalar AND (C's & operator) as a SWAR implementation of all four mask1[i] && mask2[i] checks. Get that integer into a vector and pmovsxbd. This would work better if your elements were actually 0 and -1 (all-ones), otherwise you need an extra step to get a vector mask (e.g. pcmpeqb against an all-zero vector).

If you can't use -1 instead of 1, then your best bet is probably to still unpack both masks to 32bit elements and pcmpeqd.

The general idea is:

          // mask1 = _mm_loadu_si32(something)  // movd load if necessary
__m128i m1vec = _mm_cvtepi8_epi32(mask1);         // where mask1 has to be a __m128i vector already, not a 4byte memory location.
__m128i m2vec = _mm_cvtepi8_epi32(mask2);         // pmovsx

// sign-extension turns each 0 or -1 byte into a 0 or -1 dword (32bit) element

__m128i mask = _mm_and_si128(m1vec, m2vec);
// convert from 0/1 to 0/-1 if necessary.  I'm assuming the simple case.

__m128 masked_floats = _mm_and_ps(floats, _mm_castsi128_ps(mask));   // 0.0 or original value

sum = _mm_add_ps(sum, masked_floats);

If mask elements can be something other than 0 / -1, you might need to booleanize them each separately with _mm_cmpeq_epi32(m1vec, _mm_setzero_si128()) or something. (That turns non-zeros into zero and vice versa)

See the tag wiki for links, esp. https://software.intel.com/sites/landingpage/IntrinsicsGuide/

Peter Cordes