GOAL: identify intrinsics to convert 4 boolean "uint8_t" values using a minimum number of arithmetic operations, i.e., for each element {mask1 AND mask2}.
UPDATE: In order to optimize the code, I'm using SIMD in C++. In contrast to the question "Loading 8 chars from memory into an __m256 variable as packed single precision floats", the goal here is to support masks for massive arrays. The latter is exemplified using 'internal' mask-properties ("https://software.intel.com/sites/landingpage/IntrinsicsGuide/#expand=10,13"):
uint8_t mask1[4] = {0, 1, 1, 0}; uint8_t mask2[4] = {1, 1, 0, 0}; float data[4] = {5, 4, 2, 1};
{ //! Naive code which works:
float sum = 0;
for(int i = 0; i < 4; i++) {
if(mask1[i] && mask2[i]) {sum += data[i];}
}
}
From the above we observe the use of masks combined with simple arithmetic. Although this set of operations is supported by optimized arithmetic, the 'internals' have several weaknesses: (a) they constrain the number of operations and (b) they require up-to-date compilers (which are not always available).
CONTEXT: The challenge concerns the conversion from the "char" data type to the "float" data type. To demonstrate the error in my code, here's a short extract:
//! Setup, a setup which is wrong as mask1 and mask2 are chars and not floats.
#include <emmintrin.h>
#include <x86intrin.h>
char mask1[4] = {0, 1, 0, 1};
char mask2[4] = {1, 0, 0, 1};
const int j = 0;
//! The logic, which is expected to work correctly for floats, i.e., not chars.
const __m128 vec_empty_empty = _mm_set1_ps(0);
const __m128 vec_empty_ones = _mm_set1_ps(1);
const __m128 term1 = _mm_load_ps(&rmul1[j2]);
const __m128 term2 = _mm_load_ps(&rmul2[j2]);
__m128 vec_cmp_1 = _mm_cmplt_ps(term1, vec_empty_empty);
__m128 vec_cmp_2 = _mm_cmplt_ps(term2, vec_empty_empty);
//! Intersect the values: included to allow other 'empty values' than '1'.
vec_cmp_1 = _mm_and_ps(vec_cmp_1, vec_empty_ones);
vec_cmp_2 = _mm_and_ps(vec_cmp_2, vec_empty_ones);
//! Separately for each 'cell' find the '1's which are in both:
__m128 mask = _mm_and_ps(vec_cmp_1, vec_cmp_2);
The result of the above is to be used to intersect (i.e., multiply) a float vector float arr[4]. Therefore, if anyone has suggestions for how to convert a SIMD char vector into a float SIMD vector, I'd be more than thankful! ;)
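For reference, one possible way to do the char-to-float conversion with SSE2 only is sketched below. This is a minimal sketch under assumptions, not a definitive solution: the helper name masked_sum4 is hypothetical, every mask byte is assumed to be exactly 0 or 1 (so a bitwise AND matches the logical AND of the naive loop), and the target is assumed to be little-endian x86. The four bytes of each mask are combined with a single scalar AND, widened to 32-bit integers by interleaving with zeros, and then converted to floats with _mm_cvtepi32_ps.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <cstdint>
#include <cstring>

// Hypothetical helper: returns the sum of data[i] over the lanes where
// mask1[i] and mask2[i] are both 1.
// Assumption: every mask byte is exactly 0 or 1.
static inline float masked_sum4(const uint8_t *mask1,
                                const uint8_t *mask2,
                                const float *data) {
    // Read the four mask bytes of each array as one 32-bit word.
    uint32_t m1, m2;
    std::memcpy(&m1, mask1, sizeof m1);
    std::memcpy(&m2, mask2, sizeof m2);

    // One scalar AND intersects all four byte lanes at once.
    const __m128i bytes = _mm_cvtsi32_si128(static_cast<int>(m1 & m2));

    // Zero-extend: 8-bit -> 16-bit -> 32-bit, keeping lane order.
    const __m128i zero   = _mm_setzero_si128();
    const __m128i words  = _mm_unpacklo_epi8(bytes, zero);
    const __m128i dwords = _mm_unpacklo_epi16(words, zero);

    // Convert the 32-bit integers (0 or 1) to floats (0.0f or 1.0f).
    const __m128 keep = _mm_cvtepi32_ps(dwords);

    // Intersect (multiply) with the data; unaligned load avoids
    // requiring 16-byte alignment of the caller's array.
    const __m128 selected = _mm_mul_ps(_mm_loadu_ps(data), keep);

    // Horizontal sum through a small temporary buffer.
    float lanes[4];
    _mm_storeu_ps(lanes, selected);
    return (lanes[0] + lanes[1]) + (lanes[2] + lanes[3]);
}
```

With the arrays from the naive example (mask1 = {0, 1, 1, 0}, mask2 = {1, 1, 0, 0}, data = {5, 4, 2, 1}), only index 1 is set in both masks, so masked_sum4(mask1, mask2, data) yields the same value as the naive loop. If SSE4.1 is available, the two unpack steps could presumably be replaced by a single _mm_cvtepu8_epi32.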