I am writing a C function with SSE2 intrinsics that checks which of four 32-bit integers are greater than zero and returns the result as a 16-bit mask. I am using the following code to do this:
#include <x86intrin.h>
#include <stdlib.h>
#include <stdio.h>
#include <stdint.h>

static void cmp_example(void) {
    /* _mm_load_si128 requires a 16-byte-aligned address */
    _Alignas(16) const uint32_t byte_vals[] = {0, 5, 0, 3};
    __m128i got_data = _mm_load_si128((__m128i const *)byte_vals);
    __m128i cmp_data = _mm_setzero_si128();
    /* signed 32-bit compare: each lane becomes all-ones or all-zeros */
    __m128i result = _mm_cmpgt_epi32(got_data, cmp_data);
    /* one mask bit per byte of the vector, 16 bits total */
    int mask_result = _mm_movemask_epi8(result);
    printf("Result 0x%x\n", mask_result & 0xFFFF);
}

int main(void) {
    cmp_example();
    return 0;
}
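For context, the scalar logic I am trying to vectorize is essentially the following sketch (the helper name is my own, and it produces one flag bit per element rather than the one-bit-per-byte layout that _mm_movemask_epi8 gives):

/* Plain-C sketch of the comparison: flag each 32-bit element that is
 * greater than zero. For {0, 5, 0, 3} this gives 0b1010. */
static int greater_than_zero_flags(const uint32_t vals[4]) {
    int flags = 0;
    for (int i = 0; i < 4; i++) {
        if ((int32_t)vals[i] > 0)   /* _mm_cmpgt_epi32 is a signed compare */
            flags |= 1 << i;
    }
    return flags;
}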
However, when I compile and run the SSE2 code above, it prints 0xf0f0. I would expect the result to follow the same order in which the values were loaded from memory. To check a little further, I added some debugging statements, as follows:
_Alignas(16) const uint32_t byte_vals[] = {0, 5, 0, 3};
__m128i got_data = _mm_load_si128((__m128i const *)byte_vals);
/* indexing an __m128i like this relies on the GCC/Clang vector extension;
 * each index selects one 64-bit half of the register */
printf("0x%llx 0x%llx\n", got_data[0], got_data[1]);
__m128i cmp_data = _mm_setzero_si128();
__m128i result = _mm_cmpgt_epi32(got_data, cmp_data);
printf("0x%llx 0x%llx\n", result[0], result[1]);
int mask_result = _mm_movemask_epi8(result);
printf("Result 0x%x\n", mask_result & 0xFFFF);
This run prints:
0x500000000 0x300000000
0xffffffff00000000 0xffffffff00000000
Result 0xf0f0
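(For reference, the lanes could also be dumped individually by storing the vector back into a plain array, e.g. with a helper like the sketch below; dump_lanes is just a name I made up, it uses the same headers as above, and the output shown here comes from the %llx prints, not from this helper.)

/* Sketch: print the four 32-bit lanes of a vector, lane 0 first. */
static void dump_lanes(const char *label, __m128i v) {
    uint32_t lanes[4];
    _mm_storeu_si128((__m128i *)lanes, v);   /* unaligned store into a plain array */
    printf("%s: 0x%x 0x%x 0x%x 0x%x\n", label,
           (unsigned)lanes[0], (unsigned)lanes[1],
           (unsigned)lanes[2], (unsigned)lanes[3]);
}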
From this output, it seems the culprit is _mm_load_si128.
Based on this, how can I get _mm_load_si128 to load data in the same order as it is laid out in memory?