I would like to vectorize an equality test in which every element of a vector is compared against the same value and the results are written to an array of 8-bit words. Each 8-bit word in the resulting array should be zero or one. (This is a little wasteful, but bit-packing the booleans is not an important detail in this problem.) This function can be written as:
#include <stdint.h>

void vecEq (uint8_t* numbers, uint8_t* results, int len, uint8_t target) {
    for(int i = 0; i < len; i++) {
        results[i] = numbers[i] == target;
    }
}
If we knew that both vectors were 256-bit aligned, we could start by broadcasting target into an AVX register and then use the AVX2 intrinsic _mm256_cmpeq_epi8 to perform 32 equality tests at a time. However, in the setting I'm working in, both numbers and results have been allocated by a runtime (the GHC runtime, but this is irrelevant). They are both guaranteed to be 64-bit aligned. Is there any way to vectorize this operation, preferably without using AVX registers?
The approach I've considered is broadcasting the 8-bit word to a 64-bit word up front and then XORing it with 8 elements at a time. This doesn't work, though, because I cannot find a vectorized way to convert the result of the XOR (zero means equal, anything else means unequal) to the equality test result I need (0 means unequal, 1 means equal, nothing else should ever exist). Roughly, the sketch I have is:
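void vecEq (uint64_t* numbers, uint64_t* results, int len, uint8_t target) {
    /* len counts 64-bit words here, i.e. groups of 8 elements */
    uint64_t targetA = (uint64_t)target;
    /* copy target into each of the 8 byte lanes */
    uint64_t targetB = targetA<<56 | targetA<<48 | targetA<<40 | targetA<<32 | targetA<<24 | targetA<<16 | targetA<<8 | targetA;
    for(int i = 0; i < len; i++) {
        /* a byte lane of tmp is zero exactly where numbers matched target */
        uint64_t tmp = numbers[i] ^ targetB;
        results[i] = ... something with tmp ...;
    }
}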
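For reference, one candidate for the missing step is a carry-free SWAR (SIMD within a register) zero-byte test. The following is only a minimal sketch, not a benchmarked solution: the names vecEq64 and zero_bytes_to_ones are hypothetical, len is assumed to count 64-bit words, and the buffers are assumed to be legitimately accessible as uint64_t (alignment and aliasing are the caller's responsibility).

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: returns 0x01 in every byte lane of x that is 0x00,
   and 0x00 in every other lane. */
static inline uint64_t zero_bytes_to_ones(uint64_t x) {
    const uint64_t low7 = UINT64_C(0x7F7F7F7F7F7F7F7F);
    /* Adding 0x7F to the low 7 bits sets bit 7 of a lane iff any of its
       low 7 bits were set. Each lane sums to at most 0xFE, so no carry
       crosses into the neighbouring lane. */
    uint64_t t = (x & low7) + low7;
    /* OR in x so that bit 7 is also set when the lane's own bit 7 was set.
       Bit 7 of each lane is now 1 iff that lane of x is nonzero. */
    t |= x;
    /* Invert, move bit 7 of each lane down to bit 0, and drop the rest. */
    return (~t >> 7) & UINT64_C(0x0101010101010101);
}

/* Hypothetical name; compares 8 bytes per iteration against target. */
void vecEq64(const uint64_t *numbers, uint64_t *results, size_t len,
             uint8_t target) {
    /* Multiplying by 0x0101...01 copies target into all 8 byte lanes. */
    const uint64_t targetB = UINT64_C(0x0101010101010101) * target;
    for (size_t i = 0; i < len; i++) {
        results[i] = zero_bytes_to_ones(numbers[i] ^ targetB);
    }
}
```

Because every step operates lane by lane inside the word, the result does not depend on the machine's byte order. The sketch uses only plain 64-bit integer arithmetic, so it needs no AVX registers and only the 64-bit alignment that is already guaranteed.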