I am trying to vectorize a logical validation problem to run on Intel 64.
I will first try to describe the problem:
I have a static array v[]
of 70-bit integers (appx 400,000 of them) which are all known at compile time.
A producer creates 70-bit integers a
, a lot of them, very quickly.
For each a
I need to find out if there exists an element from v
for which v[i] & a == 0
.
So far my implementation in C is something like this (simplified):
for (; *v; v++) {
if (!(a & *v))
return FOUND;
}
// a had no matching element in v
return NOT_FOUND;
I am looking into optimizing this using SSE/AVX to speed up the process and do more of those tests in parallel. I got as far as loading a
and *v
into an XMM
register each and calling the PTEST
instruction to do the validation.
I am wondering if there is a way to expand this to use all 256 bits of the new YMM
registers?
Maybe packing 3x70 bits into a single register?
I can't quite figure out though how to pack/unpack them efficient enough to justify not just using one register per test.
A couple things that we know about the nature of the input:
- All elements in
v[]
have very few bits set - It is not possible to permute/compress
v[]
in any way to make it use less then 70 bits - The
FOUND
condition is expected to be satisfied after checking appx 20% onv[]
on average. - It is possible to buffer more then one
a
before checking them in a batch. - I do not necessarily need to know which element of
v[]
matched, only that one did or not. - Producing
a
requires very little memory, so anything left in L1 from the previous call is likely to still be there.
The resulting code is intended to be ran on the newest generation of Intel Xeon processors supporting SSE4.2, AVX instructions. I will be happy to accept assembly or C that compiles with Intel C compiler or at least GCC.