
I am trying to vectorize a logical validation problem to run on Intel 64.

I will first try to describe the problem:

I have a static array v[] of 70-bit integers (approx. 400,000 of them), all of which are known at compile time.

A producer creates 70-bit integers a, a lot of them, very quickly.

For each a I need to find out if there exists an element from v for which v[i] & a == 0.

So far my implementation in C is something like this (simplified):

for (; *v; v++) {
    if (!(a & *v))
        return FOUND;
}

// a had no matching element in v
return NOT_FOUND;

I am looking into optimizing this using SSE/AVX to speed up the process and do more of those tests in parallel. I got as far as loading a and *v into an XMM register each and calling the PTEST instruction to do the validation.
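
Roughly, that PTEST version looks like the sketch below. It assumes each 70-bit value is stored in a 16-byte slot (low 64 bits in the first qword, the remaining 6 bits in the second) and that v[] is 16-byte aligned; the function and variable names are only illustrative:

#include <stddef.h>
#include <smmintrin.h>  /* SSE4.1: _mm_testz_si128 compiles to PTEST */

/* Returns 1 (FOUND) if some v[i] & a == 0, otherwise 0 (NOT_FOUND). */
static int find_disjoint(const __m128i *v, size_t n, __m128i a)
{
    for (size_t i = 0; i < n; i++) {
        /* PTEST sets ZF when every bit of (v[i] & a) is zero */
        if (_mm_testz_si128(_mm_load_si128(&v[i]), a))
            return 1;
    }
    return 0;
}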

I am wondering if there is a way to expand this to use all 256 bits of the new YMM registers? Maybe packing 3x70 bits into a single register? I can't quite figure out, though, how to pack/unpack them efficiently enough to justify not just using one register per test.

A couple of things we know about the nature of the input:

  • All elements in v[] have very few bits set
  • It is not possible to permute/compress v[] in any way to make it use less than 70 bits
  • The FOUND condition is expected to be satisfied after checking approx. 20% of v[] on average.
  • It is possible to buffer more than one a before checking them in a batch.
  • I do not necessarily need to know which element of v[] matched, only whether one did.
  • Producing a requires very little memory, so anything left in L1 from the previous call is likely to still be there.

The resulting code is intended to run on the newest generation of Intel Xeon processors, which support the SSE4.2 and AVX instruction sets. I will be happy to accept assembly or C that compiles with the Intel C compiler or at least GCC.

Sergey L.
  • If all you're doing is a simple AND'ing operation, the bottleneck is probably going to be in the producer code. Packing 70-bit data into bit-streams is going to be more expensive than just checking if they are zero. – Mysticial Oct 14 '12 at 17:08
  • Also, 256-bit integer operations are in AVX2. So you could mess with the 256-bit floating-point bitwise instructions, but they have lower throughput than the 128-bit integer bitwise instructions. So I doubt forcing yourself to use 256-bit vectors is gonna help at all. – Mysticial Oct 14 '12 at 17:11
  • Make sure you don't have redundant values in `v`. If `(v[i] & v[j]) == v[i]`, then you don't need to test `v[j]`. – ughoavgfhw Oct 14 '12 at 18:01
  • @ughoavgfhw this is not the case. `v[]` is 100% non-redundant and non-compressible. – Sergey L. Oct 14 '12 at 18:20
  • This problem was already discussed in another question: ["efficiently find the first element matching a bit mask"](http://stackoverflow.com/q/9246017/1009831). Most promising are two answers by wildplasser (one is easily vectorizable, the other uses a more optimal algorithm). – Evgeny Kluev Oct 14 '12 at 19:30
  • What are the computing resources you have available? If there are only 25 bits set in each `a`, you could precalculate the result for each possible a - if you have the required 70!/(45!*25!) bits ~= 800 PByte of storage. – Gunther Piez Oct 15 '12 at 00:05
  • @hirschhornsalz I have a cluster of around 200 nodes with 48 cores each and 8 GB memory per node. Disk storage is virtually unavailable, at least not in that quantity. – Sergey L. Oct 15 '12 at 13:38
  • @EvgenyKluev this is actually not a bad idea. Will try that. I still want to know if there is an efficient way of vectorizing this, in case I encounter a similar vectorization problem later. – Sergey L. Oct 15 '12 at 13:39

1 Answer


This sounds like what you really need is a better data structure to store v[], so that searches take less than linear time.

Consider that if (v[0] & v[1]) & a is not zero, then neither (v[0] & a) nor (v[1] & a) can be zero. This means it is possible to create a tree structure where the v[] are the leaves, and the parent nodes are the AND combination of their children. Then, if parentNode & a gives you a non-zero value, you can skip looking at the children.

However, this isn't necessarily helpful - the parent node only ends up testing the bits common between the children, so if there are only a few of those, you still end up testing lots of leaf nodes. But if you can find clusters in your data set and group many similar v[] under a common parent, this may drastically reduce the number of comparisons you have to do.

On the other hand, such a tree search involves a lot of conditional branches (expensive), and would be hard to vectorize. I'd first check whether you can get away with just two levels: first do a vectorized search among the cluster parent nodes, then for each match do a search for the entries in that cluster.
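
A rough scalar sketch of the two-level idea, just to make it concrete - the val70 layout, the cluster struct and all names here are illustrative assumptions, not taken from your code:

#include <stddef.h>
#include <stdint.h>

/* One 70-bit value: low 64 bits plus the 6 high bits. */
typedef struct { uint64_t lo; uint8_t hi; } val70;

typedef struct {
    val70        parent;   /* AND of all members of the cluster */
    const val70 *members;
    size_t       count;
} cluster;

/* Returns 1 if some member m of some cluster satisfies (m & a) == 0. */
static int search_clusters(const cluster *c, size_t nclusters, val70 a)
{
    for (size_t i = 0; i < nclusters; i++) {
        /* If the parent already intersects a, every member does too: skip. */
        if ((c[i].parent.lo & a.lo) || (c[i].parent.hi & a.hi))
            continue;
        for (size_t j = 0; j < c[i].count; j++) {
            const val70 *m = &c[i].members[j];
            if (!((m->lo & a.lo) || (m->hi & a.hi)))
                return 1;   /* FOUND */
        }
    }
    return 0;               /* NOT_FOUND */
}

The inner loop over a cluster's members could then be replaced by the vectorized PTEST loop from the question.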


Actually here's another idea, to help with the fact that 70 bits don't fit well into registers: You could split v[] into 64 (=2^6) different arrays. Of the 70 bits in the original v[], the 6 most significant bits are used to determine which array will contain the value, and only the remaining 64 bits are actually stored in the array.

By testing the mask a against the array indices, you will know which of the 64 arrays to search (in the worst case, if a doesn't have any of the 6 highest bits set, that'll be all of them), and each individual array search deals only with 64 bits per element (much easier to pack).
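
A small sketch of how that bucketed search might look - again, the layout and the names are illustrative, not a definitive implementation:

#include <stddef.h>
#include <stdint.h>

/* buckets[k] holds the low 64 bits of every v[] whose top 6 bits equal k. */
typedef struct {
    const uint64_t *lo;
    size_t          count;
} bucket;

/* a_hi = top 6 bits of a, a_lo = low 64 bits of a.
 * Returns 1 if some original v[i] satisfies (v[i] & a) == 0. */
static int search_buckets(const bucket buckets[64], unsigned a_hi, uint64_t a_lo)
{
    for (unsigned k = 0; k < 64; k++) {
        /* If the bucket's 6-bit index intersects a's top bits, no element
         * in this bucket can AND to zero with a, so skip the whole array. */
        if (k & a_hi)
            continue;
        for (size_t j = 0; j < buckets[k].count; j++)
            if ((buckets[k].lo[j] & a_lo) == 0)
                return 1;   /* FOUND */
    }
    return 0;               /* NOT_FOUND */
}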

In fact this second approach could be generalized into a tree structure as well, which would give you some sort of trie.

Daniel
  • The first idea might work, but I am unsure how well. I have just rechecked the data. Each element in `v[]` has only 4 bits set. They are generated as an orbit of a permutation group acting on 70 elements. The last idea is nice to get around the 70-bit problem, but it will produce only 1 giant array (for the 0 case) and very small arrays in the other cases. `a`'s have between 20 and 25 bits set. – Sergey L. Oct 14 '12 at 18:53