I have a loop with the following structure :
- Calculate a byte array with length k (somewhere slow)
- Find if the calculated byte array matches any in a list of N byte arrays I have.
- Repeat
My loop is to be called many many times (it's the main loop of my program), and I want the second step to be as fast as possible.
The naive implementation for the second step would be using memcmp
:
char* calc;
char** list;
int k, n, i;
for(i = 0; i < n; i++) {
if (!memcmp(calc, list[i], k)) {
printf("Matches array %d", i);
}
}
Can you think of any faster way ? A few things :
- My list is fixed at the start of my program, any precomputation on it is fine.
- Let's assume that k is small (<= 64), N is moderate (around 100-1000).
- Performance is the goal here, and portability is a non issue. Intrinsics/inline assembly is fine, as long as it's faster.
Here are a few thoughts that I had :
- Given k<64 and I'm on x86_64, I could sort my lookup array as a long array, and do a binary search on it. O(log(n)). Even if k was big, I could sort my lookup array and do this binary search using
memcmp
. - Given k is small, again, I could compute a 8/16/32 bits checksum (the simplest being folding my arrays over themselves using a
xor
) of all my lookup arrays and use a built-inPCMPGT
as in How to compare more than two numbers in parallel?. I know SSE4.2 is available here.
Do you think going for vectorization/sse is a good idea here ? If yes, what do you think is the best approach. I'd like to say that this isn't early optimization, but performance is crucial here, I need the outer loop to be as fast as possible. Thanks
EDIT1: It looks like http://schani.wordpress.com/tag/c-optimization-linear-binary-search-sse2-simd/ provides some interesting thoughts about it. Binary search on a list of long
seems the way to go..