I am fiddling with AVX2 to write some code able to search for 32 bits hash in an array with 14 entries and return the index of the found entry.
Because most likely the vast majority of the hits will be within the first 8 entries of the array this code can already be improved adding the usage of __builtin_expect this is not my priority right now.
While the array of hashes (in the code represented by the variable hashes) will always be 14 entries long, it's contained in a struct of this kind
typedef struct chain_ring chain_ring_t;
struct chain_ring {
uint32_t hashes[14];
chain_ring_t* next;
...other stuff...
} __attribute__((aligned(16)))
Here the code
int8_t hash32_find_14_avx2(uint32_t hash, volatile uint32_t* hashes) {
uint32_t compacted_result_mask, leading_zeroes;
__m256i cmp_vector, ring_vector, result_mask_vector;
int8_t found_index = -1;
if (hashes[0] == hash) {
return 0;
}
for(uint8_t base_index = 0; base_index < 14; base_index += 8) {
cmp_vector = _mm256_set1_epi32(hash);
ring_vector = _mm256_stream_load_si256((__m256i*) (hashes + base_index));
result_mask_vector = _mm256_cmpeq_epi32(ring_vector, cmp_vector);
compacted_result_mask = _mm256_movemask_epi8(result_mask_vector);
if (compacted_result_mask != 0) {
leading_zeroes = 32 - __builtin_clz(compacted_result_mask);
found_index = base_index + (leading_zeroes >> 2u) - 1;
break;
}
}
return found_index > 13 ? -1 : found_index;
}
The logic, briefly explained, it searches on the first 8 entries and then on the second 8 entries. If the found index is greater than 13 it means that it found a match with some stuff that wasn't part of the array and therefore has to be considered not-matching.
Notes:
- to speedup the load (from aligned memory) I am using _mm256_stream_load_si256
- because of the above mentioned, I need to check if by any chance the returned value is greater than 13 and I don't really like this specific part too much, should I use _mm256_maskload_epi32?
- I am using a for-loop to avoid repeating the code, gcc of course will unroll the loop
- I am using __builtin_clz but I am compiling the code with -mlzcnt because AMD cpus, as far I have read, are way slower to run the bsr instruction, gcc is using lzcnt instead of bsr with the flag
- the very first IF introduced a delay of about 0.30 ns in average but in average it reduce by 0.6ns the time for the first match
- the code is only for 64bit machines
- at some point I will need to optimize this code for aarch64
Here a nice link to godbolt for the produced assembly https://godbolt.org/z/5bxbN6
I implemented the SSE version as well (it's in the gist) but the logic is the same, although I am not really sure it's performance worth
For reference, I built a simple linear search function and compared the performances with it using the google-benchmark lib
int8_t hash32_find_14_loop(uint32_t hash, volatile uint32_t* hashes) {
for(uint8_t index = 0; index <= 14; index++) {
if (hashes[index] == hash) {
return index;
}
}
return -1;
}
The full code is available at this url https://gist.github.com/danielealbano/9fcbc1ff0a42cc9ad61be205366bdb5f
Apart from the necessary flags for the google-benchmark library, I am compiling it using -avx2 -avx -msse4 -O3 -mbmi -mlzcnt
A bench for each element is performed (I wanted to compare the loop vs the alternatives)
----------------------------------------------------------------------------------------------------
Benchmark Time CPU Iterations
----------------------------------------------------------------------------------------------------
bench_template_hash32_find_14_loop/0/iterations:100000000 0.610 ns 0.610 ns 100000000
bench_template_hash32_find_14_loop/1/iterations:100000000 1.16 ns 1.16 ns 100000000
bench_template_hash32_find_14_loop/2/iterations:100000000 1.18 ns 1.18 ns 100000000
bench_template_hash32_find_14_loop/3/iterations:100000000 1.19 ns 1.19 ns 100000000
bench_template_hash32_find_14_loop/4/iterations:100000000 1.28 ns 1.28 ns 100000000
bench_template_hash32_find_14_loop/5/iterations:100000000 1.26 ns 1.26 ns 100000000
bench_template_hash32_find_14_loop/6/iterations:100000000 1.52 ns 1.52 ns 100000000
bench_template_hash32_find_14_loop/7/iterations:100000000 2.15 ns 2.15 ns 100000000
bench_template_hash32_find_14_loop/8/iterations:100000000 1.66 ns 1.66 ns 100000000
bench_template_hash32_find_14_loop/9/iterations:100000000 1.67 ns 1.67 ns 100000000
bench_template_hash32_find_14_loop/10/iterations:100000000 1.90 ns 1.90 ns 100000000
bench_template_hash32_find_14_loop/11/iterations:100000000 1.89 ns 1.89 ns 100000000
bench_template_hash32_find_14_loop/12/iterations:100000000 2.13 ns 2.13 ns 100000000
bench_template_hash32_find_14_loop/13/iterations:100000000 2.20 ns 2.20 ns 100000000
bench_template_hash32_find_14_loop/14/iterations:100000000 2.32 ns 2.32 ns 100000000
bench_template_hash32_find_14_loop/15/iterations:100000000 2.53 ns 2.53 ns 100000000
bench_template_hash32_find_14_sse/0/iterations:100000000 0.531 ns 0.531 ns 100000000
bench_template_hash32_find_14_sse/1/iterations:100000000 1.42 ns 1.42 ns 100000000
bench_template_hash32_find_14_sse/2/iterations:100000000 2.53 ns 2.53 ns 100000000
bench_template_hash32_find_14_sse/3/iterations:100000000 1.45 ns 1.45 ns 100000000
bench_template_hash32_find_14_sse/4/iterations:100000000 2.26 ns 2.26 ns 100000000
bench_template_hash32_find_14_sse/5/iterations:100000000 1.90 ns 1.90 ns 100000000
bench_template_hash32_find_14_sse/6/iterations:100000000 1.90 ns 1.90 ns 100000000
bench_template_hash32_find_14_sse/7/iterations:100000000 1.93 ns 1.93 ns 100000000
bench_template_hash32_find_14_sse/8/iterations:100000000 2.07 ns 2.07 ns 100000000
bench_template_hash32_find_14_sse/9/iterations:100000000 2.05 ns 2.05 ns 100000000
bench_template_hash32_find_14_sse/10/iterations:100000000 2.08 ns 2.08 ns 100000000
bench_template_hash32_find_14_sse/11/iterations:100000000 2.08 ns 2.08 ns 100000000
bench_template_hash32_find_14_sse/12/iterations:100000000 2.55 ns 2.55 ns 100000000
bench_template_hash32_find_14_sse/13/iterations:100000000 2.53 ns 2.53 ns 100000000
bench_template_hash32_find_14_sse/14/iterations:100000000 2.37 ns 2.37 ns 100000000
bench_template_hash32_find_14_sse/15/iterations:100000000 2.59 ns 2.59 ns 100000000
bench_template_hash32_find_14_avx2/0/iterations:100000000 0.537 ns 0.537 ns 100000000
bench_template_hash32_find_14_avx2/1/iterations:100000000 1.37 ns 1.37 ns 100000000
bench_template_hash32_find_14_avx2/2/iterations:100000000 1.38 ns 1.38 ns 100000000
bench_template_hash32_find_14_avx2/3/iterations:100000000 1.36 ns 1.36 ns 100000000
bench_template_hash32_find_14_avx2/4/iterations:100000000 1.37 ns 1.37 ns 100000000
bench_template_hash32_find_14_avx2/5/iterations:100000000 1.38 ns 1.38 ns 100000000
bench_template_hash32_find_14_avx2/6/iterations:100000000 1.40 ns 1.40 ns 100000000
bench_template_hash32_find_14_avx2/7/iterations:100000000 1.39 ns 1.39 ns 100000000
bench_template_hash32_find_14_avx2/8/iterations:100000000 1.99 ns 1.99 ns 100000000
bench_template_hash32_find_14_avx2/9/iterations:100000000 2.02 ns 2.02 ns 100000000
bench_template_hash32_find_14_avx2/10/iterations:100000000 1.98 ns 1.98 ns 100000000
bench_template_hash32_find_14_avx2/11/iterations:100000000 1.98 ns 1.98 ns 100000000
bench_template_hash32_find_14_avx2/12/iterations:100000000 2.03 ns 2.03 ns 100000000
bench_template_hash32_find_14_avx2/13/iterations:100000000 1.98 ns 1.98 ns 100000000
bench_template_hash32_find_14_avx2/14/iterations:100000000 1.96 ns 1.96 ns 100000000
bench_template_hash32_find_14_avx2/15/iterations:100000000 1.97 ns 1.97 ns 100000000
Thanks for any suggestion!
--- UPDATE
I have updated the gist with the branchless implementation made by @chtz and replaced __lzcnt32 with _tzcnt_u32, I had to slightly change the behaviour to consider not-found when 32 is returned instead of -1 but doesn't really matter.
The CPU on which they ran is an Intel Core i7 8700 (6c/12t, 3.20GHZ).
The bench uses cpu-pinning, uses more thread than physical or logical cpu cores and performs some additional operations, specifically a for loop, so there is overhead but it's the same between the two tests so it should impact them in the same way.
If you want to run the test you need to tune the CPU_CORE_LOGICAL_COUNT to manually match the number of the logical cpu cores of your cpu.
It's interesting to see how the performance improvement jumps from +17% to +41% when there is more contention (from single thread to 64 threads). I have ran a few more tests with 128 and 256 threads seeing up to a +60% speed improvement when using AVX2, but I haven't included the numbers below.
(bench_template_hash32_find_14_avx2 is benching the branchless version, I have shortened the name to make the post more readable)
------------------------------------------------------------------------------------------
Benchmark CPU Iterations
------------------------------------------------------------------------------------------
bench_template_hash32_find_14_loop/iterations:10000000/threads:1 45.2 ns 10000000
bench_template_hash32_find_14_loop/iterations:10000000/threads:2 50.4 ns 20000000
bench_template_hash32_find_14_loop/iterations:10000000/threads:4 52.1 ns 40000000
bench_template_hash32_find_14_loop/iterations:10000000/threads:8 70.9 ns 80000000
bench_template_hash32_find_14_loop/iterations:10000000/threads:16 86.8 ns 160000000
bench_template_hash32_find_14_loop/iterations:10000000/threads:32 87.3 ns 320000000
bench_template_hash32_find_14_loop/iterations:10000000/threads:64 92.9 ns 640000000
bench_template_hash32_find_14_avx2/iterations:10000000/threads:1 38.4 ns 10000000
bench_template_hash32_find_14_avx2/iterations:10000000/threads:2 42.1 ns 20000000
bench_template_hash32_find_14_avx2/iterations:10000000/threads:4 46.5 ns 40000000
bench_template_hash32_find_14_avx2/iterations:10000000/threads:8 52.6 ns 80000000
bench_template_hash32_find_14_avx2/iterations:10000000/threads:16 60.0 ns 160000000
bench_template_hash32_find_14_avx2/iterations:10000000/threads:32 62.1 ns 320000000
bench_template_hash32_find_14_avx2/iterations:10000000/threads:64 65.8 ns 640000000