
I am fiddling with AVX2 to write some code that searches for a 32-bit hash in a 14-entry array and returns the index of the found entry.

Because the vast majority of hits will most likely be within the first 8 entries of the array, this code could already be improved by adding __builtin_expect, but that is not my priority right now.
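
For illustration, here is a minimal sketch of that __builtin_expect idea (the likely/unlikely macro names and the helper below are mine, not code from the gist):

#include <stdint.h>

// the usual likely()/unlikely() wrappers around __builtin_expect
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

// hypothetical usage: hint that a non-zero match mask from the first
// 8-entry block is the common case
static inline int match_in_first_block(uint32_t compacted_result_mask) {
    return likely(compacted_result_mask != 0);
}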

While the array of hashes (represented in the code by the variable hashes) will always be 14 entries long, it's contained in a struct of this kind:

typedef struct chain_ring chain_ring_t;
struct chain_ring {
    uint32_t hashes[14];
    chain_ring_t* next;
    ...other stuff...
} __attribute__((aligned(16)));

Here is the code:

int8_t hash32_find_14_avx2(uint32_t hash, volatile uint32_t* hashes) {
    uint32_t compacted_result_mask, leading_zeroes;
    __m256i cmp_vector, ring_vector, result_mask_vector;
    int8_t found_index = -1;

    // early exit for a match in the very first entry
    if (hashes[0] == hash) {
        return 0;
    }

    // first pass covers entries 0..7, second pass entries 8..15
    // (entries 14 and 15 are past the array, hence the final > 13 check)
    for(uint8_t base_index = 0; base_index < 14; base_index += 8) {
        cmp_vector = _mm256_set1_epi32(hash);
        ring_vector = _mm256_stream_load_si256((__m256i*) (hashes + base_index));

        result_mask_vector = _mm256_cmpeq_epi32(ring_vector, cmp_vector);
        compacted_result_mask = _mm256_movemask_epi8(result_mask_vector);

        if (compacted_result_mask != 0) {
            // 4 mask bits per 32-bit element: turn the highest set bit
            // into an element index within the array
            leading_zeroes = 32 - __builtin_clz(compacted_result_mask);
            found_index = base_index + (leading_zeroes >> 2u) - 1;
            break;
        }
    }

    return found_index > 13 ? -1 : found_index;
}

The logic, briefly explained: it searches the first 8 entries and then the second 8 entries. If the found index is greater than 13, the match came from data that isn't part of the array and therefore has to be considered not matching.
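
To make the index math concrete, here is a worked example with hypothetical values (the little helper below is illustrative, not part of the gist). A match in element 3 of the first 8-entry block sets bytes 12..15 of the compare result, so _mm256_movemask_epi8 returns 0x0000F000:

#include <assert.h>

static void index_math_example(void) {
    unsigned int compacted_result_mask = 0x0000F000u;                        // 4 mask bits per matching 32-bit element
    unsigned int leading_zeroes = 32 - __builtin_clz(compacted_result_mask); // 32 - 16 = 16
    int found_index = 0 + (leading_zeroes >> 2u) - 1;                        // base_index 0 + 4 - 1 = 3
    assert(found_index == 3);
}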

Notes:

  • to speed up the load (from aligned memory) I am using _mm256_stream_load_si256
  • because of the above, I need to check if by any chance the returned value is greater than 13, and I don't really like this specific part too much. Should I use _mm256_maskload_epi32 instead? (see the sketch after this list)
  • I am using a for loop to avoid repeating the code; gcc will of course unroll it
  • I am using __builtin_clz but I am compiling the code with -mlzcnt because AMD CPUs, as far as I have read, are much slower at running the bsr instruction; with that flag gcc emits lzcnt instead of bsr
  • the very first IF introduces a delay of about 0.30 ns on average, but on average it reduces the time for a first-entry match by 0.6 ns
  • the code is only for 64bit machines
  • at some point I will need to optimize this code for aarch64
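
Regarding the _mm256_maskload_epi32 question above, here is a hedged sketch of what loading only the 6 valid entries of the second block could look like (the helper name is mine, not from the gist). Note that the zeroed lanes would still compare equal when hash happens to be 0, so this alone doesn't remove the need to be careful about lanes 14 and 15 in that case:

#include <immintrin.h>
#include <stdint.h>

static inline __m256i load_last_6(const uint32_t* hashes) {
    // -1 in a lane means "load this element", 0 means "leave it zeroed":
    // entries 8..13 are loaded, the last two lanes stay 0
    const __m256i mask = _mm256_setr_epi32(-1, -1, -1, -1, -1, -1, 0, 0);
    return _mm256_maskload_epi32((const int*) (hashes + 8), mask);
}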

Here is a Godbolt link to the produced assembly: https://godbolt.org/z/5bxbN6

I implemented the SSE version as well (it's in the gist); the logic is the same, although I am not really sure it's worth it performance-wise.

For reference, I built a simple linear-search function and compared performance against it using the google-benchmark library.

int8_t hash32_find_14_loop(uint32_t hash, volatile uint32_t* hashes) {
    // plain linear scan over the 14 entries
    for(uint8_t index = 0; index < 14; index++) {
        if (hashes[index] == hash) {
            return index;
        }
    }

    return -1;
}

The full code is available at this url https://gist.github.com/danielealbano/9fcbc1ff0a42cc9ad61be205366bdb5f

Apart from the flags required by the google-benchmark library, I am compiling it with -mavx2 -mavx -msse4 -O3 -mbmi -mlzcnt

A bench for each element is performed (I wanted to compare the loop vs the alternatives)

----------------------------------------------------------------------------------------------------
Benchmark                                                          Time             CPU   Iterations
----------------------------------------------------------------------------------------------------
bench_template_hash32_find_14_loop/0/iterations:100000000       0.610 ns        0.610 ns    100000000
bench_template_hash32_find_14_loop/1/iterations:100000000        1.16 ns         1.16 ns    100000000
bench_template_hash32_find_14_loop/2/iterations:100000000        1.18 ns         1.18 ns    100000000
bench_template_hash32_find_14_loop/3/iterations:100000000        1.19 ns         1.19 ns    100000000
bench_template_hash32_find_14_loop/4/iterations:100000000        1.28 ns         1.28 ns    100000000
bench_template_hash32_find_14_loop/5/iterations:100000000        1.26 ns         1.26 ns    100000000
bench_template_hash32_find_14_loop/6/iterations:100000000        1.52 ns         1.52 ns    100000000
bench_template_hash32_find_14_loop/7/iterations:100000000        2.15 ns         2.15 ns    100000000
bench_template_hash32_find_14_loop/8/iterations:100000000        1.66 ns         1.66 ns    100000000
bench_template_hash32_find_14_loop/9/iterations:100000000        1.67 ns         1.67 ns    100000000
bench_template_hash32_find_14_loop/10/iterations:100000000       1.90 ns         1.90 ns    100000000
bench_template_hash32_find_14_loop/11/iterations:100000000       1.89 ns         1.89 ns    100000000
bench_template_hash32_find_14_loop/12/iterations:100000000       2.13 ns         2.13 ns    100000000
bench_template_hash32_find_14_loop/13/iterations:100000000       2.20 ns         2.20 ns    100000000
bench_template_hash32_find_14_loop/14/iterations:100000000       2.32 ns         2.32 ns    100000000
bench_template_hash32_find_14_loop/15/iterations:100000000       2.53 ns         2.53 ns    100000000
bench_template_hash32_find_14_sse/0/iterations:100000000        0.531 ns        0.531 ns    100000000
bench_template_hash32_find_14_sse/1/iterations:100000000         1.42 ns         1.42 ns    100000000
bench_template_hash32_find_14_sse/2/iterations:100000000         2.53 ns         2.53 ns    100000000
bench_template_hash32_find_14_sse/3/iterations:100000000         1.45 ns         1.45 ns    100000000
bench_template_hash32_find_14_sse/4/iterations:100000000         2.26 ns         2.26 ns    100000000
bench_template_hash32_find_14_sse/5/iterations:100000000         1.90 ns         1.90 ns    100000000
bench_template_hash32_find_14_sse/6/iterations:100000000         1.90 ns         1.90 ns    100000000
bench_template_hash32_find_14_sse/7/iterations:100000000         1.93 ns         1.93 ns    100000000
bench_template_hash32_find_14_sse/8/iterations:100000000         2.07 ns         2.07 ns    100000000
bench_template_hash32_find_14_sse/9/iterations:100000000         2.05 ns         2.05 ns    100000000
bench_template_hash32_find_14_sse/10/iterations:100000000        2.08 ns         2.08 ns    100000000
bench_template_hash32_find_14_sse/11/iterations:100000000        2.08 ns         2.08 ns    100000000
bench_template_hash32_find_14_sse/12/iterations:100000000        2.55 ns         2.55 ns    100000000
bench_template_hash32_find_14_sse/13/iterations:100000000        2.53 ns         2.53 ns    100000000
bench_template_hash32_find_14_sse/14/iterations:100000000        2.37 ns         2.37 ns    100000000
bench_template_hash32_find_14_sse/15/iterations:100000000        2.59 ns         2.59 ns    100000000
bench_template_hash32_find_14_avx2/0/iterations:100000000       0.537 ns        0.537 ns    100000000
bench_template_hash32_find_14_avx2/1/iterations:100000000        1.37 ns         1.37 ns    100000000
bench_template_hash32_find_14_avx2/2/iterations:100000000        1.38 ns         1.38 ns    100000000
bench_template_hash32_find_14_avx2/3/iterations:100000000        1.36 ns         1.36 ns    100000000
bench_template_hash32_find_14_avx2/4/iterations:100000000        1.37 ns         1.37 ns    100000000
bench_template_hash32_find_14_avx2/5/iterations:100000000        1.38 ns         1.38 ns    100000000
bench_template_hash32_find_14_avx2/6/iterations:100000000        1.40 ns         1.40 ns    100000000
bench_template_hash32_find_14_avx2/7/iterations:100000000        1.39 ns         1.39 ns    100000000
bench_template_hash32_find_14_avx2/8/iterations:100000000        1.99 ns         1.99 ns    100000000
bench_template_hash32_find_14_avx2/9/iterations:100000000        2.02 ns         2.02 ns    100000000
bench_template_hash32_find_14_avx2/10/iterations:100000000       1.98 ns         1.98 ns    100000000
bench_template_hash32_find_14_avx2/11/iterations:100000000       1.98 ns         1.98 ns    100000000
bench_template_hash32_find_14_avx2/12/iterations:100000000       2.03 ns         2.03 ns    100000000
bench_template_hash32_find_14_avx2/13/iterations:100000000       1.98 ns         1.98 ns    100000000
bench_template_hash32_find_14_avx2/14/iterations:100000000       1.96 ns         1.96 ns    100000000
bench_template_hash32_find_14_avx2/15/iterations:100000000       1.97 ns         1.97 ns    100000000

Thanks for any suggestion!

--- UPDATE

I have updated the gist with the branchless implementation made by @chtz and replaced __lzcnt32 with _tzcnt_u32. I had to slightly change the behaviour to treat a return value of 32 as not-found instead of -1, but that doesn't really matter.
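
For reference, a sketch of what that change looks like (based on chtz's answer below with _tzcnt_u32 swapped in; the exact code in the gist may differ slightly):

#include <immintrin.h>
#include <stdint.h>

uint8_t hash32_find_14_avx2_tzcnt(uint32_t hash, const uint32_t* hashes) {
    uint32_t compacted_result_mask = 0;
    __m256i cmp_vector = _mm256_set1_epi32(hash);
    // same overlapping loads as in the answer: entries 0..7 and 6..13
    for (uint8_t base_index = 0; base_index < 12; base_index += 6) {
        __m256i ring_vector = _mm256_loadu_si256((const __m256i*) (hashes + base_index));
        __m256i result_mask_vector = _mm256_cmpeq_epi32(ring_vector, cmp_vector);
        compacted_result_mask |= _mm256_movemask_ps(_mm256_castsi256_ps(result_mask_vector)) << base_index;
    }
    // _tzcnt_u32(0) == 32, which is treated as "not found"
    return (uint8_t) _tzcnt_u32(compacted_result_mask);
}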

The CPU on which the benchmarks ran is an Intel Core i7-8700 (6c/12t, 3.20 GHz).

The benchmark uses CPU pinning, uses more threads than there are physical or logical CPU cores, and performs some additional operations (specifically a for loop), so there is overhead; but it is the same for both tests, so it should impact them in the same way.

If you want to run the test you need to manually tune CPU_CORE_LOGICAL_COUNT to match the number of logical CPU cores of your CPU.

It's interesting to see how the performance improvement jumps from +17% to +41% when there is more contention (from single thread to 64 threads). I have run a few more tests with 128 and 256 threads, seeing up to a +60% speed improvement when using AVX2, but I haven't included those numbers below.

(bench_template_hash32_find_14_avx2 is benchmarking the branchless version; I have shortened the name to make the post more readable)

------------------------------------------------------------------------------------------
Benchmark                                                                 CPU   Iterations
------------------------------------------------------------------------------------------
bench_template_hash32_find_14_loop/iterations:10000000/threads:1      45.2 ns     10000000
bench_template_hash32_find_14_loop/iterations:10000000/threads:2      50.4 ns     20000000
bench_template_hash32_find_14_loop/iterations:10000000/threads:4      52.1 ns     40000000
bench_template_hash32_find_14_loop/iterations:10000000/threads:8      70.9 ns     80000000
bench_template_hash32_find_14_loop/iterations:10000000/threads:16     86.8 ns    160000000
bench_template_hash32_find_14_loop/iterations:10000000/threads:32     87.3 ns    320000000
bench_template_hash32_find_14_loop/iterations:10000000/threads:64     92.9 ns    640000000
bench_template_hash32_find_14_avx2/iterations:10000000/threads:1      38.4 ns     10000000
bench_template_hash32_find_14_avx2/iterations:10000000/threads:2      42.1 ns     20000000
bench_template_hash32_find_14_avx2/iterations:10000000/threads:4      46.5 ns     40000000
bench_template_hash32_find_14_avx2/iterations:10000000/threads:8      52.6 ns     80000000
bench_template_hash32_find_14_avx2/iterations:10000000/threads:16     60.0 ns    160000000
bench_template_hash32_find_14_avx2/iterations:10000000/threads:32     62.1 ns    320000000
bench_template_hash32_find_14_avx2/iterations:10000000/threads:64     65.8 ns    640000000
Daniele Salvatore Albano
  • `_mm256_stream_load_si256`? Is your data in video RAM, or are you somehow mapping memory pages as WC, instead of the normal WB cacheable? If not, then `vmovntdqa` loads are just slow versions of normal loads. Also, use `_mm256_movemask_ps`, or `packssdw` / `packsswb` your dword vectors together before movemask_epi8, so you get more data per branch. – Peter Cordes May 31 '20 at 21:35
  • @PeterCordes No, my data is definitely not coming from video memory, I will immediately switch to the loadu intrinsic. Thanks for the hints! – Daniele Salvatore Albano May 31 '20 at 22:18
  • `__builtin_clz` is undefined for 0, in fact gcc happily optimizes `31 - __builtin_clz(x)` to a `bsr` (which is undefined for zero input as well). – chtz May 31 '20 at 23:05
  • @chtz the code will never hit that condition because of the following check `if (compacted_result_mask != 0) {` – Daniele Salvatore Albano May 31 '20 at 23:08
  • You are right (I just noted this while implementing a branchless version, which I'll post as an answer in a moment). – chtz May 31 '20 at 23:12
  • Since you want the leading-zero count, you might want `_lzcnt_u32` instead of the GNU C builtin. I think all AVX2 machines also have `lzcnt` (and the rest of BMI1), so you're not missing out on anything by requiring BMI1 as well. Unless you really wanted `32 - clz` instead of `31-clz` – Peter Cordes May 31 '20 at 23:13
  • N.B.: If I understand your benchmark code correctly, your `loop` results are very likely flawed, since testing for the same index every time will give you near-perfect branch predictions (actually also in your branching avx2 code). Unless, of course, you actually expect that behavior in practice. – chtz Jun 01 '20 at 00:29
  • @chtz yes, I totally get your point, I was trying to identify the differences of the lookup in relation to the position but this is not doing a proper comparison, I will review it. Thanks for all these super useful hints and your help! – Daniele Salvatore Albano Jun 01 '20 at 08:48
  • You should compile with `-march=native` on your local machine to set tuning options appropriately, and let the compiler use all your CPU's features (like cmpxchg16b, FMA, and BMI2). – Peter Cordes Jun 02 '20 at 00:21

1 Answer


You can implement this completely without branches by comparing two overlapping parts of your array, bit-ORing the results together, and getting the last bit position with a single lzcnt. Also, using vmovmskps instead of vpmovmskb saves dividing the result by 4 (I'm not sure if this causes any domain-crossing latency, though).

int8_t hash32_find_14_avx2(uint32_t hash, volatile uint32_t* hashes) {
    uint32_t compacted_result_mask = 0;
    __m256i cmp_vector = _mm256_set1_epi32(hash);
    // two overlapping 8-entry loads (entries 0..7 and 6..13) cover all 14 entries
    for(uint8_t base_index = 0; base_index < 12; base_index += 6) {
        __m256i ring_vector = _mm256_loadu_si256((__m256i*) (hashes + base_index));

        __m256i result_mask_vector = _mm256_cmpeq_epi32(ring_vector, cmp_vector);
        // one mask bit per 32-bit element, shifted to its array position and OR-ed in
        compacted_result_mask |= _mm256_movemask_ps(_mm256_castsi256_ps(result_mask_vector)) << (base_index);
    }
    // 31 - lzcnt yields the highest matching index, or -1 when no bit is set
    int32_t leading_zeros = __lzcnt32(compacted_result_mask);
    return (31 - leading_zeros);
}

As Peter already pointed out in the comments, in most cases _mm256_stream_load_si256 is worse than normal loads. Also, be aware that when using unaligned loads with gcc you must compile with -mno-avx256-split-unaligned-load (or in fact just with -march=native) -- see this post for details.

Godbolt-Link with simple test code (note that the behavior would be different for the loop- and the avx2-version, if multiple matching values are in the array): https://godbolt.org/z/2jNWqK

chtz
  • Thanks @chtz, without all these jumps the code is WAY faster! – Daniele Salvatore Albano May 31 '20 at 23:44
  • I have updated the gist with the code you provided for reference, thanks a lot! https://gist.github.com/danielealbano/9fcbc1ff0a42cc9ad61be205366bdb5f – Daniele Salvatore Albano May 31 '20 at 23:53
  • OR to combine the overlapping masks works well, but if your data is 32-byte aligned you could avoid an unaligned load and simply mask away the high 2 bits instead of branching on them like the OP does. It looks like `vpackssdw` isn't worth doing. It would let you do only one movemask and avoid scalar shift / OR, but probably it's in-lane and there's no movemask epi16, only 8 (or 32 / 64 with FP), so you'd need another vector shuffle and more scalar code. Your way has good ILP as well as probably as few uops as any other way, so that's good. – Peter Cordes Jun 01 '20 at 00:13
  • @PeterCordes I was also thinking about something with `vpacksswb`, but indeed this looks not worth it, as the packing happens only in-lane. But it probably is a good alternative for a no-AVX2 version (combining 4 `pcmpeqd` results into a single mask with three `packsswb` and then using a single `pmovmskb`, masking out indexes 14,15 at some point; see the sketch after these comments). – chtz Jun 01 '20 at 00:41
  • Ah yes, 2x packssdw -> packsswb would work well for 16-byte vectors. In that case the last 2 elements could be done with a `movq` load. Or maybe a SSE3 `movddup` load to repeat the same compare in the top 2 elements, in case the hash code can be `0`. Wait a minute, I just realized this is using `lzcnt`/bsr instead of `tzcnt`/bsf, so it's finding the *latest* match, not the earliest, opposite of the scalar code. If we are scanning from the lowest bit, then having bit 14,15 be repeats of 12,13 means tzcnt either stops before them or scans past them because they're 0. – Peter Cordes Jun 01 '20 at 00:50
  • @DanieleSalvatoreAlbano: are you sure you wanted `lzcnt` / `clz` to find the highest set bit, instead of `tzcnt` / `ctz` / BSF to find the lowest / *first* set bit, i.e. scanning in increasing order of array index? Then you just need tzcnt instead of 31-lzcnt. – Peter Cordes Jun 01 '20 at 00:51
  • I guess OP expects no duplicate hashes -- or is fine with returning any valid index in that case (otherwise the original version already would not work correctly, either). The advantage of `lzcnt` vs `tzcnt` is that it is easier to produce a `-1` result if there are no bits set (whether returning `-1`, instead of checking the Z-flag is actually better is another question. That's probably cheaper than doing another compare+branch on the index-number later). – chtz Jun 01 '20 at 01:04
  • As an edge case I expect multiple matches, but my knowledge of AVX2 is close to zero, so I thought it would be better to go one step at a time. This search is paired with a key check (actually just a check on the prefix and a comparison of the length); if that fails I have to repeat the search and exclude the found but non-matching entry, and, looking at the implementation, I am thinking that simply applying a &= mask to compacted_result_mask would let me easily achieve that without impacting performance too much. What do you think? – Daniele Salvatore Albano Jun 01 '20 at 09:12
  • Just to mention it: in case there is a mismatch on the key (because it's different or because of multiple hashes) the test has to be performed again from the beginning, and this would be true for the plain loop version as well, because this is being used with a concurrent hashtable that relies on memory fences and a few atomic ops. – Daniele Salvatore Albano Jun 01 '20 at 09:16
  • @DanieleSalvatoreAlbano: So you need to index something else based on the position of set bits in this? Sounds like you want to loop through the set bits and find their positions, and you will need to branch. It's easier to iterate from lowest to highest set bit, because you can clear the lowest set bit with a single BMI1 instruction, [`mask &= mask-1` i.e. BLSR](https://www.felixcloutier.com/x86/blsr). (And then `tzcnt` that). Oh, just saw your last comment. If you have to redo the compare then maybe you're not doing that. But anyway you probably don't need an actual `-1` and can tzcnt – Peter Cordes Jun 01 '20 at 09:18
  • Thanks! I will try with tzcnt as well! – Daniele Salvatore Albano Jun 01 '20 at 09:22
  • @PeterCordes I updated the tests with yours and chtz suggestions ( https://gist.github.com/danielealbano/9fcbc1ff0a42cc9ad61be205366bdb5f ) to use tzcnt and to try reduce the branching detecting ability of the CPU (but they do an amazing job). I updated the first post as well with the new results. – Daniele Salvatore Albano Jun 01 '20 at 19:28