
I may confirm this by using nanobench; today I don't feel clever and can't think of an easy way to test it.

I have an array, `short arr[] = {0x1234, 0x5432, 0x9090, 0xFEED};`. I know I can use SIMD to compare all elements at once, using movemask+tzcnt to find the index of a match. However, since the whole array is only 64 bits, I was wondering if there's a faster way?

First I thought maybe I could build a 64-bit int by writing `target | (target<<16) | (target<<32) | (target<<48)`, but then realized that neither an AND nor a SUB is the same as a compare, since a borrow or carry in the low 16 bits can affect the higher 16. Then I thought that instead of a plain loop I could write `index = tzcnt((target==arr[0] ? 1 : 0) | ... | (target==arr[3] ? 8 : 0))`, as sketched below.
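
Written out, the idea is roughly this (untested sketch; the function name is just for illustration):

#include <immintrin.h>  // _tzcnt_u32 (BMI1); returns 32 when the input is 0

// Untested sketch of the ternary/tzcnt idea above.
int index_of_ternary(short target, const short arr[4])
{
    unsigned bits = (target == arr[0] ? 1u : 0u)
                  | (target == arr[1] ? 2u : 0u)
                  | (target == arr[2] ? 4u : 0u)
                  | (target == arr[3] ? 8u : 0u);
    return _tzcnt_u32(bits);   // 0..3 on a match, 32 if nothing matched
}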

Can anyone think of something more clever? I suspect the ternary method would give me the best results since it's branchless?

  • Why do you think the ternary operator is branchless? – Nelfeal Dec 14 '22 at 19:15
  • I implemented ternary expressions without a branch by hand, and godbolt shows me that clang compiles this without a branch: `return v == 5 ? 1 : v == 8 ? 3 : 99;` Why would you think a ternary isn't branchless when you're not calling a function? –  Dec 14 '22 at 19:20
  • Ternary is _not_ branchless. It is _just_ syntactic sugar for `if/else`. The asm code for a ternary has to branch (or, on x86, it might use a conditional move, e.g. `cmovl`). – Craig Estey Dec 14 '22 at 19:21
  • @Henry I don't see a difference between `return v == 5 ? 1 : v == 8 ? 3 : 99;` and an equivalent version with `if` statements. [Demo](https://godbolt.org/z/h5Mqsx4hz). The fact that a compiler can optimize something to be branchless doesn't mean the statements used are inherently branchless. – Nelfeal Dec 14 '22 at 19:25
  • What are you trying to do? Similar to `memchr` or `memcmp`? For something this short I'd say ordinary insts will be fastest. So, code a loop on `short` values. The compiler may automatically use SIMD if appropriate, just like it does for inline/builtin `mem*` functions. The processor will fetch a cache line, so probably no extra memory fetches. Compiler may even code `scasw` – Craig Estey Dec 14 '22 at 19:27
  • @Nelfeal I said THIS usage of ternary will be branchless. My question is whether there's something more clever I can do to help the optimizer produce better code. I don't understand your last sentence. This is obviously branchless even if cmov isn't available –  Dec 14 '22 at 19:34
  • Both `gcc` and `clang` code up 4 `cmp %di` instructions. – Craig Estey Dec 14 '22 at 19:39

2 Answers


Clang with -O2 or -O3 and GCC with -O3 compile a simple search loop into branchless instructions:

int indexOf(short target, short* arr) {
    int index = -1;
    for (int i = 0; i < 4; ++i) {
        if (target == arr[i]) {
            index = i;
        }
    }
    return index;
}

Demo

I doubt you can get much better without SIMD. In other words, write simple and understandable code to help the compiler produce efficient code.

Side note: for some reason, neither Clang nor GCC use conditional moves on this very similar code:

int indexOf(short target, short* arr) {
    for (int i = 0; i < 4; ++i) {
        if (target == arr[i]) {
            return i;
        }
    }
    return -1;
}
Nelfeal
  • Based on other code I've seen, `gcc` will use conditional moves without any more options if it sees a benefit. – Craig Estey Dec 14 '22 at 19:41
  • @CraigEstey See edit. I have no clue why having the array inside the function made GCC not use conditional moves... – Nelfeal Dec 14 '22 at 19:42
  • I had done my own test with the array as function scoped and global (vs. argument). It did 4 `cmp` with `je` to four separate return blocks for scoped: `cmp $0x1234,%di je L00 cmp $0x5432,%di je L01 ... L00: xor %eax,%eax retq L01: mov $0x1,%eax retq` – Craig Estey Dec 14 '22 at 19:53
  • If I hadn't checked it myself I wouldn't have believed it. On x86-64 this is the SAME performance as Peter's code and SIMD, HOWEVER on Mac it's far slower. Having the array as a global const made both versions use cmov https://godbolt.org/z/9aevbr34W –  Dec 15 '22 at 01:02
  • Strange, now it's about the same when I write my SIMD test. But it looks like it was running on an efficiency core before and not now. I wonder if there's a way I can control that for the test –  Dec 15 '22 at 01:49
  • @Henry: With a global constant array, and a compile-time constant match arg, the checking may optimize away. Even without actually inlining, GCC inter-procedural optimization (IPA) may figure out that the return value is always false and that there are no side-effects, thus avoiding the call. Your 0.25ns per call is 1 cycle per iteration on a 4GHz CPU, and we know neither of these versions can actually run that fast (front-end throughput limits, especially if they don't inline and have to set up their scalar constants every time. Or SIMD port 5 limits on Intel.) – Peter Cordes Dec 15 '22 at 19:15
  • @PeterCordes I ought to look into the nanobench source. *So far*, when it says something is faster than something else I've been able to see the improved performance in my real code. However, it is dubious that an iteration is 0.25 ns; I should see if I can find the real margin of error and why it's reporting something that low. I implemented the SIMD code and saw worse performance, but looking into `perf` I noticed it's because the branch predictor was always right, since I used repetitive values in my test. I kept the SIMD change and made a note to myself to double-check using real code –  Dec 15 '22 at 19:59

For SWAR compare-for-equality, the operation you want is XOR, which like SUB produces all-zero on equal inputs, but unlike SUB doesn't propagate carry sideways.
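
A concrete two-lane illustration (example values of my own, with the needle 0x1234 broadcast into both 16-bit halves of a 32-bit word):

buf              = 0x12340000       (high half matches the needle)
buf - 0x12341234 = 0xFFFFEDCC       the low half borrows from the high half,
                                    so the matching half is no longer zero
buf ^ 0x12341234 = 0x00001234       XOR keeps each 16-bit half independent:
                                    the matching half is exactly zero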

But then you need to detect a run of 16 contiguous zero bits; unlike with pcmpeqw, the non-matching elements will still have some zero bits of their own.

So it's probably about the same as https://graphics.stanford.edu/~seander/bithacks.html#ZeroInWord but with wider mask patterns to operate on 16-bit instead of 8-bit chunks.

There is yet a faster method: use hasless(v, 1), which is defined further down that page; it works in 4 operations and requires no subsequent verification. It simplifies to

#define haszero(v) (((v) - 0x01010101UL) & ~(v) & 0x80808080UL)

The subexpression (v - 0x01010101UL) evaluates to a high bit set in any byte whenever the corresponding byte in v is zero or greater than 0x80. The sub-expression ~v & 0x80808080UL evaluates to high bits set in bytes where the byte of v doesn't have its high bit set (so the byte was less than 0x80). Finally, by ANDing these two sub-expressions the result is the high bits set where the bytes in v were zero, since the high bits set due to a value greater than 0x80 in the first sub-expression are masked off by the second.

This bithack was originally by Alan Mycroft in 1987.
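
A quick worked check of the byte version with example values of my own:

v = 0x12005678                      (one zero byte)
v - 0x01010101  = 0x10FF5577
~v & 0x80808080 = 0x80808080        (no byte of v has its high bit set)
ANDed together  = 0x00800000        (the flag lands in the byte that was zero)

v = 0x12345678  ->  haszero(v) = 0  (no zero byte, every borrow flag is masked off)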

So it could look like this (untested):

#include <stdint.h>
#include <string.h>

// returns 0 / non-zero status.
uint64_t hasmatch_16in64(uint16_t needle, const uint16_t haystack[4])
{
    uint64_t vneedle = 0x0001000100010001ULL * needle;  // broadcast
    uint64_t vbuf;
    memcpy(&vbuf, haystack, sizeof(vbuf));  // aliasing-safe unaligned load
        //static_assert(sizeof(vbuf) == 4*sizeof(haystack[0]));

    uint64_t match = vbuf ^ vneedle;
    uint64_t any_zeros = (match - 0x0001000100010001ULL) & ~match & 0x8000800080008000ULL;
    return any_zeros;
    // unsigned matchpos = _tzcnt_u64(any_zeros) >> 4;  // I think.
}

Godbolt with GCC and clang, also including a SIMD intrinsics version.

# gcc12.2 -O3 -march=x86-64-v3 -mtune=znver1
# x86-64-v3 is the Haswell/Zen1 baseline: AVX2+FMA+BMI2, but with tune=generic
# without tune=haswell or whatever, GCC uses shl/add /shl/add instead of imul, despite still needing the same constant

hasmatch_16in64:
        movabs  rax, 281479271743489       #    0x1000100010001
        movzx   edi, di                    # zero-extend to 64-bit
        imul    rdi, rax                   # vneedle
        xor     rdi, QWORD PTR [rsi]       # match
   # then the bithack
        mov     rdx, rdi
        sub     rdx, rax
        andn    rax, rdi, rdx              # BMI1
        movabs  rdx, -9223231297218904064  # 0x8000800080008000
        and     rax, rdx
        ret

Clang unfortunately adds 0xFFFEFFFEFFFEFFFF instead of reusing the multiplier constant, so it has three 64-bit immediate constants.

AArch64 can do repeating-pattern constants like this as immediates for bitwise ops, and doesn't have as convenient a SIMD movemask, so this might be more of a win there, especially if you can guarantee alignment of your array of shorts.
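
For comparison, here is a minimal NEON sketch of my own (untested, not from the original answer) that exploits the compare result fitting in a single 64-bit register:

#include <arm_neon.h>
#include <stdint.h>

// Untested illustration: 4 x u16 compare on AArch64; the 64-bit compare
// result is moved straight to an integer register, no movemask needed.
int index_of_neon(uint16_t needle, const uint16_t haystack[4])
{
    uint16x4_t v  = vld1_u16(haystack);               // 64-bit load, no alignment requirement
    uint16x4_t eq = vceq_u16(v, vdup_n_u16(needle));  // all-ones in matching lanes
    uint64_t mask = vget_lane_u64(vreinterpret_u64_u16(eq), 0);
    if (mask == 0)
        return -1;                                    // no match
    return __builtin_ctzll(mask) >> 4;                // 16 bits per lane -> lane index
}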


Match position

If you need to know where the match is, I think that bithack has a 1 in the high bit of each zero byte or u16, and nowhere else. (The lowest-precedence / last operations are bitwise AND involving 0x80008000...).

So maybe tzcnt(any_zeros) >> 4 to go from bit-index to u16-index, rounding down. e.g. if the second one is zero, the tzcnt result will be 31. 31 >> 4 = 1.
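
Put together, that could look like this (again untested; the wrapper name is mine):

// Untested: index of the first matching u16, or -1 if none, using hasmatch_16in64 from above.
int matchpos_16in64(uint16_t needle, const uint16_t haystack[4])
{
    uint64_t any_zeros = hasmatch_16in64(needle, haystack);
    if (any_zeros == 0)
        return -1;
    return __builtin_ctzll(any_zeros) >> 4;  // flag is the high bit of the matching u16
}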


If that doesn't work, then yeah, AVX2 or AVX-512 vpbroadcastw xmm0, edi / vmovq / vpcmpeqw / vpmovmskb / tzcnt will work well, too, with smaller code-size and fewer uops, but maybe higher latency. Or maybe lower. (tzcnt on the pmovmskb result gives a byte offset; right-shift it if you need an index of which short.)

Actually just SSE2 pshuflw can broadcast a word to the low qword of an XMM register. Same for MMX, which would actually allow a memory-source pcmpeqw mm0, [rsi] since it has no alignment requirement and is only 64-bit, not 128.

If you can use SIMD intrinsics, especially if you have efficient word broadcast from AVX2, definitely have a look at it.

#include <immintrin.h>

// note the unsigned function arg, not uint16_t;
// we only use the low 16, but GCC doesn't realize that and wastes an instruction in the non-AVX2 version
int hasmatch_SIMD(unsigned needle, const uint16_t haystack[4])
{
#ifdef __AVX2__   // or higher
    __m128i vneedle = _mm_set1_epi16(needle);
#else
    __m128i vneedle =  _mm_cvtsi32_si128(needle);  // movd
    vneedle = _mm_shufflelo_epi16(vneedle, 0);     // broadcast to low half
#endif

    __m128i vbuf = _mm_loadl_epi64((void*)haystack);    // alignment and aliasing safe
    unsigned mask = _mm_movemask_epi8(_mm_cmpeq_epi16(vneedle, vbuf));
    //return _tzcnt_u32(mask) >> 1;
    return mask;
}
# clang expects narrow integer args to already be zero- or sign-extended to 32
hasmatch_SIMD:
        movd    xmm0, edi
        pshuflw xmm0, xmm0, 0                   # xmm0 = xmm0[0,0,0,0,4,5,6,7]
        movq    xmm1, qword ptr [rsi]           # xmm1 = mem[0],zero
        pcmpeqw xmm1, xmm0
        pmovmskb        eax, xmm1
        ret
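
To turn the returned mask into an element index, a caller could do something like this (my untested addition; the helper name is made up, and pmovmskb sets 2 bits per matching u16):

// Untested: convert the pmovmskb result to an element index, -1 if no match.
int index_from_mask(unsigned mask)
{
    if (mask == 0)
        return -1;
    return __builtin_ctz(mask) >> 1;   // 2 mask bits per 16-bit element
}

// e.g. index_from_mask(hasmatch_SIMD(0x9090, arr)) -> 2 for the example array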

AVX-512 gives us vpbroadcastw xmm0, edi, replacing vmovd + vpbroadcastw xmm,xmm or movd + pshuflw, saving a shuffle uop.

With AVX2, this is 5 single-uop instructions, vs. 7 (or 9 counting the constants) for the SWAR bithack. Or 6 or 8 not counting the zero-extension of the "needle". So SIMD is better for front-end throughput. (https://agner.org/optimize/ / https://uops.info/)

There are limits to which ports some of these instructions can run on (vs. the bithack instructions mostly being any integer ALU port), but presumably you're not doing this in a loop over many such 4-element arrays. Or else SIMD is an obvious win; checking two 4-element arrays at once in the low and high halves of a __m128i. So probably we do need to consider the front-end costs of setting up those constants.

I didn't add up the latencies; it's probably a bit higher even on Intel CPUs which generally have good latency between integer and SIMD units.

GCC unfortunately fails to optimize away the movzx edi, di from the SIMD version if compiled without AVX2; only clang realizes the upper 16 of _mm_cvtsi32_si128(needle) is discarded by the later shuffle. Maybe better to make the function arg unsigned, not explicitly a narrow 16-bit type.

Peter Cordes
  • This is the answer I would have written if it weren't already here :-) Alan Mycroft's null-byte finding magic from 1987 (original reference: https://groups.google.com/g/comp.lang.c/c/2HtQXvg7iKc/m/xOJeipH6KLMJ ) is definitely the first thing to try for this kind of problem. – njuffa Dec 14 '22 at 22:26
  • Wow, I thought there was going to be a clever answer but I didn't think it'd be this clever. I'll test it out tonight and reread it a few times to make sure I can understand it and hopefully be more clever –  Dec 14 '22 at 23:18
  • @Henry: Before SIMD was a thing, or on CPUs where it's not good for searching (e.g. because of big stalls when getting data from SIMD to integer regs, like on some ARM CPUs, unlike x86), this kind of SWAR trick was the best way to implement `strlen`, `strcpy`, `memcmp`, and similar loops over byte arrays, doing a whole word at a time. These techniques are somewhat well-known and still exist in glibc source code. (including the portable C fallback: [Why does glibc's strlen need to be so complicated to run quickly?](//stackoverflow.com/q/57650895)) Welcome to the world of bithacks and SWAR. – Peter Cordes Dec 14 '22 at 23:49
  • @Henry: Updated with a SIMD version. It's only 5 uops, vs. 8 for the bithack including the constants. And tzcnt on its result would give you a byte offset you could use directly (if you can get C to trust you that it's a multiple of 2, e.g. `arr[mask>>1]` probably compiles to a right shift and then scaled-index addressing.) – Peter Cordes Dec 15 '22 at 00:20
  • @PeterCordes I made this minutes ago too. I'll look over your SIMD. Surprise results! Your code (slightly modified) is the SAME SPEED as my SIMD. The only difference is the non-SIMD function is 80 bytes while the SIMD one is 48 bytes https://godbolt.org/z/4xc43T76G –  Dec 15 '22 at 00:27
  • I edited in the 128-bit version so it's more like your code. I then added the results to the end. https://godbolt.org/z/e48ojPv7f The mm128 version is 32 bytes instead of 48. However, my real project uses AVX2; I hear I shouldn't use 128-bit SIMD in the same functions as 256-bit SIMD? –  Dec 15 '22 at 00:36
  • @Henry: if you're doing it in a tight loop, obviously you'd want to use SIMD (with `movq` / `movhps` to load two separate 64-bit arrays); it's only worth considering the bithack if you're not doing that. But then the bithack needs to construct those constants every time, so it's a lot worse for a front-end bottleneck. Your benchmark has call/ret overhead for a non-inline function, which may be as much of a throughput limit as the code under test. e.g. https://uica.uops.info/ predicts Skylake will run the SIMD version (with tzcnt/shift) at 2c tput, bottleneck on port 5 from movd+shuffle. – Peter Cordes Dec 15 '22 at 00:38
  • @Henry: You probably read something about [mixing legacy SSE *asm* with 256-bit AVX](https://stackoverflow.com/questions/41303780/why-is-this-sse-code-6-times-slower-without-vzeroupper-on-skylake). Compilers avoid any problems with that (even MSVC these days), it's 100% fine to use `__m128i` in programs that use AVX2. Compilers will use the VEX encoding so there isn't even any legacy SSE for code using `__m128i`, and no need for `vzeroupper` between it and code using `__m256i`. – Peter Cordes Dec 15 '22 at 00:40
  • @PeterCordes as fun as this was, I guess SIMD will always win unless I find myself in a situation where I run out of registers/ports –  Dec 15 '22 at 01:08
  • @Henry: On modern x86, yeah, that's true for this. On Bulldozer-family, the bithack might have some advantages at least for latency. Some AArch64 CPUs might benefit from the bithack, though, especially if not doing it in a loop. (But I think modern AArch64 has efficient SIMD->integer without stalling, for rbit / clz (AArch64 doesn't have tzcnt directly). It can even horizontally narrow a 128-bit compare result to 64-bit, so you could do 2 4-element arrays at once. [Convert vector compare mask into bit mask in AArch64 SIMD or ARM NEON?](https://stackoverflow.com/q/74722950) – Peter Cordes Dec 15 '22 at 01:20