6

I was intrigued by clang's ability to convert many == comparisons of small integers into one big SIMD instruction, but then I noticed something strange. Clang generated "worse" code (in my amateur evaluation) when I had 7 comparisons than when I had 8.

bool f1(short x){
    return (x==-1) | (x == 150) |
           (x==5) | (x==64) | 
           (x==15) | (x==223) | 
           (x==42) | (x==47);
}

bool f2(short x){
    return (x==-1) | (x == 150) |
           (x==5) | (x==64) | 
           (x==15) | (x==223) | 
           (x==42);
}

My question: is this a small performance bug, or does clang have a very good reason for not introducing a dummy comparison (i.e. pretending there is one extra comparison against one of the 7 values, at the cost of one more constant in the code)?
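
For what it's worth, the obvious source-level workaround would be to pad the list to 8 by repeating one value by hand, something like the sketch below (untested; the optimizer might simply fold the duplicated comparison away again before vectorizing):

bool f2_dummy(short x){
    return (x==-1) | (x == 150) |
           (x==5) | (x==64) |
           (x==15) | (x==223) |
           (x==42) | (x==42);   // dummy duplicate of one value
}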

godbolt link here:

# clang(trunk) -O2 -march=haswell
f1(short):
    vmovd   xmm0, edi
    vpbroadcastw    xmm0, xmm0             # set1(x)
    vpcmpeqw        xmm0, xmm0, xmmword ptr [rip + .LCPI0_0]  # 16 bytes = 8 shorts
    vpacksswb       xmm0, xmm0, xmm0
    vpmovmskb       eax, xmm0
    test    al, al
    setne   al           # booleanize the parallel-compare bitmask
    ret

vs.

f2(short):
    cmp     di, -1
    sete    r8b
    cmp     edi, 150
    sete    dl
    cmp     di, 5             # scalar checks of 3 conditions
    vmovd   xmm0, edi
    vpbroadcastw    xmm0, xmm0
    vpcmpeqw        xmm0, xmm0, xmmword ptr [rip + .LCPI1_0]  # low 8 bytes = 4 shorts
    sete    al
    vpmovsxwd       xmm0, xmm0
    vmovmskps       esi, xmm0
    test    sil, sil
    setne   cl                # SIMD check of the other 4
    or      al, r8b
    or      al, dl
    or      al, cl            # and combine.
    ret

quick-bench does not seem to work because I don't know how to pass the -mavx2 flag to it. (Editor's note: simply counting uops for front-end cost shows this is obviously worse for throughput, and for latency as well.)

asked by NoSenseEtAl, edited by Peter Cordes
    Unless you have actual numbers you can't say if it is worse or not. Run the code through a benchmarker and see if it really is slower or not. Then you have a much better question to go forward with. – NathanOliver Sep 23 '19 at 20:16
  • 4
    Is there a reason you're using bitwise OR (`|`) instead of logical OR (`||`) here? – Fred Larson Sep 23 '19 at 20:22
  • If you replace the `|` with `||` (which I am pretty sure is what you want to) you end up with effectively the same code. – SergeyA Sep 23 '19 at 20:28
  • | is for performance; it is an old Agner Fog trick: you do extra work, but sometimes it pays off because it avoids branching (see the short example after these comments). And in any case my question is about my code: clang should optimize it, even if it is not "ordinary". – NoSenseEtAl Sep 23 '19 at 20:31
  • @FredLarson Well, interestingly, if you use logical OR, clang doesn't notice that the comparisons have no side effects (and thus that it needn't worry about evaluating them unnecessarily), so it uses lots of comparisons and jumps. (At least for f1.) – Martin Bonner supports Monica Sep 23 '19 at 20:31
  • 2
    @NathanOliver: did you look at the asm? The code for `f2` basically includes the meat of `f1`, plus more scalar work. (`vpacksswb xmm` has the same cost as `vpmovsxwd xmm` on mainstream Intel and AMD CPUs, like other single-uop shuffles). Clearly a missed optimization vs. just repeating one of the numbers to bring it up to the vector width. This should get reported as a clang/LLVM optimizer bug. (https://bugs.llvm.org/) – Peter Cordes Sep 23 '19 at 20:35
  • 2
    Very interesting code generation! I suppose it could have generated the same code for `f2` as for `f1`, by simply duplicating one of the existing constants in the constant list. But instead they dumbed it down to only using 4 parallel compares and doing the others serially. – Erik Eidt Sep 23 '19 at 20:37
  • @MartinBonner: On the other hand, with logical OR it would perform short circuiting, which could eliminate some of the comparisons. – Fred Larson Sep 23 '19 at 20:51
  • @FredLarson: yes, potentially misleading the compiler away from finding this SIMD optimization that checks them all in parallel (in somewhat more time than one scalar check would take, but much less than 7 or 8 scalar checks). Either way the expressions are equivalent so *in theory* a compiler can optimize either one either way, and we should write in a more idiomatic way that humans are used to, especially if the compiler we care about still handles it well. But anyway, `|` vs. `||` isn't going to be the difference in which SIMD strategy it chooses to use. – Peter Cordes Sep 23 '19 at 21:38
  • 1
  • @PeterCordes I disagree, *potentially*. I was taught in college (10+ years ago) to put the most common case first (when using ||), and with | I am telling the compiler that they are all equally likely. Now, without going through the source or getting some compiler writer to tell us the truth, this is just a guess, but it could be that there is a convention where compilers assume the first conditions are the most likely to be true, and | tells them not to use that heuristic: https://softwareengineering.stackexchange.com/questions/223086/should-if-else-statements-be-arranged-by-rareness-of-cases-or-difficulty-of-deal – NoSenseEtAl Sep 23 '19 at 22:06
  • 1
    @NoSenseEtAl: Well yeah, in cases where they can't all be checked in parallel then you want to put the most common first so short-circuit eval stops sooner. Even when conditions have no side-effects (so they *can* be reordered) compilers do typically check them in source order when they implement it as a sequence of compute something -> conditional branch. So actually you want to order them by some combination of cheapness to check vs. chance of skipping the rest, modulo branch prediction effects if they're all *fairly* cheap. – Peter Cordes Sep 23 '19 at 22:12
  • 1
    I think `|` is more likely to result in branchless code, but I haven't looked at asm for `|` between conditions with gcc/clang. Here you get branchless either way because it's cheap and because you're in a function that returns a boolean, no conditional execution. It doesn't sound accurate to describe it as "disabling that heuristic", but that inaccurate mental model might still get the desired result sometimes. (e.g. branchless combining of multiple conditions, maybe good on some ISAs that can do that more cheaply than x86, and when the combination predicts better than each piece.) – Peter Cordes Sep 23 '19 at 22:14
  • 2
    Bonus question: if you extend `f2` to have an additional comparison: `| (x==VALUE)`, what happens? For `VALUE==41`, clang still generates suboptimal code (it's not like `f1`). For VALUE==10, 34, 40, 43, 46, 58, 106, 170, 298 or 554, it generates another kind of suboptimal code. I suppose that this part of clang is under development, as clang-8 emits different code. – geza Sep 23 '19 at 22:41
  • @geza excellent catch... I would understand it for 41, since it can do smart things for adjacent values, but the others probably mean it is just a buggy optimizer. – NoSenseEtAl Sep 23 '19 at 23:02
  • @geza: xd, yup looks like the range-check optimizer kicks in, and the SIMD compare doesn't consider re-expanding short ranges. Another thing to mention in a bug report that will prod LLVM devs to take another look at this optimizer pass. – Peter Cordes Sep 24 '19 at 02:02
  • gcc uses a bitmap as a look-up table, and the result is much better code than Clang's vectorized version when there are more values to check and they all lie within a small range (under 64): https://godbolt.org/z/fyBkBn – phuclv Nov 20 '19 at 11:04
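
(Editor's note: a minimal example of the `|` vs. `||` distinction discussed in the comments above. The function names are hypothetical, and for trivially cheap, side-effect-free conditions like these a compiler is free to emit identical branchless code for both.)

// Bitwise | must evaluate both operands, so no branches are required.
bool any_bitwise(short x) { return (x == 5) | (x == 64); }

// Logical || must not evaluate the right-hand side when the left is true;
// a compiler may implement that short-circuit with compare-and-jump.
bool any_logical(short x) { return (x == 5) || (x == 64); }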

1 Answer

5

It looks like clang's optimizer didn't think of duplicating an element to bring it up to a SIMD-convenient number of comparisons. But you're right, that would be better than doing extra scalar work. Clearly a missed optimization which should get reported as a clang/LLVM optimizer bug. https://bugs.llvm.org/


The asm for f1() is clearly better than f2(): vpacksswb xmm has the same cost as vpmovsxwd xmm on mainstream Intel and AMD CPUs, like other single-uop shuffles. And if anything, vpmovsx -> vmovmskps could have bypass latency between integer and FP domains¹.


Footnote 1: Probably no extra bypass latency on mainstream Intel CPUs with AVX2 (Sandybridge-family); integer shuffles between FP ops are typically fine, IIRC. (https://agner.org/optimize/). But for an SSE4.1 version on Nehalem, yes there might be an extra penalty the integer version wouldn't have.

You don't need AVX2 for this, but a one-instruction word-broadcast without a pshufb control vector does make it more efficient. (Clang chooses pshuflw -> pshufd for -march=nehalem.)


Of course, both versions are sub-optimal. There's no need to shuffle to compress the compare result before movemask.

Instead of test al, al, it's possible to select which bits you want to check with test sil, 0b00001010 for example, to check bits 1 and 3 but ignore non-zero bits in other positions.

pcmpeqw sets both bytes the same inside a word element so it's fine to pmovmskb that result and get an integer with pairs of bits.

There's also zero benefit to using a byte register instead of a dword register: test sil, sil needs a REX prefix, so it should be avoided in favour of test esi, esi.

So even without duplicating one of the conditions, f2() could be:

f2:
    vmovd           xmm0, edi
    vpbroadcastw    xmm0, xmm0             # set1(x)
    vpcmpeqw        xmm0, xmm0, xmmword ptr [rip + .LCPI0_0]
    vpmovmskb       eax, xmm0
    test    eax, 0b0011111111111111    # (1<<14) - 1 = low 14 bits set (7 words * 2 bytes)
    setne   al
    ret

That test will set ZF according to the low 14 bits of the pmovmskb result, because the higher bits are cleared in the TEST mask. TEST = AND that doesn't write its output. It's often useful for selecting parts of a compare mask.
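
In intrinsics, that masked-test strategy might look like this; a minimal sketch assuming AVX2, with a hypothetical function name and a zero in the unused 8th lane:

#include <immintrin.h>

// One 8-way compare where the 8th lane is a don't-care, masked off
// after pmovmskb instead of being padded with a duplicate value.
bool f2_masked(short x) {
    __m128i v = _mm_set1_epi16(x);                    // vpbroadcastw
    __m128i c = _mm_setr_epi16(-1, 150, 5, 64,
                               15, 223, 42, 0);       // 8th lane ignored
    int mask = _mm_movemask_epi8(_mm_cmpeq_epi16(v, c));
    return (mask & 0x3FFF) != 0;   // (1<<14)-1 = low 14 bits = 7 words * 2 bytes
}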

But since we need a 16-byte constant in memory in the first place, yes we should duplicate one of the elements to pad it up to 8 elements. Then we can use test eax,eax like a normal person. Compressing the mask to fit in 8-bit AL is a total waste of time and code-size. test r32, r32 is just as fast as test r8,r8 and doesn't need a REX prefix for SIL, DIL, or BPL.
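
With intrinsics, a sketch of that padded version (hypothetical name; any of the 7 values works as the duplicate):

#include <immintrin.h>

// Pad the constant list to 8 by duplicating one value, so the whole check
// is one vpcmpeqw + pmovmskb + test eax,eax (compile with -mavx2).
bool f2_padded(short x) {
    __m128i v = _mm_set1_epi16(x);
    __m128i c = _mm_setr_epi16(-1, 150, 5, 64,
                               15, 223, 42, 42);      // 42 duplicated as padding
    return _mm_movemask_epi8(_mm_cmpeq_epi16(v, c)) != 0;
}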

Fun fact: AVX512VL would let us use vpbroadcastw xmm0, edi to combine the movd and broadcast.


Or, to compare only 4 elements: instead of the extra shuffle to feed movmskps, we only need SSE2 here, and this is a case where a mask for test truly is useful.

test_4_possibilities_SSE2:
    movd            xmm0, edi
    pshufd          xmm0, xmm0, 0             # set1_epi32(x)
    pcmpeqw         xmm0, [const]             # == set_epi32(a, b, c, d)
    pmovmskb        eax, xmm0
    test    eax, 0b0001000100010001     # the low bit of each group of 4
    setne   al
    ret

We do a dword broadcast and ignore the compare result in the high 16 bits of each 32-bit element. Using a mask for test lets us do that more cheaply than any extra instruction would.
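
The same thing with SSE2 intrinsics, as a sketch (the four constants here are placeholders for whatever values you're checking against):

#include <emmintrin.h>

// Dword-broadcast x, compare words, and test only the low bit of each
// dword's group of 4 mask bits; the high-word compare results are ignored.
bool test_4_possibilities(short x) {
    __m128i v = _mm_set1_epi32(x);                    // movd + pshufd
    __m128i k = _mm_setr_epi32(5, 64, 15, 223);       // one word per dword element
    int mask = _mm_movemask_epi8(_mm_cmpeq_epi16(v, k));
    return (mask & 0x1111) != 0;                      // low bit of each group of 4
}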

Without AVX2, a SIMD dword broadcast with pshufd is cheaper than needing a word broadcast.

Another option is to imul with 0x00010001 to broadcast a word into a 32-bit register, but that has 3-cycle latency, so it's potentially worse than punpcklwd -> pshufd.
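
The scalar trick on its own, as a sketch:

// Multiplying a zero-extended 16-bit value by 0x00010001 replicates it
// into both halves of a 32-bit register: w * 0x00010001 == (w << 16) | w.
unsigned broadcast_word(unsigned short w) {
    return w * 0x00010001u;        // e.g. 0x002A -> 0x002A002A
}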

Inside a loop, though, it would be worth loading a control vector for pshufb (SSSE3) instead of using 2 shuffles or an imul.
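
For example, a sketch of that: the control vector set1_epi16(0x0100) repeats byte indices 0,1, which broadcasts word element 0.

#include <tmmintrin.h>   // SSSE3

// Broadcast the word in element 0 to all 8 word lanes with one pshufb.
// Hoist ctrl out of the loop so the vector constant is loaded only once.
static inline __m128i broadcast_word_ssse3(__m128i v) {
    const __m128i ctrl = _mm_set1_epi16(0x0100);      // byte pattern 00,01 repeated
    return _mm_shuffle_epi8(v, ctrl);
}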

answered by Peter Cordes