
I hoped a modern C++ compiler would generate the fastest possible machine code. Or are we still stuck with inline assembly in 2016? I need to search for floating-point 3D bounding boxes intersecting one particular box. My idea of an assembly version is:

1. `cmpps` => register A
2. `cmpss` => register B
3. C = A & B
4. convert C into a "normal" register
5. compare it with zero
6. conditionally jump

Even after padding the structure with a 4th float, both GCC 4.4 and Visual Studio Community 2015 generate only one float compare at a time, using the comiss instruction. Is a particular C++ expression order required, or are these compilers simply unable to optimize this on their own?

My test case:

struct Vec
{
    float x,y,z,w;
};

struct BBox
{
    Vec min,max;
};

bool bbox(BBox& b, Vec& v)
{
    return
        b.min.x <= v.x && v.x <= b.max.x &&
        b.min.y <= v.y && v.y <= b.max.y &&
        b.min.z <= v.z && v.z <= b.max.z &&
        b.min.w <= v.w && v.w <= b.max.w;
}

int main()
{
    BBox b;
    Vec v;
    return bbox(b, v);
}
martins
  • Current GCC is [GCC 5](http://gcc.gnu.org/gcc-5/).3 in Feb. 2016. And how do you invoke it? Did you try `gcc -ffast-math -O3 -mcpu=native`? – Basile Starynkevitch Feb 29 '16 at 06:35
  • Thanks for the quick reply. I invoked GCC with just the -S parameter. I only have GCC 4 on Debian. Now I tried it with GCC 5.2 on the latest Ubuntu. Parameter -ffast-math only changed ucomiss to comiss. The compares are still scalar. – martins Feb 29 '16 at 06:49
  • Please post a [mcve]. – Paul R Feb 29 '16 at 06:58
  • BTW, why do you care about some particular machine code... What matters is the performance... With only `gcc -S` no optimization is done. Try `gcc -S -fverbose-asm -ffast-math -mcpu=native -O3` if you want to look into the generated assembler code. – Basile Starynkevitch Feb 29 '16 at 07:17
  • @Basile S. I've tried gcc -ffast-math -O3 -mcpu=native as you suggested. – martins Feb 29 '16 at 07:59
  • @Paul R. The code in the question has been updated. – martins Feb 29 '16 at 08:00
  • http://gcc.godbolt.org/ – bolov Feb 29 '16 at 08:04
  • Inline assembly hasn't been needed for years. You can use [intrinsics](http://stackoverflow.com/tags/intrinsics/info) to write code that's portable to all major compilers for x86. This one looks like a good case for `cmpltps`, `andps`, `movmskps`, and then testing the integer result to make sure it has all three bits set (mask off the bit for comparing the padding, and test `== 0b111`). It looks like this single function doesn't auto-vectorize, unfortunately. It might possibly in a loop, but still probably not. – Peter Cordes Feb 29 '16 at 08:09
  • @Basile: You might be right about performance. Initially I hoped for a single "check bbox" instruction with output to the flags register. A vector compare seems to be the closest match, but 7 instructions are required (2x load, 2x compare, 1x and, 1x conversion, 1x ALU compare), and they cannot run in parallel because each depends on the previous result. The GCC and MSVC solution may actually be faster: there are 24 instructions (6x load, 6x comiss, 6x conditional jump) for a 3D box, but they could run in just 3 cycles if the CPU had six SSE execution units. comiss has the advantage of writing the flags directly. – martins Feb 29 '16 at 08:13
  • I suggest profiling it, or at least running a little performance test in a sandbox. In tests I ran a while back, I saw little to no speed increase out of using the packed comparison instructions. It was fastest to use a (non-branching) series of individual comparisons. The compilers probably know this; it is exactly what they do if you change the short-circuiting `&&` to a bitwise `&`. – Cody Gray Feb 29 '16 at 08:15
  • Highly related: see my answer on http://stackoverflow.com/questions/35445192/is-it-possible-to-check-if-any-of-2-sets-of-3-ints-is-equal-with-less-than-9-com for an example of vectorizing a compare. If you're only branching on the comparison, latency only adds to the branch mispredict penalty. It might not ever become a data dependency. 6x conditional branches sucks a lot, and `comiss` is a 2-uop instruction on SnB/IvB. The two vector loads can run in parallel (one of them fused into the compare insn). – Peter Cordes Feb 29 '16 at 08:15
  • @Peter Cordes: I'm aware of intrinsics and I always use them when possible. In my question I counted intrinsics as assembly, because they are partially machine-specific and they are compiler-specific. For me, intrinsics are just nicer assembly. – martins Feb 29 '16 at 08:16
  • Aside from the question whether cmpps is faster, your code has undefined behavior, as it doesn't initialize `b` and `v`. But what I find most depressing is that even then, the compiler doesn't just compile the whole test case away, as the result should be known at compile time. – MikeMB Feb 29 '16 at 08:18
  • @MikeMB: I didn't even look at the code for `main()`. I looked at the code for `bbox()`. Never use `main` to see how gcc compiles something: it marks it "cold" and optimizes less. Just write functions that take input parameters and return a result. – Peter Cordes Feb 29 '16 at 08:46
  • @PeterCordes: It isn't really relevant for the question, but I noticed it while comparing the output of different compilers (clang does eliminate the whole function call). Also I think that if one provides an mcve at all, it should be free of unrelated bugs if possible; hence my previous comment. Great answer, by the way. – MikeMB Feb 29 '16 at 22:49
  • As we're talking about code elimination: I've tested templates and a switch for deserialization of proprietary network messages. It was similar to Win32 event handling. There is one base class with lots of empty handlers, and one template function with a huge switch that calls handler methods based on the message code. Each case statement casts the message pointer to the appropriate structure and then calls some method of the supplied handler object. The handler class implements a few methods and inherits the rest from the base (the base methods are empty). – martins Mar 01 '16 at 17:11
  • Both G++ and MSVC worked as I wished: no empty methods were generated. But MSVC generated empty switch cases in the jump table, and these were all equal pointers. From a C++ point of view, MSVC made the empty cases just breaks. G++ only generated the meaningful cases, so in my eyes G++ is the clear winner over M$. – martins Mar 01 '16 at 17:11
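
The non-branching series of comparisons described in the comments above (bitwise `&` instead of short-circuiting `&&`) can be sketched against the question's test case like this (a sketch; `bbox_branchless` is a name invented here, not from the discussion):

```cpp
struct Vec  { float x, y, z, w; };
struct BBox { Vec min, max; };

// Same comparisons as the question's bbox(), but joined with bitwise &
// instead of short-circuiting &&: all eight compares are evaluated
// unconditionally, so the compiler is free to emit a branchless sequence
// instead of six conditional jumps.
bool bbox_branchless(const BBox& b, const Vec& v)
{
    return (b.min.x <= v.x) & (v.x <= b.max.x)
         & (b.min.y <= v.y) & (v.y <= b.max.y)
         & (b.min.z <= v.z) & (v.z <= b.max.z)
         & (b.min.w <= v.w) & (v.w <= b.max.w);
}
```

With optimizations enabled, compilers generally turn this into straight-line compare/set/and code with a single branch at the call site, which is the behavior the comment describes.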

2 Answers


It's too bad gcc / clang don't autovectorize this, because it's pretty easy (Godbolt - clang 3.7):

#include <immintrin.h>

// clang doesn't let this be a constexpr to mark it as a pure function :/
bool bbox(const BBox& b, const Vec& v)
{
    // if you can guarantee alignment, then these can be load_ps
    // saving a separate load instruction for SSE.  (AVX can fold unaligned loads)
    // maybe make Vec a union with __m128
    __m128 blo = _mm_loadu_ps(&b.min.x);
    __m128 bhi = _mm_loadu_ps(&b.max.x);
    __m128 vv  = _mm_loadu_ps(&v.x);

    blo = _mm_cmple_ps(blo, vv);
    bhi = _mm_cmple_ps(vv, bhi);
    __m128 anded = _mm_and_ps(blo, bhi);

    int mask = _mm_movemask_ps(anded);
    // mask away the result from the padding element,
    // check that all the bits are set
    return (mask & 0b0111) == 0b0111;
}

This compiles to

    movups  xmm0, xmmword ptr [rdi]
    movups  xmm1, xmmword ptr [rdi + 16]
    movups  xmm2, xmmword ptr [rsi]
    cmpleps xmm0, xmm2
    cmpleps xmm2, xmm1
    andps   xmm2, xmm0
    movmskps        eax, xmm2
    and     eax, 7
    cmp     eax, 7
    sete    al
    ret
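
The aligned-load suggestion from the code comments above could look something like this (a sketch; `bbox_aligned` and the `alignas(16)` struct are additions made here, not part of the original answer):

```cpp
#include <immintrin.h>  // SSE intrinsics

// Force 16-byte alignment on Vec so the aligned _mm_load_ps is safe.
// This lets the compiler fold a load into the cmpps instruction even
// without AVX, saving a separate movups.
struct alignas(16) Vec  { float x, y, z, w; };
struct BBox { Vec min, max; };

bool bbox_aligned(const BBox& b, const Vec& v)
{
    __m128 blo = _mm_load_ps(&b.min.x);   // aligned loads
    __m128 bhi = _mm_load_ps(&b.max.x);
    __m128 vv  = _mm_load_ps(&v.x);

    __m128 anded = _mm_and_ps(_mm_cmple_ps(blo, vv), _mm_cmple_ps(vv, bhi));
    // mask away the padding element's bit, require the other three set
    return (_mm_movemask_ps(anded) & 0b0111) == 0b0111;
}
```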

If you invert the sense of the comparison (`cmpnle`) to test for being outside the bounding box on any axis, you could do something like

int mask1 = _mm_movemask_ps(blo);
int mask2 = _mm_movemask_ps(bhi);
return !(mask1 | mask2);

which might compile to

movmskps
movmskps
or
setnz

So the integer test is cheaper, and you replace a vector AND with another movmsk (about equal cost).

I was thinking for a while that doing it that way would mean a NaN counted as inside the box, but actually cmpnleps is true when one of the operands is NaN. (cmpleps is false in that case, so it really is the opposite.)

I haven't thought through what happens to the padding in this case. It might end up being !((mask1|mask2) & 0b0111), which is still more efficient for x86, because the test instruction does an AND for free, and can macro-fuse with a branch instruction on Intel and AMD.
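
Putting the inverted-sense pieces together with that padding mask, a complete function might look like this (an assembly of the fragments above made for illustration; `bbox_inverted` is a hypothetical name):

```cpp
#include <xmmintrin.h>  // SSE: _mm_loadu_ps, _mm_cmpnle_ps, _mm_movemask_ps

struct Vec  { float x, y, z, w; };
struct BBox { Vec min, max; };

// True when v is inside b: test "outside on any axis" with the inverted
// comparison (cmpnleps), then require all three relevant bits to be clear.
// A NaN coordinate makes cmpnleps true, so NaN counts as outside the box.
bool bbox_inverted(const BBox& b, const Vec& v)
{
    __m128 blo = _mm_loadu_ps(&b.min.x);
    __m128 bhi = _mm_loadu_ps(&b.max.x);
    __m128 vv  = _mm_loadu_ps(&v.x);

    __m128 below = _mm_cmpnle_ps(blo, vv);  // per element: b.min > v, or NaN
    __m128 above = _mm_cmpnle_ps(vv, bhi);  // per element: v > b.max, or NaN

    int mask1 = _mm_movemask_ps(below);
    int mask2 = _mm_movemask_ps(above);
    return !((mask1 | mask2) & 0b0111);     // ignore the padding element's bit
}
```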

movmskps is 2 m-ops and high-latency on AMD, but using vectors is probably still a win. Two movmskps instructions might be slightly worse on AMD than the code I posted first, but movmskps is pipelined, so both can be in flight once the cmpps instructions finish.

Peter Cordes

If you really want precisely the CMPPS machine instruction, use the __builtin_ia32_cmpps builtin.

But I think you probably don't need to do so. Trust your compiler and ask for many optimizations with gcc -ffast-math -march=native -O3; perhaps add some other optimization flags and consider link-time optimization, e.g. compiling and linking with gcc -flto -ffast-math -march=native -O3

If you have weeks to lose on hand-tuned optimizations (is your time really so cheap that it's worth the effort?), I would suggest tuning the cache prefetching by carefully and cleverly adding a few __builtin_prefetch calls (but then, benchmark carefully with runs lasting more than a second, and add very few prefetches!).

Hand-written and premature optimizations are often useless and counterproductive.

BTW, compiler writers are making small but continuous progress. Testing with the latest versions of GCC and Clang/LLVM is probably a better way to spend your time.

Don't forget the typical order of magnitude for timing of operations in a PC.

Basile Starynkevitch