
EDIT: As Cody Gray pointed out in his comment, profiling with optimization disabled is a complete waste of time. How, then, should I approach this test?


In its XMVectorZero, Microsoft uses _mm_setzero_ps when _XM_SSE_INTRINSICS_ is defined and { 0.0f, 0.0f, 0.0f, 0.0f } when it isn't. I decided to check how big the win is, so I used the following program in Release x86 with Configuration Properties > C/C++ > Optimization > Optimization set to Disabled (/Od).

#include <DirectXMath.h>   // XMVECTOR and the SSE intrinsics
using namespace DirectX;

constexpr __int64 loops = 1000000000;   // 1e9 iterations

inline void fooSSE() {
    for (__int64 i = 0; i < loops; ++i) {
        XMVECTOR zero1 = _mm_setzero_ps();
        //XMVECTOR zero2 = _mm_setzero_ps();
        //XMVECTOR zero3 = _mm_setzero_ps();
        //XMVECTOR zero4 = _mm_setzero_ps();
    }
}
inline void fooNoIntrinsic() {
    for (__int64 i = 0; i < loops; ++i) {
        XMVECTOR zero1 = { 0.f, 0.f, 0.f, 0.f };
        //XMVECTOR zero2 = { 0.f, 0.f, 0.f, 0.f };
        //XMVECTOR zero3 = { 0.f, 0.f, 0.f, 0.f };
        //XMVECTOR zero4 = { 0.f, 0.f, 0.f, 0.f };
    }
}
int main() {
    fooNoIntrinsic();
    fooSSE();
}
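
For reference, the dispatch inside XMVectorZero that prompted this test looks roughly like the sketch below. This is a paraphrase, not the exact header source (the real function also has an ARM NEON path), so treat the details as approximate.

#include <DirectXMath.h>

// Rough paraphrase of XMVectorZero's dispatch -- not verbatim from the header.
inline DirectX::XMVECTOR XMVectorZero_Sketch()
{
#if defined(_XM_NO_INTRINSICS_)
    // No-intrinsics fallback: plain aggregate initialization.
    DirectX::XMVECTORF32 vResult = { { { 0.0f, 0.0f, 0.0f, 0.0f } } };
    return vResult.v;
#elif defined(_XM_SSE_INTRINSICS_)
    // SSE path: a single instruction that zeroes the register.
    return _mm_setzero_ps();
#endif
}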

I ran the program twice: first with only zero1, and a second time with all the lines uncommented. In the first case the intrinsic loses; in the second, the intrinsic is the clear winner. So, my questions are:

  • Why doesn't the intrinsic always win?
  • Is the profiler I used a proper tool for such measurements?

[Profiler screenshots of the two runs]

Yola

1 Answer


Profiling things with optimization disabled gives you meaningless results and is a complete waste of time. If you are disabling optimization because otherwise the optimizer notices that your benchmark actually does nothing useful and is removing it entirely, then welcome to the difficulties of microbenchmarking!

It is often very difficult to concoct a test case that actually does enough real work that it will not be removed by a sufficiently smart optimizer, yet whose cost does not overwhelm and render your results meaningless. For example, a lot of people's first instinct is to print out the incremental results using something like printf, but that's a non-starter because printf is incredibly slow and will absolutely ruin your benchmark. Marking the variable that collects the intermediate values as volatile will sometimes work because it effectively disables load/store optimizations for that particular variable. Although this relies on ill-defined semantics, that's not important for a benchmark. Another option is to perform some pointless yet relatively cheap operation on the intermediate results, like adding them together. This relies on the optimizer not outsmarting you, and in order to verify that your benchmark results are meaningful, you'll have to examine the object code emitted by the compiler and ensure that it is actually doing what you expect. There is no magic bullet for crafting a microbenchmark, unfortunately.
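
As a rough illustration of both sink tricks, here is a minimal sketch; the names kLoops, g_sink, and the two benchmark functions are invented for the example, and the accumulating version deliberately shows why you still have to check the generated code.

#include <DirectXMath.h>
using namespace DirectX;

constexpr __int64 kLoops = 1000000000;

volatile float g_sink;   // volatile sink: stores to it cannot be optimized away

void BenchWithVolatileSink()
{
    for (__int64 i = 0; i < kLoops; ++i)
    {
        XMVECTOR zero = _mm_setzero_ps();   // operation under test
        g_sink = XMVectorGetX(zero);        // observable side effect each iteration
    }
}

float BenchWithCheapWork()
{
    XMVECTOR acc = _mm_setzero_ps();
    for (__int64 i = 0; i < kLoops; ++i)
    {
        XMVECTOR zero = _mm_setzero_ps();   // operation under test
        acc = XMVectorAdd(acc, zero);       // pointless-but-cheap dependent work
    }
    // Return the accumulated value so the loop has a consumer. Adding zero is
    // trivial for the optimizer to see through, which is exactly why the
    // disassembly still has to be inspected.
    return XMVectorGetX(acc);
}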

The best trick is usually to isolate the relevant portion of the code inside of a function, parameterize it on one or more unpredictable input values, arrange for the result to be returned, and then put this function in an external module such that the optimizer can't get its grubby paws on it.
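
A minimal sketch of that shape, assuming two separate translation units and no whole-program optimization (/GL with /LTCG) so the call stays opaque; the file names and the argc-based "unpredictable" input are my own choices for the example:

// zerotest.cpp -- built as its own translation unit
#include <DirectXMath.h>

DirectX::XMVECTOR ZeroThenScale(float unpredictable)
{
    DirectX::XMVECTOR v = _mm_setzero_ps();            // code under test
    return DirectX::XMVectorScale(v, unpredictable);   // result depends on the input
}

// main.cpp -- the optimizer here cannot see into ZeroThenScale
#include <DirectXMath.h>
#include <cstdio>

DirectX::XMVECTOR ZeroThenScale(float unpredictable);

int main(int argc, char**)
{
    // argc is unknown at compile time, so nothing can be constant-folded away.
    DirectX::XMVECTOR r = ZeroThenScale(static_cast<float>(argc));
    std::printf("%f\n", DirectX::XMVectorGetX(r));      // the result is observed once
    return 0;
}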

Since you'll need to look at the disassembly anyway to confirm that your microbenchmark case is suitable, this is often a good place to start. If you are sufficiently competent in reading assembly language, and you have sufficiently distilled the code in question, this may even be enough for you to make a judgment about the efficiency of the code. If you can't make heads or tails of the code, then it is probably sufficiently complicated that you can go ahead and benchmark it.

This is a good example of when a cursory examination of the generated object code is sufficient to answer the question without even needing to craft a benchmark.

Following my advice above, let's write a simple function to test out the intrinsic. In this case, we don't have any input to parameterize upon because the code literally just sets a register to 0. So let's just return the zeroed structure from the function:

DirectX::XMVECTOR ZeroTest_Intrinsic()
{
    return _mm_setzero_ps();
}

And here is the other candidate that performs the initialization the seemingly-naïve way:

DirectX::XMVECTOR ZeroTest_Naive()
{
    return { 0.0f, 0.0f, 0.0f, 0.0f };
}

Here is the object code generated by the compiler for these two functions (it doesn't matter which version, whether you compile for x86-32 or x86-64, or whether you optimize for size or speed; the results are the same):

ZeroTest_Intrinsic
    xorps  xmm0, xmm0
    ret
ZeroTest_Naive
    xorps  xmm0, xmm0
    ret

(If AVX or AVX2 instructions are supported, then these will both be vxorps xmm0, xmm0, xmm0.)

That is pretty obvious, even to someone who cannot read assembly code. They are both identical! I'd say that pretty definitively answers the question of which one will be faster: they will be identical because the optimizer recognizes the seemingly-naïve initializer and translates it into a single, optimized assembly-language instruction for clearing a register.

Now, it is certainly possible that there are cases where this is embedded deep within various complicated code constructs, preventing the optimizer from recognizing it and performing its magic. In other words, the "your test function is too simple!" objection. And that is most likely why the library's implementer chose to explicitly use the intrinsic whenever it is available. Its use guarantees that the code-gen will emit the desired instruction, and therefore the code will be as optimized as possible.

Another possible benefit of explicitly using the intrinsic is to ensure that you get the desired instruction, even if the code is being compiled without SSE/SSE2 support. This isn't a particularly compelling use-case, as I imagine it, because you wouldn't be compiling without SSE/SSE2 support if it was acceptable to be using these instructions. And if you were explicitly trying to disable the generation of SSE/SSE2 instructions so that you could run on legacy systems, the intrinsic would ruin your day because it would force an xorps instruction to be emitted, and the legacy system would throw an invalid operation exception immediately upon hitting this instruction.

I did see one interesting case, though. xorps is the single-precision version of this instruction, and requires only SSE support. However, if I compile the functions shown above with only SSE support (no SSE2), I get the following:

ZeroTest_Intrinsic
    xorps  xmm0, xmm0
    ret
ZeroTest_Naive
    push   ebp
    mov    ebp, esp
    and    esp, -16
    sub    esp, 16

    mov    DWORD PTR [esp],    0
    mov    DWORD PTR [esp+4],  0
    mov    DWORD PTR [esp+8],  0
    mov    DWORD PTR [esp+12], 0
    movaps xmm0, XMMWORD PTR [esp]

    mov    esp, ebp
    pop    ebp
    ret

Clearly, for some reason, the optimizer is unable to apply the optimization to the use of the initializer when SSE2 instruction support is not available, even though the xorps instruction that it would be using does not require SSE2 instruction support! This is arguably a bug in the optimizer, but explicit use of the intrinsic works around it.

Cody Gray - on strike
  • When people make microbenchmarks of a single loop-body by adding results in a loop, they usually end up benchmarking throughput, not latency. Depending on how your real program uses the operation you're tuning, that might be the wrong thing to optimize for. Modern out-of-order-execution CPUs make microbenchmarking *hard*. Once you're CPU-bound without branch mispredicts, there are three simple dimensions to consider: frontend bottlenecks (mostly just fused-domain uop count), execution-unit bottlenecks (OOO throughput), and latency bottlenecks. http://stackoverflow.com/a/40879258/224132 – Peter Cordes Dec 21 '16 at 19:24
  • If you are explicitly enabling SSE or SSE2 with MSVC then you must be compiling in 32-bit mode. In my experience Microsoft has put much less effort into optimization for 32-bit mode. I have answered multiple questions on SO where MSVC did something stupid in 32-bit mode but made the optimal choice in 64-bit mode. Unfortunately, MSVC defaults to 32-bit mode even on a 64-bit OS, so you have to explicitly enable 64-bit mode. If Microsoft defaulted to 64-bit mode on 64-bit Windows there would be far fewer questions about SIMD with MSVC on SO. – Z boson Dec 22 '16 at 09:08
  • Yes, I'm talking about 32-bit code here. That is what was discussed in the original question. It is a non-issue with 64-bit, as SSE2 is always supported. The 32-bit optimizer is much older code, and often does things that seem "stupid" today, but made sense a long time ago. The 64-bit optimizer seems to have been largely rewritten, if not based on entirely new code. I disagree though that defaulting to 64-bit targets makes sense. There are still a *lot* of 32-bit processors out there. And it violates the principle of least surprise to base it on your development machine. – Cody Gray - on strike Dec 22 '16 at 09:10
  • Just because I'm developing on an AMD processor doesn't mean I want all of my builds tuned for AMD, etc. And for what it's worth, I haven't seen that much outright stupidity from the 32-bit compiler, compared to the 64-bit compiler, and I've done lots of side-by-side comparisons. And the last update of MSVC 2015 did quite a bit of work further optimizing the 32-bit compiler, bringing it up to date with many of the optimization enhancements that were included in the 64-bit compiler shipped with VS 2010. – Cody Gray - on strike Dec 22 '16 at 09:12