
I'm trying to compare different methods for calculating dot products using SSE intrinsics, but since the methods are only a few cycles long, I have to run the instructions trillions of times for the total to take more than a tiny fraction of a second. The only problem is that gcc with the -O3 flag is "optimizing" my main function into an infinite loop.

My code is

#include <immintrin.h>
#include <stdlib.h>
#include <stdio.h>
#include <time.h>
#include <inttypes.h>
#define NORMAL 0

struct _Vec3 {
    _Alignas(16) float x; // 16-byte alignment so the aligned _mm_load_ps below is safe
    float y;
    float z;
    float w;
};

typedef struct _Vec3 Vec3;

__m128 singleDot(__m128 a, __m128 b) {
    return _mm_dp_ps(a, b, 0b11111111); // high nibble: multiply all 4 lanes; low nibble: broadcast the sum to all 4 lanes
}

int main(int argc, char** argv) {
    for (uint16_t j = 0; j < (1L << 16); j++) {
        for (uint64_t i = 0; i < (1L << 62); i++) {
            Vec3 a = {i, i + 0.5, i + 1, 0.0};
            Vec3 b = {i, i - 0.5, i - 1, 0.0};
            #if NORMAL
            float ans = normalDot(a, b); // naive implementation
            #else
            // float _c[4] = {a.x, a.y, a.z, 0.0};
            // float _d[4] = {b.x, b.y, b.z, 0.0};
            __m128 c = _mm_load_ps((float*)&a);
            __m128 d = _mm_load_ps((float*)&b);
            __m128 ans = singleDot(c, d);
            #endif
        }
    }
}

but when I compile with gcc -std=c11 -march=native -O3 main.c and run objdump -d, it turns main into

0000000000400400 <main>:
  400400:   eb fe                   jmp    400400 <main>

Is there an alternative way to time different approaches?

Peter Cordes
Calvin Godfrey
  • For small code snippets an alternative can be to use static code analyzers such as [Intel Architecture Code Analyzer](https://software.intel.com/content/www/us/en/develop/articles/intel-architecture-code-analyzer.html), or [LLVM Machine Code Analyzer](https://llvm.org/docs/CommandGuide/llvm-mca.html). – chtz May 27 '20 at 09:30

2 Answers


That's because this:

for (uint16_t j = 0; j < (1L << 16); j++) {

is an infinite loop -- the maximum value for a uint16_t is 65535 (2^16 - 1), after which it wraps back to 0, so the test j < (1L << 16) is always true.
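A minimal standalone sketch of the wraparound, just to illustrate (the fix is simply a counter type wide enough for the limit):

#include <stdint.h>
#include <stdio.h>

int main(void) {
    uint16_t j = 65535;     // maximum value of a uint16_t
    j++;                    // unsigned overflow is well-defined: wraps to 0
    printf("%u\n", j);      // prints 0, so j < (1L << 16) can never be false
    for (uint64_t k = 0; k < (1L << 16); k++) { } // fixed: a 64-bit counter terminates
    return 0;
}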

Chris Dodd
  • Oohh, that's embarrassing. Meant that to be a uint64_t – Calvin Godfrey May 27 '20 at 04:48
  • @CalvinGodfrey: Fixing that would still let the compiler optimize away the work of calculating an unused result. And even after fixing that with Google Benchmark's DoNotOptimize empty inline asm or something, you'd be left with the more fundamental problem of whether you're timing throughput or latency, depending on which is more relevant in the real use case. – Peter Cordes May 27 '20 at 04:50
  • The compiler should have warned about this; enable more compiler warnings. – Bernd Elkemann May 27 '20 at 04:51

Even after fixing the uint16_t instead of uint64_t typo that makes your loop infinite, the actual work would still be optimized away because nothing uses the result.

You can use Google Benchmark's DoNotOptimize to stop your unused ans result from being optimized away, e.g. the "Escape" and "Clobber" functions that Q&A asks about. That works in GCC, and the question links to a relevant YouTube video of a clang developer's CppCon talk.
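For reference, a minimal GNU C sketch of those two helpers as presented in the CppCon talk (the "+x" variant at the end is an adaptation for keeping an __m128 result in a register instead of forcing a store):

#include <immintrin.h>

// The optimizer must assume this asm reads *p and may write any memory,
// so everything reachable through p has to actually be computed and stored.
static inline void escape(void *p) {
    __asm__ volatile("" : : "g"(p) : "memory");
}

// The optimizer must assume all memory has been written.
static inline void clobber(void) {
    __asm__ volatile("" : : : "memory");
}

// Usage inside the benchmark loop:
//     __m128 ans = singleDot(c, d);
//     escape(&ans);                  // counts as a use, but adds a store
// or, to keep the result in an xmm register with no store:
//     __asm__ volatile("" : "+x"(ans));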

A worse alternative is to assign the result to a volatile variable. But keep in mind that common-subexpression elimination can still optimize away earlier parts of the calculation, whether you use volatile or an inline-asm macro to make sure the compiler materializes the actual final result somewhere. Micro-benchmarking is hard. You need the compiler to do exactly the amount of work that would happen in the real use-case, but not more.
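A sketch of that volatile approach, assuming a file-scope sink variable (the name sink is made up here):

volatile float sink;             // every assignment to sink must actually happen

// inside the benchmark loop:
__m128 ans = singleDot(c, d);
sink = _mm_cvtss_f32(ans);       // forces the low element of ans to be computed and stored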

See Idiomatic way of performance evaluation? for that and more.


Keep in mind exactly what you're measuring here.

Probably a bunch of loop overhead, and possibly store-forwarding stalls, depending on whether the compiler vectorizes those initializers or not. But even if it does, converting integers to FP and doing 2x SIMD FP additions is comparable in cost to a dpps in terms of throughput. (Which is what you're measuring, not latency; the difference matters a lot on CPUs with out-of-order execution, depending on the context of your real use case.)

Performance is not 1-dimensional at the scale of a couple instructions. Slapping a repeat loop around some work can measure the throughput or latency, depending on whether you make the input dependent on the previous output (a loop-carried dependency chain). But if your work ends up bound on front-end throughput, then loop overhead is an important part. Plus you might end up with effects due to how the machine code for your loop lines up with 32-byte boundaries for the uop cache.
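To illustrate that distinction, a hedged sketch of the two kinds of repeat loop (N and the vecs array are made-up scaffolding, not part of the question's code):

#include <immintrin.h>

#define N 100000000L

// Throughput: every dpps is independent, so out-of-order execution
// can overlap many of them in flight at once.
void measure_throughput(const __m128 *vecs) {   // assumes vecs points to at least 1024 vectors
    for (long i = 0; i < N; i++) {
        __m128 r = _mm_dp_ps(vecs[i & 1023], vecs[(i + 1) & 1023], 0b11111111);
        __asm__ volatile("" : "+x"(r));   // result counts as "used"; no cross-iteration dependency
    }
}

// Latency: each input depends on the previous output (a loop-carried
// dependency chain), so the dpps operations serialize.
void measure_latency(void) {
    __m128 v = _mm_set1_ps(1.0f);
    for (long i = 0; i < N; i++) {
        v = _mm_dp_ps(v, v, 0b11111111);
    }
    __asm__ volatile("" : "+x"(v));       // keep the final result alive
}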

For something this short and simple, static analysis is usually good: count uops for the front-end and ports in the back-end, and analyze latency. See What considerations go into predicting latency for operations on modern superscalar processors and how can I calculate them by hand?. LLVM-MCA can do this for you; so can IACA. You can also measure as part of your real loop that uses dot products.
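For example, one way to feed the compiler's asm output to llvm-mca (the CPU name is just an example; pick whatever you want to model):

gcc -O3 -march=skylake -S main.c -o main.s
llvm-mca -mcpu=skylake -timeline main.s

By default llvm-mca analyzes the whole file; you can bracket just the loop body with # LLVM-MCA-BEGIN / # LLVM-MCA-END comments in the asm.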

See also RDTSCP in NASM always returns the same value for some discussion of what you can measure about a single instruction.


I have to run the instructions trillions of times for it to take more than a tiny fraction of a second

Current x86 CPUs can loop at best one iteration per clock cycle for a tiny loop. It's impossible to write a loop that runs faster than that. 4 billion iterations (in asm) will take at least a whole second on a 4GHz CPU.
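To put the question's loop bound in perspective: the inner loop asks for 2^62 ≈ 4.6 × 10^18 iterations, so even at one iteration per cycle on a 4GHz CPU that's about 1.2 × 10^9 seconds, or roughly 36 years, for a single pass of the outer loop.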

Of course an optimizing C compiler could unroll your loop and be doing as many source iterations as it wants per asm jump.

Peter Cordes