
I used llvm-mca to compute the total cycles of a piece of code, thinking they would predict its runtime. However, measuring the runtime dynamically showed almost no correlation. So: why do the total cycles computed by llvm-mca not accurately predict the runtime? Can I predict the runtime in some better way with llvm-mca?


Details:

I wanted to know the runtime of the following code for different types of begin (and end) iterators, for startValue being 0.0 or 0ULL:

std::accumulate(begin, end, startValue)

To predict the runtimes, I used the Compiler Explorer (https://godbolt.org/z/5HDzSF) with its LLVM Machine Code Analyzer (llvm-mca) plugin, since llvm-mca is "a performance analysis tool that uses information available in LLVM (e.g. scheduling models) to statically measure the performance". I used the following code:

using vec_t = std::vector<double>;

vec_t generateRandomVector(vec_t::size_type size)
{
    std::random_device rnd_device;
    std::mt19937 mersenne_engine {rnd_device()};
    std::uniform_real_distribution dist{0.0,1.1};
    auto gen = [&dist, &mersenne_engine](){
        return dist(mersenne_engine);
    };
    vec_t result(size);
    std::generate(result.begin(), result.end(), gen);
    return result;
}

double start()
{
    vec_t vec = generateRandomVector(30000000);
    vec_t::iterator vectorBegin = vec.begin();
    vec_t::iterator vectorEnd = vec.end();
    __asm volatile("# LLVM-MCA-BEGIN stopwatchedAccumulate");
    double result = std::accumulate(vectorBegin, vectorEnd, 0.0);
    __asm volatile("# LLVM-MCA-END");    
    return result;
}
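
The runtimes below come from a dynamic wall-clock measurement; a minimal sketch of such a measurement with std::chrono (illustrative only, not necessarily the exact harness behind the numbers below; it reuses vec_t and generateRandomVector from above):

#include <chrono>
#include <iostream>
#include <numeric>

// Hypothetical timing harness: measures only the accumulate call, in milliseconds.
int main()
{
    vec_t vec = generateRandomVector(30000000);
    auto t0 = std::chrono::steady_clock::now();
    double result = std::accumulate(vec.begin(), vec.end(), 0.0);
    auto t1 = std::chrono::steady_clock::now();
    auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count();
    std::cout << "sum = " << result << ", time = " << ms << " ms\n";
    return 0;
}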

However, I see no correlation between the total cycles computed by llvm-mca and the wall clock time from running the corresponding std::accumulate. For example, in the code above, the Total Cycles are 2806 and the runtime is 14 ms. When I switch to the start value 0ULL, the Total Cycles are 2357, but the runtime is 117 ms.

DaveFar
  • Total CPU cycles stopped directly corresponding to CPU time around the 1990s. There's much more at play nowadays that will impact the time to actually run a piece of code, e.g. caching, branch prediction, external influences etc. Such a static analysis would require not just the code, but also the exact CPU model the code is going to run on to even have a chance of being accurate. – Bartek Banachewicz Jan 09 '19 at 09:58
  • Good to know, Bartek (+1). Because the CPU frequency can change, or why? Not directly corresponding is one thing, but no correlation at all (see the numbers in my details) was kind of shocking... – DaveFar Jan 09 '19 at 10:01
  • Performance of native code is a very complex topic nowadays. You'd need to look around for modern resources regarding optimizations; I think looking at cache-friendly data layout is a good, practical starting point; virtually identical programs (most probably with exactly the same instruction length) differing just by the flip of two loops can have vastly different running times. – Bartek Banachewicz Jan 09 '19 at 10:03
  • Isn't the focus on `std::vector` and `std::accumulate` sufficient to have a cache-friendly data layout? – DaveFar Jan 09 '19 at 10:04
  • So it seems the subject is just extremely complex, but I did not misunderstand llvm-mca, i.e. focusing on some other output value that is a more accurate prediction. Right? – DaveFar Jan 09 '19 at 10:07
  • In this specific example, did you just change the value to `0ULL`, or did you also change the type of vector to `unsigned long long`? I am pretty sure the "total cycles" count could well be correct, and not useful at the same time, yes. – Bartek Banachewicz Jan 09 '19 at 10:07
  • Nope, I just sum into `unsigned long long` with an implicit conversion each iteration; the `vector::value_type` stays `double`. – DaveFar Jan 09 '19 at 10:09
  • Well, okay, in that case it might have to do with the fact that the compiler has to do a lot of conversions; seems that the `ull` version is actually being picked there. [Those two calls to double conversion seem particularly bad](https://godbolt.org/z/FeBUzn), but that's just a guess. What compiler did you build this with to benchmark? – Bartek Banachewicz Jan 09 '19 at 10:17
  • Thank you so much Bartek for looking thoroughly into this. What do you mean by "particularly bad"? Making runtime much slower, but not strongly increasing the CPU cycles would explain the missing correlation. When I look at your example with llvm-mca, including the timeline view (see https://godbolt.org/z/tTCyRW), I find no indication for much slower runtime. Do you find it somewhere in llvm-mca's output? – DaveFar Jan 09 '19 at 10:28
  • Nah, that was just a guess; we're looking at a very sensitive loop body, and in there, sandwiching the most critical operation, are two `call` operations. Those calls (`__ultod3` specifically) *should*, as far as I'm aware, delegate to SSE2 instructions, which are backed by hardware. This is all Microsoft stuff, however, that's why I asked which compiler you used. Even if your compiler uses SIMD for conversions as well, I admit I have absolutely no idea how LLVM-MCA will report such calls (i.e. how many cycles they will count as). – Bartek Banachewicz Jan 09 '19 at 10:36
  • That being said, it seems that [LLVM-MCA](https://reviews.llvm.org/rL331629#change-EYjqdjWuz1Ae) is actually much more potent than I initially thought; having all those processor architectures listed is a good indication they *might* actually interpret that properly. I don't think I can help you further; someone from LLMV-MCA would be best to shine some light on that. – Bartek Banachewicz Jan 09 '19 at 10:42
  • If you look at the bottom of godbolt.org/z/tTCyRW's llvm-mca output, you see `error: invalid instruction mnemonic 'cvtsi2sdq'` and `call instructions are not correctly modeled. Assume a latency of 100cy`. Maybe that is leading to bad measurements. However, in my original example, there are no call instructions in the measured assembly. I will forward this question to someone from LLVM-MCA... – DaveFar Jan 09 '19 at 10:57
  • @BartekBanachewicz: a contiguous array (like `std::vector` or an array) is as good as it gets for data layout. Aligning the data for SIMD can help a bit, depending on the loop. As far as resources for [static analysis of performance](https://stackoverflow.com/q/51607391) of a loop on modern x86, https://stackoverflow.com/tags/x86/info has links to lots of good stuff, especially Agner Fog's microarch guide (http://agner.org/optimize/), which really does detail the pipeline / OoO exec of Intel and AMD CPUs in enough detail to make accurate predictions of performance for a lot of cases. – Peter Cordes Jan 10 '19 at 05:02
  • Total cycles and time are linearly correlated if the CPU is running at constant frequency. Other things, like total (dynamic) instructions, are very much variable for superscalar / out-of-order pipelined CPUs, especially when cache-misses can stall the pipeline for hundreds of cycles. Dynamic CPU frequency (for power-saving at idle / low workload) became a thing in the 2000s, not as early as the 90s. It can be a problem for microbenchmarking if your test doesn't "warm up" the CPU to get it to max turbo first, but that's not the problem here. (LLVM-MCA is analyzing code outside the loop.) – Peter Cordes Jan 10 '19 at 05:12

1 Answer


TL:DR: LLVM-MCA analyzed the entire chunk of code between those comments as if it were the body of a loop, and showed you the cycle count for 100 iterations of all those instructions.

But as well as the actual (tiny) loop, most of the instructions are loop setup and the SIMD horizontal sum after the loop, which actually only run once. (That's why the cycle count is in the thousands, not 400 = 100 times the 4-cycle latency of vaddpd on Skylake for the 0.0 version with a double accumulator.)

If you uncheck the "//" box on the Godbolt compiler explorer, or modify the asm statements to add a nop like "nop # LLVM-MCA-END", you'll be able to find those lines in the asm window and see what LLVM-MCA was looking at as its "loop".
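
For example, the marker statements from the question could be changed to emit a real instruction next to each comment (a sketch; the nop only serves to make the markers easy to locate and does not meaningfully change the measured code):

    __asm volatile("nop # LLVM-MCA-BEGIN stopwatchedAccumulate");
    double result = std::accumulate(vectorBegin, vectorEnd, 0.0);
    __asm volatile("nop # LLVM-MCA-END");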


LLVM MCA simulates a specified sequence of assembly instructions and calculates the number of cycles it is estimated to take to execute per iteration on the specified target architecture. LLVM MCA makes a number of simplifications, such as (off the top of my head): (1) it assumes that all conditional branches fall through, (2) it assumes that all memory accesses are of the Write Back memory type and all hit in the L1 cache, (3) it assumes that the frontend works optimally, and (4) call instructions are not followed into the called procedure; they just fall through. There are other assumptions as well that I can't recall at the moment.

Essentially, LLVM MCA (like Intel IACA) is only accurate for backend-compute-bound, simple loops. In IACA, while most instructions are supported, a few instructions are not modeled in detail. As an example, prefetch instructions are assumed to consume only microarchitectural resources but take basically zero latency and have no impact on the state of the memory hierarchy. It appears to me that MCA completely ignores such instructions, however. Anyway, this is not particularly relevant to your question.

Now back to your code. In the Compiler Explorer link you provided, you didn't pass any options to LLVM MCA, so the default target architecture takes effect, which is whatever architecture the tool is running on. This happens to be SKX. The total number of cycles you mentioned is for SKX, but it's not clear whether you ran the code on SKX or not. You should use the -mcpu option to specify the architecture. This is independent of the -march you passed to gcc. Also note that comparing core cycles to milliseconds is not meaningful. You can use the RDTSC instruction to measure the execution time in terms of core cycles.
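
As a rough sketch of what a cycle-based measurement could look like (assuming GCC or Clang on x86 with <x86intrin.h>, reusing vec_t from the question, and with an illustrative name timedAccumulate; note that on recent CPUs the TSC ticks at a fixed reference frequency, so the counts only equal core clock cycles when the core runs at that frequency):

#include <x86intrin.h>   // __rdtsc()
#include <iostream>
#include <numeric>

// Sketch: time the accumulate call in TSC ticks instead of milliseconds.
double timedAccumulate(vec_t::iterator begin, vec_t::iterator end)
{
    unsigned long long t0 = __rdtsc();
    double result = std::accumulate(begin, end, 0.0);
    unsigned long long t1 = __rdtsc();
    std::cout << "TSC ticks: " << (t1 - t0) << '\n';
    return result;
}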

Notice how the compiler has inlined the call to std::accumulate. Apparently, this code starts at assembly line 405 and the last instruction of std::accumulate is at line 444, a total of 38 instructions. The reason why the LLVM MCA estimate does not match the actual performance should now be clear: the tool assumes that all of these instructions are executed in a loop for a large number of iterations. Obviously this is not the case. There is only one loop, at lines 420-424:

.L75:
        vaddpd  ymm0, ymm0, YMMWORD PTR [rax]
        add     rax, 32
        cmp     rax, rcx
        jne     .L75

Only this code should be the input to MCA. At the source code level, there is really no way to tell MCA to only analyze this code. You'd have to manually inline std::accumulate and place the LLVM-MCA-BEGIN and LLVM-MCA-END marks somewhere inside it.
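
One workable sketch is to hand-write the reduction so the markers bracket only the hot loop (the function name accumulateMarked is illustrative; with -ffast-math the compiler may still vectorize and place setup plus a horizontal sum between the markers and the loop, so the emitted asm should be checked in the output window):

// Sketch: manually "inlined" accumulate so the MCA markers bracket only the loop.
double accumulateMarked(vec_t::iterator begin, vec_t::iterator end)
{
    double sum = 0.0;
    __asm volatile("nop # LLVM-MCA-BEGIN accumulateLoop");
    for (auto it = begin; it != end; ++it)
        sum += *it;   // the only work MCA should see repeated every iteration
    __asm volatile("nop # LLVM-MCA-END");
    return sum;
}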

When passing 0ULL instead of 0.0 to std::accumulate, the input to LLVM MCA would start at assembly instruction 402 and end at 441. Note that any instructions not supported by MCA (such as vcvtsi2sdq) will be completely omitted from the analysis. The part of the code that is actually in a loop is:

.L78:
        vxorpd  xmm0, xmm0, xmm0
        vcvtsi2sdq      xmm0, xmm0, rax
        test    rax, rax
        jns     .L75
        mov     rcx, rax
        and     eax, 1
        vxorpd  xmm0, xmm0, xmm0
        shr     rcx
        or      rcx, rax
        vcvtsi2sdq      xmm0, xmm0, rcx
        vaddsd  xmm0, xmm0, xmm0
.L75:
        vaddsd  xmm0, xmm0, QWORD PTR [rdx]
        vcomisd xmm0, xmm1
        vcvttsd2si      rax, xmm0
        jb      .L77
        vsubsd  xmm0, xmm0, xmm1
        vcvttsd2si      rax, xmm0
        xor     rax, rdi
.L77:
        add     rdx, 8
        cmp     rsi, rdx
        jne     .L78

Note that there is a conditional jump, jns, in the code whose target address is somewhere in the block. MCA will just assume that the jump falls through. If this were not the case in an actual run of the code, MCA would unnecessarily add the overhead of 7 instructions. There is another jump, jb, but this one I think is not important for large vectors and will fall through most of the time. The last jump, jne, is also the last instruction, so MCA will assume that the next instruction is the top one again. For a sufficiently large number of iterations, this assumption is perfectly fine.

Overall, it's obvious that the first code is much smaller than the second one, so it's probably also much faster. Your measurements confirm this. You also don't really need a microarchitecture analysis tool to understand why: the second code simply does a lot more computation. So you can quickly conclude that passing 0.0 is better in terms of both performance and code size on all architectures.

Peter Cordes
Hadi Brais
  • Not only is the `double` (picked with `0.0`) version faster, it's [actually correct](http://coliru.stacked-crooked.com/a/a267256083954d4f). The intermediate conversions can provide quite unexpected results. – Bartek Banachewicz Jan 09 '19 at 20:32
  • @BartekBanachewicz Note that `-ffast-math` was passed to gcc to generate the code. See [What does gcc's ffast-math actually do?](https://stackoverflow.com/questions/7420665/what-does-gccs-ffast-math-actually-do). I have not examined the code thoroughly, but I do hope that both versions are correct with respect to `-ffast-math` (yes, they may not produce the same result when using `-ffast-math`, that's OK). – Hadi Brais Jan 09 '19 at 20:39
  • @HadiBrais: an easy way to see (on Godbolt) where the start/end markers are: add a `nop`, like `nop # LLVM-MCA-END`, into the asm statement. Unchecking the "//" box on Godbolt works too, so you can see the comment lines, but having an actual `nop` makes the "scroll to assembly" right-click menu option work. – Peter Cordes Jan 10 '19 at 05:19
  • LLVM-MCA isn't even accurate for simple backend-bound compute loops, because it has incorrect parameters like "width = 6" for Skylake, so it thinks Skylake can execute 6 nops (or other mixes of instructions) in 1 cycle. This often makes the answer blatantly wrong compared to IACA. – BeeOnRope Jun 05 '19 at 00:35