I'm profiling an application using Intel VTune, and there is one particular hotspot where I'm copying a __m128i member variable in the copy constructor of a C++ class.

VTune gives this breakdown:

Instruction                  CPU Time: Total        CPU Time: Self

Block 1:
vmovdqa64x (%rax), %xmm0     4.1%                   0.760s
add $0x10, %rax              46.6%                  8.594s

Block 2:
vmovapsx %xmm0, -0x10(%rdx)  6.5%                   1.204s

(If it matters, compiler is gcc 7.4.0)

I admit I'm an assembly noob, but it's very surprising that one particular add instruction is taking up 46% of my application time, given that the app is doing lots of other complex things and add is such a trivial operation.

Am I misinterpreting the profiling output somehow? Is there a path to optimize this other than "copy that variable less"?
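For reference, here is a minimal, hypothetical sketch of the kind of class being copied. The class and member names are invented for illustration; all that is stated is that the class holds a single `__m128i` member and that the hot copy is effectively `*this = other` (see the comments below):

```cpp
#include <immintrin.h>

// Hypothetical stand-in for the class described in the question: a single
// __m128i member, copied in the copy constructor via `*this = other`.
struct Vec128
{
    __m128i value;

    Vec128() : value(_mm_setzero_si128()) {}

    // With -O3 -march=native, gcc lowers this copy to a 16-byte vector load
    // plus a 16-byte vector store, much like the vmovdqa/vmovaps pair in the
    // VTune listing above.
    Vec128(const Vec128& other) { *this = other; }
    Vec128& operator=(const Vec128& other) = default;
};

int main()
{
    Vec128 a;
    Vec128 b(a);   // exercises the copy constructor
    (void)b;
    return 0;
}
```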

  • is that implying that adding 0x10 to rax is taking up 8.594 seconds? A single instruction? – Irelia Feb 14 '20 at 19:17
  • I wouldn't use VTune to examine the assembly, but instead the C++ code... I'd doubt that the `add` itself is taking that long. – ChrisMM Feb 14 '20 at 19:18
  • The corresponding C++ code is `*this = other`, and the class only holds a `__m128i` member variable. – Thomas Johnson Feb 14 '20 at 19:19
  • Obvious question: are you testing an optimised build or a default Debug build? Testing performance of debug builds is somewhat pointless. – Jesper Juhl Feb 14 '20 at 19:20
  • Optimized, with `-O3 -march=native` – Thomas Johnson Feb 14 '20 at 19:21
  • I don't think you can rely on performance tools to show you the exact assembler instruction you are spending time on, that is just too much to ask. In general, they will show you roughly where you're spending time, and it will be up to you to figure out exactly what is going on. Modern CPUs are EXTREMELY complicated, expecting a simple tool to pinpoint things like that is not realistic. Understanding what's going on is an art and a science. I recommend giving a broader description of what your code is doing, and how you've implemented it, then we might have a chance of helping. – DeducibleSteak Feb 14 '20 at 19:43
  • It's common for benchmarking software to blame the instruction *after* a load for the slowness of that load. That still doesn't make a lot of sense to me, but it happens – harold Feb 14 '20 at 20:00
  • @harold has a point about an instruction getting blamed for using time that's actually being spent fetching data from memory, by that instruction or an instruction near it. Also, I've never experienced this myself, but I've heard that instruction counters and SIMD instructions don't mix well - you can probably expect your measurements to be even less precise around them. You really need to think about what your code is doing near where your bottleneck is reported; you can't rely on a tool isolating the exact cause to an instruction level. – DeducibleSteak Feb 14 '20 at 20:06
  • @DeducibleSteak: SIMD instructions aren't special or worse for HW performance counters. Perhaps you read about auto-vectorization and other optimizations making profiling less sensible in terms of C++ *source* lines? Anyway, yes, counters like `core_cycles` tend to have some skew and grouping of counts, and/or blame a uop waiting for data instead of the thing that was slow to produce it. Events like `mem_load_retired.l1_miss` are precise, and useful once you have a guess why something might be slow. – Peter Cordes Feb 15 '20 at 05:58
  • Related: [Inconsistent \`perf annotate\` memory load/store time reporting](https://stackoverflow.com/q/65906312) re: HW perf counters "blaming" an instruction waiting for a slow result, rather than the cache-miss load itself. Blaming the instruction after (skew) is a different thing, perhaps because the CPU waits for the oldest instruction to retire when an interrupt is signalled, to ensure forward progress. And if that was a cache-miss load that was stalling execution, then it dodges blame. – Peter Cordes Jul 26 '23 at 18:00
  • Related: [Unexpectedly slow performance on moving from register to frequently accessed variable](https://stackoverflow.com/q/76773814) is just a store waiting for an L3-miss load+add, with the store getting all the "blame". – Peter Cordes Jul 26 '23 at 18:36
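To make the skew described in the comments concrete, here is a small, self-contained sketch (not taken from the question's code; the buffer size, names, and access pattern are arbitrary choices for illustration) where the time is really spent in a cache-missing load, yet a cycles-based sampling profiler will typically charge much of it to the cheap dependent instruction that follows. Recording a precise event such as `mem_load_retired.l1_miss` (mentioned above) rather than plain cycles usually points back at the load itself.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <numeric>
#include <random>
#include <vector>

int main()
{
    // A random permutation far too large for the last-level cache, so each
    // step of the pointer chase below is likely to go all the way to DRAM.
    constexpr std::size_t N = std::size_t{1} << 24;   // ~128 MiB of size_t
    std::vector<std::size_t> next(N);
    std::iota(next.begin(), next.end(), std::size_t{0});
    std::shuffle(next.begin(), next.end(), std::mt19937_64{42});

    std::size_t idx = 0;
    std::uint64_t sum = 0;
    for (std::size_t i = 0; i < N; ++i)
    {
        idx = next[idx];   // cache-missing load: where the time really goes
        sum += idx;        // cheap dependent add: often collects the "blame"
    }
    std::printf("%llu\n", static_cast<unsigned long long>(sum));
    return 0;
}
```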

0 Answers