
I was very confused when I saw this perf report. I have tried it several times, and this setne instruction always accounts for the largest share of time in the function. The function is large, and below is just a small piece of it.

The report is produced with:

perf record ./test

And I check the perf result with:

perf report --showcpuutilization

I opened the annotation for one of my most expensive functions, which is very large; a small piece of it is shown in the screenshot (image omitted here).

From it, we can see that the setne instruction (about 10 lines from the top, shown in red) accounts for about 9% of the cycle samples.

Can anyone help me understand why this "simple instruction" costs so much time? Maybe it's related to pipeline ordering and dependencies on other instructions? Thanks in advance!

BTW: the program was compiled with the command below on an x86_64 machine:

gcc -g -pg -m32 -o test test.c
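As an aside, `-pg` without any `-O` flag produces an instrumented, unoptimized build, so the profile mostly reflects `-O0` code generation. A sketch of how one might instead profile an optimized build (same hypothetical filenames as above; `-pg` is gprof instrumentation and isn't needed for perf):

```shell
# Optimized build with debug info so perf can map samples back to source
gcc -O2 -g -m32 -o test test.c

# Sample the cycles event and inspect per-instruction attribution
perf record ./test
perf report --showcpuutilization
```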

Below is the CPU information:

processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 63
model name      : Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
stepping        : 2
microcode       : 0x1
cpu MHz         : 2494.222
cache size      : 16384 KB
physical id     : 0
siblings        : 1
core id         : 0
cpu cores       : 1
apicid          : 0
initial apicid  : 0
fpu             : yes
fpu_exception   : yes
cpuid level     : 13
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology eagerfpu pni pclmulqdq vmx ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm invpcid_single ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid xsaveopt arat md_clear spec_ctrl intel_stibp
bogomips        : 4988.44
clflush size    : 64
cache_alignment : 64
address sizes   : 46 bits physical, 48 bits virtual
power management:
    It's probably getting the "blame" for `cmpl` being slow to produce the result, e.g. because of the store/reload bottleneck created by using a debug build that keeps variables in memory. I'd recommend against spending too much time profiling a debug build, except maybe if you want to learn about CPU architecture and why exactly debug builds are such garbage for performance. ([Why does clang produce inefficient asm with -O0 (for this simple floating point sum)?](https://stackoverflow.com/q/53366394)) – Peter Cordes Sep 08 '20 at 08:10
    I don't know if there's a good canonical Q&A covering the relevant issues here, like "skew" in perf events, and also how CPUs choose which instruction to blame out of all the instructions that were in flight when the "cycles" event chose to take a sample. Pipelined superscalar out-of-order exec makes that tricky. – Peter Cordes Sep 08 '20 at 08:47
  • Is there also a dependency on the load of `%eax` four instructions earlier? – Nate Eldredge Sep 08 '20 at 19:36
  • Thanks, @PeterCordes. Your answer seems reasonable. I reran it today with an optimized version, and I didn't see such long delays, but I have other observations. Below code: `2.59 | mov 0x804c910,%eax` `0.01 |179: mov %esi,(%eax)` `7.50 | add $0x4,%eax` Line 3 took a long time (7.5); I think it's because line 2 still uses %eax, so it has to wait. But line 2 only took 0.01, so I'm again confused — but as you said, pipelined superscalar out-of-order exec makes that tricky. Another run showed "2.70, 0.01, 5.12" for the 3 instructions. – Steven Ding Sep 09 '20 at 15:06
  • @NateEldredge, thanks - there is a dependency 2 instructions earlier; further back, an instruction stores data into %eax: `mov $0x0,%eax`. It's in AT&T x86 assembly format. :) – Steven Ding Sep 09 '20 at 15:13
  • The instruction is `mov $0x0, %ax`, not `%eax`. It's a 16-bit mov and thus the high 16 bits are preserved. So as I understand it, the dependency is not removed; the data stored by `mov -0x18(%ebp), %eax` still affects the result of `setne %al`. – Nate Eldredge Sep 10 '20 at 01:08
  • @NateEldredge, `mov -0x18(%ebp), %eax` loads a value into %eax. And `setne` just sets %al, which is the lowest 8 bits of %eax. I have no idea why you think it may impact `setne`? – Steven Ding Sep 10 '20 at 01:29
  • Please don't edit answers into the question. That edit with general answers would work if you posted it as an actual answer. – Peter Cordes Apr 16 '21 at 03:17

1 Answer


Just trying to provide a rough (not perfectly accurate) answer here:

  • "perf" works based on samples. At each sample, it checks the current EIP value and records it.
  • The percentage for an instruction is just the fraction of samples whose EIP falls on that instruction's address, out of the total samples for the scope. When a previous instruction is slow, EIP tends to sit there.
  • On some modern CPUs, the reported hot spot may be a few instructions away from the real "blocking point" (this is sometimes called skew). So it's usually worth looking at nearby instructions to see whether any of them could cause a delay in execution.

References: https://perf.wiki.kernel.org/index.php/Tutorial#Sampling_with_perf_record

  • For CPUs with out-of-order exec, it's not so much "the previous instruction" as "the instruction producing the result you're waiting for". So `setcc` gets the blame for waiting for FLAGS from cmp. It happens that was the instruction right before that, but an independent `mov` or something could execute and leave the blame to fall on setcc. – Peter Cordes May 08 '21 at 09:29
  • Also, IIRC, skew effects normally blame *later* instructions than the real stall point, not earlier. – Peter Cordes May 08 '21 at 09:31
  • And of course the most important takeaway from the whole question is that debug builds have *different* bottlenecks from normal optimized builds. Store/reload creates a ton of latency. [C loop optimization help for final assignment (with compiler optimization disabled)](https://stackoverflow.com/q/32000917) – Peter Cordes May 08 '21 at 09:32
  • Related: [Unexpectedly slow performance on moving from register to frequently accessed variable](https://stackoverflow.com/q/76773814) is a simpler case, just a store waiting for an L3-miss load+add, with the store getting all the "blame". – Peter Cordes Jul 26 '23 at 18:36