
I analyzed a piece of assembly code for Skylake with uiCA:

Throughput (in cycles per iteration): 5.25
Bottleneck: Issue

The following throughputs could be achieved if the given property were the only bottleneck:

  - DSB: 3.67
  - Issue: 5.25
  - Ports: 5.00
  - Dependencies: 2.00

M - Macro-fused with previous instruction

┌───────────────────────┬────────┬───────┬───────────────────────────────────────────────────────────────────────┬───────┐
│ MITE   MS   DSB   LSD │ Issued │ Exec. │ Port 0   Port 1   Port 2   Port 3   Port 4   Port 5   Port 6   Port 7 │ Notes │
├───────────────────────┼────────┼───────┼───────────────────────────────────────────────────────────────────────┼───────┤
│              1        │   1    │   2   │                             0.51      1                         0.49  │       │ push r15
│              1        │   1    │   2   │                                       1                          1    │       │ push r14
│              1        │   1    │   2   │                             0.27      1                         0.73  │       │ push rbx
│              1        │   1    │   1   │                                                         1             │       │ cmp esi, 0x63
│                       │        │       │                                                                       │   M   │ jnle 0x2a
│              1        │   1    │       │                                                                       │       │ mov rbx, rdi
│              1        │   1    │   1   │  0.2      0.31                                0.36     0.13           │       │ movsxd r15, esi
│              1        │   1    │   1   │            1                                                          │       │ lea r14, ptr [rip+0x21]
│              1        │   1    │   1   │           0.64                                0.36                    │       │ lea rsi, ptr [rbx+r15*1]
│              1        │   1    │       │                                                                       │       │ mov rdi, r14
│              1        │   1    │       │                                                                       │       │ xor eax, eax
│              2        │   2    │   3   │  0.58              0.24               1                0.42     0.76  │       │ call 0x5
│              1        │   1    │   1   │           0.47                                0.53                    │       │ lea rax, ptr [r15+0x2]
│              1        │   1    │   1   │  0.4      0.07                                0.27     0.27           │       │ cmp r15, 0x62
│              1        │   1    │       │                                                                       │       │ mov r15, rax
│              1        │   1    │   1   │                                                         1             │       │ jl 0xffffffffffffffe7
│              1        │   1    │   1   │                    0.49     0.51                                      │       │ pop rbx
│              1        │   1    │   1   │                    0.51     0.49                                      │       │ pop r14
│              1        │   1    │   1   │                    0.49     0.51                                      │       │ pop r15
│              2        │   2    │   9   │  1.11     0.84      2        2        1       1.16     0.89           │       │ ret 
├───────────────────────┼────────┼───────┼───────────────────────────────────────────────────────────────────────┼───────┤
│             21        │   21   │  28   │  2.29     3.33     3.73     4.29      5       2.67     3.71     2.98  │       │ Total
└───────────────────────┴────────┴───────┴───────────────────────────────────────────────────────────────────────┴───────┘

But I have no idea what any of this means, and Googling didn't turn up anything useful. Can someone explain what each part of the output means (DSB, Issue, Ports, Dependencies, ...), and how I can compare this code with another piece of code?
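For reference, two of the summary figures can be reproduced by hand from the "Total" row of the table. The sketch below is only a cross-check, not uiCA's actual algorithm: it assumes Skylake issues at most 4 fused-domain uops per cycle, and it takes the busiest port's total as the port bound, which happens to coincide with uiCA's "Ports" figure here. The function names `issue_bound` and `port_bound` are made up for illustration.

```c
#include <stddef.h>

/* Assumption: Skylake issues/renames at most 4 fused-domain uops per cycle. */
#define ISSUE_WIDTH 4

/* Lower bound on cycles per iteration from front-end issue bandwidth:
   total fused-domain uops ("Issued" column) divided by the issue width. */
double issue_bound(int fused_uops)
{
    return (double)fused_uops / ISSUE_WIDTH;
}

/* Lower bound from the execution ports: the largest per-port total,
   since that port needs at least this many cycles per iteration. */
double port_bound(const double *port_totals, size_t n)
{
    double busiest = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (port_totals[i] > busiest)
            busiest = port_totals[i];
    }
    return busiest;
}
```

With the totals from the table, `issue_bound(21)` gives 5.25 and `port_bound` over the eight port totals gives 5.0 (port 4), matching the "Issue: 5.25" and "Ports: 5.00" lines. The reported throughput of 5.25 is the largest of the four listed bounds, which is why the bottleneck is named as Issue.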

Peter Cordes
Cyao
  • DSB is the uop cache: Decode Stream Buffer. To understand this, you have to know how modern CPUs work; go read Agner Fog's microarch guide (https://agner.org/optimize/), at least the Skylake chapter. Also relevant: https://www.realworldtech.com/sandy-bridge/ and [What considerations go into predicting latency for operations on modern superscalar processors and how can I calculate them by hand?](https://stackoverflow.com/q/51607391) Also semi-related: [What is IACA and how do I use it?](https://stackoverflow.com/q/26021337), but it doesn't try to explain uops and ports either. – Peter Cordes Feb 18 '23 at 21:00
  • Normally you wouldn't use uiCA on a whole function body including the push/pop and ret, especially without a matching `call`. It treats the whole thing as a loop body, assuming that it ends with a jump to the top, or is fully unrolled. I'm not sure https://uops.info/'s model of `ret` is accurate for the correctly-predicted case; I think they have too many back-end uops (for the unfused domain = execution ports and scheduler aka RS). `call __isoc99_scanf@PLT` is also super weird for static analysis: it can't see the callee, so it is only counting the uops for the `call` itself. – Peter Cordes Feb 18 '23 at 21:05
  • Thanks a lot for the info! I just want to compare the performance of two pieces of C code (both run in about 0.01 s, so I can't use `time`), so the best option I have is the analysis from uiCA; that's why I pasted the whole function into uiCA :P. Having `call __isoc99_scanf@PLT` is not a big deal for me since all my functions have it. Also, could you tell me which column shows the number of uops an instruction uses? – Cyao Feb 18 '23 at 21:25
  • "Issued" is fused-domain uops (front-end), "Exec." is unfused-domain. E.g. `call` is 2 front-end uops, a jump and a push. The push in the back-end is a store-address (ports 2, 3, or 7) and a store-data (can only run on port 4), so 3 total unfused-domain uops. The stack engine handles the RSP-=8 without a uop. – Peter Cordes Feb 18 '23 at 21:28
  • If the bottleneck had been dependency chains, that part of the analysis would be meaningless, because the region includes a function call but the code inside `scanf` isn't actually counted. `scanf` is not very fast; the time for it will make differences in the caller very minor. It's like you're carefully measuring the tips of some icebergs and ignoring the part under water. uiCA can count uops for you, and fewer in the loop that calls scanf is generally better, so that's what you should be analyzing, not the whole function. **uiCA assumes the whole region you give it is a loop body.** – Peter Cordes Feb 18 '23 at 21:32
  • But like I said, `scanf` is going to almost totally dominate the run time, so any significant optimization of your problem is going to involve calling scanf less often. Like maybe getting a whole line and parsing it. Perhaps even just `sscanf` to avoid the stdin lock/unlock overhead of `scanf` could help, if there are multiple things per line? – Peter Cordes Feb 18 '23 at 21:35
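
The last suggestion, reading a whole line and parsing it instead of calling `scanf` once per item, might look roughly like the sketch below. The question doesn't show the actual input format, so this assumes, purely for illustration, whitespace-separated decimal integers on each line; `parse_longs` is a hypothetical helper name, not something from the original code.

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical helper: parse up to max whitespace-separated decimal
   integers from one line of text. Returns how many were stored in out. */
size_t parse_longs(const char *line, long *out, size_t max)
{
    size_t count = 0;
    const char *p = line;
    while (count < max) {
        char *end;
        errno = 0;
        long value = strtol(p, &end, 10);
        if (end == p || errno == ERANGE)
            break;              /* no more digits, or value out of range */
        out[count++] = value;
        p = end;                /* continue right after the parsed number */
    }
    return count;
}
```

A caller would read each line once with `fgets(buf, sizeof buf, stdin)` and pass `buf` to `parse_longs`, so stdin is accessed once per line rather than once per value. Whether that is measurably faster for this program would still have to be timed.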
