
I am asking whether mov instructions that need to compute an address, i.e. (in AT&T syntax):

mov i(r, r, i), reg or mov reg, i(r, reg, i)

have to be executed on port 1 because they are effectively an LEA with 3 operands plus a MOV, or whether they are free to execute on p0156.

If they do indeed execute the LEA portion on port 1, will port 1 be unblocked once the address computation is complete, or will the entire memory load need to complete first?

On ICL it seems p7 can do indexed addressing modes?

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>


#define BENCH_ATTR __attribute__((noinline, noclone, aligned(4096)))


#define TERMS 3

void BENCH_ATTR
test_store_port() {
    const uint32_t N = (1 << 29);

    uint64_t dst, loop_cnt;
    uint64_t src[16] __attribute__((aligned(64)));

    asm volatile(
        "movl %[N], %k[loop_cnt]\n\t"
        ".p2align 5\n\t"
        "1:\n\t"

        "movl %k[loop_cnt], %k[dst]\n\t"
        "andl $15, %k[dst]\n\t"
#if TERMS == 3
        "movl %k[dst], (%[src], %[dst], 4)\n\t"
#else
        "movl %k[dst], (%[src])\n\t"
#endif


        "decl %k[loop_cnt]\n\t"
        "jnz 1b\n\t"
        : [ dst ] "=&r"(dst), [ loop_cnt ] "=&r"(loop_cnt),
          "+m"(*((uint32_t(*)[16])src)) /* the asm writes src, so it's an output */
        : [ N ] "i"(N), [ src ] "r"(src)
        : "cc");
}

int
main(void) {
    test_store_port();
    return 0;
}

Results with #define TERMS 3:

perf stat -e uops_dispatched.port_2_3 -e uops_dispatched.port_7_8 -e uops_issued.any -e cpu-cycles ./bsf_dep

 Performance counter stats for './bsf_dep':

           297,191      uops_dispatched.port_2_3                                    
       537,039,830      uops_dispatched.port_7_8                                    
     2,149,098,661      uops_issued.any                                             
       761,661,276      cpu-cycles                                                  

       0.210463841 seconds time elapsed

       0.210366000 seconds user
       0.000000000 seconds sys

Results with #define TERMS 1:

perf stat -e uops_dispatched.port_2_3 -e uops_dispatched.port_7_8 -e uops_issued.any -e cpu-cycles ./bsf_dep

 Performance counter stats for './bsf_dep':

           291,370      uops_dispatched.port_2_3                                    
       537,040,822      uops_dispatched.port_7_8                                    
     2,148,947,408      uops_issued.any                                             
       761,476,510      cpu-cycles                                                  

       0.202235307 seconds time elapsed

       0.202209000 seconds user
       0.000000000 seconds sys
Noah

1 Answer


All CPUs do address-generation for load / store uops on AGUs in the load or store-address ports, not on the ALU ports. Only LEA ever uses the ALU execution ports for that shift-and-add math.

If complex addressing modes needed port 1, https://uops.info/ and/or https://agner.org/optimize/ would say so in their instruction tables. But they don't: loads only need p23, and stores only p237 for store-address + p4 for store-data.


Actually just p23 for indexed stores; the simple store-address AGU on port 7 (Haswell through Skylake) can only handle reg+constant addressing modes, so address-generation can become a bottleneck if you use indexed addressing modes in code that could otherwise sustain 2 loads + 1 store per clock.
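As a sketch (hypothetical loop, Intel syntax), this is the shape of code where the port 7 limitation bites on HSW/SKL:

```asm
; Haswell/Skylake: only reg+disp store addresses are p7-eligible.
.loop:
    mov eax, [rdi + rcx*4]     ; load: AGU on p2 or p3 (any addressing mode)
    mov edx, [rsi + rcx*4]     ; load: AGU on p2 or p3
    mov [rbx], eax             ; store-address can use p7 -> 2 loads + 1 store/clock
    ; mov [rbx + rcx*4], eax   ; indexed store-address: p2/p3 only,
    ;                          ; competing with the loads for AGU throughput
    add rbx, 4
    inc rcx
    cmp rcx, r8
    jb  .loop
```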

(Early Sandybridge-family, SnB and IvB, would even un-laminate indexed stores, so there was a front-end cost, too.)

Ice Lake changed that, with 2 dedicated store AGUs on ports 7 and 8. Store-address uops can't borrow load AGUs anymore, so the store AGUs have to be full featured. https://uops.info/html-tp/ICL/MOV_M32_R32-Measurements.html confirms that stores with indexed addressing mode do run at 2/clock on ICL, so both of the store AGUs are full-featured. e.g. mov [r14+r13*1+0x4],r8d. (uops.info didn't test a scale factor > 1, but I'd assume both the store-AGUs are identical in which case they'd both handle it.)

Unfortunately it will be many years before HSW/SKL stop mattering for tuning: Intel is still selling Skylake-derived microarchitectures, so they'll remain a large part of the installed base of desktop hardware for years.

Peter Cordes
  • Thanks for the update. Where did you get the information about the exception? – Noah Sep 21 '20 at 03:42
  • @Noah From the links in my answer, especially Agner Fog's instruction tables, and confirmed (in the past, didn't re-test for this answer) with `perf stat` with events like `uops_dispatched_port.port_1` which you can see is `0` when you're only executing loads, not ALU instructions, even if they're complex indexed addressing modes like `[rdi + rcx*4 + 1234]`. Also https://www.realworldtech.com/haswell-cpu/. Related: https://github.com/travisdowns/uarch-bench/wiki/Intel-Performance-Quirks but that only covers more subtle less widely-known stuff. – Peter Cordes Sep 21 '20 at 03:55
  • 1
    Can you take a look at that example. Unless you see something wrong with it, it seems on icelake port 7 can do indexed address calculation as well. (I know beyond scope of initial question). Larger question I was trying to answer (and believe I did) is assuming no issue of port contention for address generation is there any reason (aside from code size) to choose simple address mode as opposed to indexed address mode: Nope. – Noah Feb 17 '21 at 19:13
  • 1
    @Noah: yes, Ice Lake changed that. Store-address uops can't borrow load AGUs anymore, so the store AGUs have to be full featured. (Or at least one of them does; have you tested that it can sustain 2/clock indexed stores (narrower than 64-byte each so store-data isn't a bottleneck)? I'd hope yes, but Haswell / SKL did choose to leave that out from their port 7 store AGU for no apparent reasons, definitely cramping optimization.) Of course, if you ever care about your code running on the large installed-base of older CPUs, you'd still optimize accordingly. – Peter Cordes Feb 17 '21 at 23:22
  • 1
    @Noah: nevermind, https://www.uops.info/html-tp/ICL/MOV_M32_R32-Measurements.html already measured that. Scroll down for indexed addressing modes: such stores do run at 2/clock on ICL. e.g. `mov [r14+r13*1+0x4],r8d`. They apparently didn't test *scaled* index, but I'd hope that's the same. Especially if there aren't separate counters for port 7 vs. port 8, hopefully they're identical. – Peter Cordes Feb 17 '21 at 23:25
  • confirmed 2 stores/cycle with scaled index. I'm wondering: is there an intuition for why on Skylake `leaq` with an indexed addressing mode has 3c latency compared to 1c latency with a simple addressing mode, whereas load/store have the same latency irrelevant of addressing mode? Is it because load/store have 5c latency regardless, so the AGU can always keep up, or does it actually use a different type of execution unit for AGU compared to lea? – Noah Feb 18 '21 at 02:21
  • @Noah: First of all, `lea rax, [rdi + rsi*4]` *is* a "simple" LEA (1c latency) on Skylake (but not AMD). A non-zero shift count is only a problem on AMD. Only 3-component LEA (2 `+` operations) are slow. This makes a bit of sense I guess. They would just be 2 cycles, but SnB-family standardizes latencies so there are no 2c single uops (to make scheduling easier for HW that has to avoid creating write-back conflicts), and all 3-cycle integer instructions use the ALU on port1; the integer ALUs on other ports are presumably only 1 stage long, hence limited throughput for slow LEA. – Peter Cordes Feb 18 '21 at 02:29
  • @Noah: But yes, good question about AGUs. Perhaps 2 adds are only just barely too slow to manage in 1 cycle, and the pipeline stages in the load port can be placed at more convenient spots? Address-generation has to be finished before checking the TLB. (Except in the 4-cycle load-use latency special case for `[reg + 0..2047]` addr modes where it [optimistically guesses that `reg` alone is in the same 4k page](https://stackoverflow.com/q/52351397) when reg is a load result. But AFAIK, no similar penalty if `[rdi + rsi*4 + 1234]` is in a different page than `[rdi + rsi*4]` or anything.) – Peter Cordes Feb 18 '21 at 02:35
  • wow, I've misunderstood this for a long time. Is the following then correct: simple addressing mode -> `[reg + 0..2047]` (optimistically 4c, otherwise 7 - 10c), base-pointer addressing mode -> `[reg + 2048...N]` 5c, indexed addressing mode -> `[reg + reg0 * c]` 5c, 3-term addressing mode -> `[reg + C0 + reg0 * C1]` 5c. So all are 5c but simple addressing mode. Then `lea` has 1c for all of the above but 3-term, which has 3c until Ice Lake (then 1c)? – Noah Feb 18 '21 at 06:16
  • uops.info doesn't seem to have benchmarks for 3-term addressing-mode `mov`, but assuming it's still 5c (that's what I measured), does it have any other hidden costs (aside from stealing p23 on Skylake and older)? – Noah Feb 18 '21 at 06:17
  • 1
    @Noah: those 5c latencies are load-use latencies (for pointer chasing like iterating a linked list or tree). I don't know how to measure store-address latency; it's hard to make it part of a dep chain. But yes, everything is always 5c total load-use latency, except the `[reg + 0..2047]` special case *when reg itself was the result of another load*. (Or on Ice Lake, which seems to have dropped that special case and is now always 5c.) But even pre-ICL, the CPU only tries to optimistically skip an AGU adder on the critical path when pointer chasing, the case where load latency is most critical – Peter Cordes Feb 18 '21 at 06:25
  • @Noah: So it's amusing to see descriptions of Ice Lake having "a slower L1d cache" and just saying the latency is 5c, up from 4, when the 4c latency was only ever a speculative special case. I mean maybe ICL's larger L1d actually is also slower, and itself can only barely keep up with 5c total latency, making the early-TLB speculative trick not useful. But IDK. (Indexing a set of the cache needs the correct low 12 bits of the final address, which could certainly be ready sooner than the full 64 due to less carry latency. The TLB result is needed for the tag comparators after reading cache.) – Peter Cordes Feb 18 '21 at 06:30
  • Hmm, a few newish words there. "tag comparators" -> full address for matching against the lines in the set? "sooner than the full 64 due to less carry latency" — you mean like taking the first 12 bits of the output before the rest of the result is ready? Does that actually happen? Is it exclusive to addresses computed with an AGU, or is there a mechanism for, say, getting the low bits of an add instruction early, i.e. (AT&T) `addq reg1, reg0; movl (reg0), reg1` or just `movl (reg0, reg1), reg1`? – Noah Feb 18 '21 at 06:52
  • I can kind of see how in an adder there is a partial result at certain points (I don't know much about digital circuits though), but with, say, an xor I can't imagine the time spent coordinating taking the first 12 bits of the result being faster than just waiting for the other 52 to finish. – Noah Feb 18 '21 at 06:54
  • 1
    @Noah: There's no "coordinating" needed, you literally just wire up the low 12 output bits to the part that needs them as an input. Those low outputs will stabilize sooner than higher bits because carry doesn't have to propagate all the way, so you can design the part that needs the low 12 to start using them on an earlier clock edge than stuff that's using the full result. This is all internal to a load port; forwarding data between separate uops only happens for full values at clock cycle boundaries. (Except on early Pentium-4 where the narrow ALUs allowed effective half-cycle add latency) – Peter Cordes Feb 18 '21 at 07:05
  • 1
    @Noah: re: tag comparator: yes, you fetch tags + data for all ways of a set, and any `==` comparison that gets a hit for the tag part of the address will result in muxing the corresponding data to the output. https://courses.cs.washington.edu/courses/cse378/09wi/lectures/lec16.pdf#page=19 has an example diagram. – Peter Cordes Feb 18 '21 at 07:09