
I've looked through the sse wiki and x86 wiki, and there appear to be several great references for looking up either specific Intel intrinsic functions or the latencies of assembly instructions on various processor architectures. Intel's intrinsics guide seems to have some latencies and throughputs listed for certain processor architectures and instructions, but it doesn't appear nearly as comprehensive as the uop tables. On the flip side, I'm struggling to find specific Intel intrinsics in the uop tables. For example, searching for loadu (specifically _mm_loadu_pd) doesn't yield any hits. How can I look up latency and throughput information for Intel intrinsics without having an encyclopedic knowledge of assembly mnemonics?

drakon101
    You can look up the instruction corresponding to the intrinsic in the intrinsics guide and then look up that instruction in uops. Or better, compile your function with intrinsics to assembly code and look up the generated instructions. – chtz Feb 16 '22 at 17:27
    To add to @chtz's comment: specifically `_mm_loadu_pd` is documented to compile to `movupd`, but compilers may pick `movdqu` instead (the integer equivalent) or `movapd` (the aligned equivalent, which a compiler may pick if it knows the source is aligned), or fold the load into a memory operand of another instruction and not produce any instruction solely for the load, so compiler output is the best bet. – Alex Guteniev Feb 16 '22 at 19:17

2 Answers


The intrinsics guide shows you the asm mnemonic that the intrinsic nominally corresponds to, e.g. movupd for _mm_loadu_pd. That's in the right-hand column of the search results, and in the info box when you expand an entry there's a line Instruction: movupd xmm, m128.

But I'd recommend looking at how your C compiles to asm (e.g. on https://godbolt.org/), then looking those instructions up. That accounts for the compiler optimizing your code, e.g. clang using different shuffles, and, critically in this case, folding a load into a memory source operand for another instruction, like vaddpd xmm0, xmm0, [rdi].

(AVX allows folding unaligned loads; legacy SSE requires memory operands to be aligned, so it can only fold _mm_load_pd, not _mm_loadu_pd. This is why there's still a benefit to telling the compiler your data is aligned, even on Nehalem and later, if that's always true.)
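For example, a tiny function like this (a made-up sketch, not from the question) is enough to paste into Godbolt and see which instructions you then need to look up in the uop tables:

```c
#include <immintrin.h>

// With -O2 -mavx, compilers typically emit something like
//   vmovupd xmm0, [rdi]
//   vaddpd  xmm0, xmm0, [rsi]   ; second load folded into a memory operand
// With plain SSE2 (no -mavx), expect two separate movupd loads plus addpd,
// because legacy SSE can't fold an unaligned load into a memory operand.
__m128d add_two(const double *a, const double *b)
{
    __m128d va = _mm_loadu_pd(a);
    __m128d vb = _mm_loadu_pd(b);
    return _mm_add_pd(va, vb);
}
```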


But note that it's not meaningful to just add up throughput numbers of instructions that compete for different throughput resources (different execution ports).

Peter Cordes
  • Also re: latency of load/store, see store-forwarding latency vs. L1d load-use latency. [Adding a redundant assignment speeds up code when compiled without optimization](https://stackoverflow.com/q/49189685) re: store-forwarding quirks and its variable latency on SnB vs. [What are the costs of failed store-to-load forwarding on x86?](https://stackoverflow.com/q/46135369) for store-forwarding slow-path (e.g. scalar stores / vector reload). – Peter Cordes Feb 16 '22 at 23:23

For instructions which load or store data from/to memory, you won't find that data in any tables: it's very complicated, depends on far more than just the CPU core, and is impossible to put in a table.

  1. There's a limit on how many load and store operations a core can do per cycle, and the limit is shared between the two hyperthreads. On most CPUs it doesn't matter how many bytes an instruction is moving, 1 or 32. In most cases there are penalties for loads/stores which cross a cache line boundary (cache lines are 64 bytes, aligned by 64); see the split-load sketch below.

  2. Then there are various cache latencies. Cache throughput is essentially unlimited, i.e. even L3 is fast enough to handle two 32-byte loads every cycle from each core in parallel, but latency increases with each cache level.
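As a concrete illustration of the line-crossing point in item 1 (a hypothetical sketch; the buffer and offset are made up): this single 16-byte load touches bytes 56..71 of a 64-byte-aligned buffer, so it straddles two cache lines and is the kind of access that pays a split penalty.

```c
#include <immintrin.h>
#include <stdalign.h>

alignas(64) static double buf[16];   // starts exactly on a cache-line boundary

__m128d line_split_load(void)
{
    // buf + 7 is byte offset 56: the 16-byte load covers the last 8 bytes of
    // one 64-byte line and the first 8 bytes of the next one.
    return _mm_loadu_pd(buf + 7);
}
```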

For the above things, wikichip mentions some numbers for many popular CPUs; here are the numbers for AMD Zen 2.
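To make the latency point in item 2 concrete, here's a rough sketch of the classic pointer-chasing idea (names and sizes are mine, not from the answer): every load depends on the previous one, so the time per iteration approximates the load-to-use latency of whichever cache level holds the working set, rather than its throughput.

```c
#include <stdlib.h>

// Build a ring of n pointers; chasing it produces a chain of dependent loads.
// A real measurement would use a random permutation (and a warm-up pass) so
// the hardware prefetcher can't hide the latency; this only shows the idea.
static void **build_ring(size_t n)
{
    void **a = malloc(n * sizeof *a);
    for (size_t i = 0; i < n; ++i)
        a[i] = &a[(i + 1) % n];
    return a;
}

static void *chase(void **start, long iters)
{
    void *p = start;
    for (long i = 0; i < iters; ++i)
        p = *(void **)p;   // serially dependent: bound by load latency, not throughput
    return p;              // return the pointer so the loop isn't optimized away
}
```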

  3. Then there's off-chip memory, on modern computers often DDR4. Not only is it slow in terms of latency, its bandwidth is not great either. The bandwidth data is on Wikipedia. A "transfer" is 8 bytes and is multiplied by the count of channels, so for dual-channel DDR4 memory, 1 transfer = 16 bytes of data, aligned by 16. The throughput scales roughly linearly with frequency × channel count, but for latency there's more to it: memory modules aren't equal, and different DDR4-3200 modules may have latency numbers that differ by a factor of 1.5.
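As a rough worked example with numbers of my own (assuming dual-channel DDR4-3200): 3200 MT/s × 8 bytes per transfer × 2 channels ≈ 51.2 GB/s of theoretical peak bandwidth, and sustained bandwidth in practice is noticeably lower than that.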

And on top of that, there are many special cases.

There's a thing called the TLB, the translation lookaside buffer; it caches mappings from virtual to physical pages. If a memory transaction crosses a page boundary (pages are 4KB on modern Windows), that may cause penalties.

CPU cores implement a cache coherency protocol. That makes loading from a cache line which was recently modified by another CPU core very expensive; on most CPUs it's even slower than a cache miss and a round trip to off-chip memory.
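A minimal sketch of how that coherency cost shows up (hypothetical code; the two counters are made up): the two threads below write to fields in the same 64-byte cache line, so the line keeps bouncing between the cores' caches.

```c
#include <pthread.h>
#include <stdint.h>

// 'a' and 'b' share a 64-byte cache line, so each thread's write invalidates
// the line in the other core's cache and the line ping-pongs between cores.
// Padding the counters onto separate lines (e.g. with alignas(64)) typically
// makes this run several times faster.
static struct { volatile uint64_t a, b; } line;

static void *bump_a(void *arg) { for (long i = 0; i < 100000000L; ++i) line.a++; return arg; }
static void *bump_b(void *arg) { for (long i = 0; i < 100000000L; ++i) line.b++; return arg; }

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, bump_a, NULL);
    pthread_create(&t2, NULL, bump_b, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}
```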

Some AMD CPUs have multiple chiplets (core complex dies); bandwidth across chiplets is lower than the bandwidth between cores and the L3 cache on the same chiplet.

And that's not even a complete list; there are also power-state shenanigans, ECC, DRAM refresh cycles, and more. For these reasons, unlike instructions which compute something, for loads and stores both latency and throughput are moot points with too many unknowns.

Soonts