
For speed, I would like to read one of 8 registers selected by the value in a 9th register. The fastest way I can see to do this is with 3 conditional jumps (checking 3 bits of the 9th register). That should have lower latency than the standard approach of an offset memory read, but it still takes at least 6 clock cycles (at least one test plus one conditional jump per bit checked).

Is there any commercial CPU (preferably x86/x64) with an intrinsic to do this "offset register read" with a latency of just one clock cycle?

In theory, an optimized CPU could do this with one addition and one move, so one or two clock cycles seems achievable. Is there some general reason that architectures don't care about speeding up an offset read of a small array?

bobuhito
  • Treating the CPU registers as an array is really not a common approach these days. The last architecture I know that allowed this was the PDP11 and it died out in the late 80s. Why don't you put your array into some memory location like any other array? – fuz Feb 03 '20 at 14:57
  • Accessing a register by variable index more or less defeats register renaming, so it's a very unlikely feature for a fast processor. You can sort of index an YMM register using VPERMD, maybe that's still OK for your use case? – harold Feb 03 '20 at 14:59
  • @fuz The AVR can do this too; the registers are mapped into the address space. – Erlkoenig Feb 03 '20 at 15:01
  • @fuz If you read my question again, I mention offset memory read (L1 cache) is much slower than register reads. – bobuhito Feb 03 '20 at 15:03
  • CPUs don't have intrinsics; compilers have intrinsics. Perhaps you mean "machine instruction" or "addressing mode" (which a compiler might have an intrinsic for)? As previous commenters say, most ISAs can't index the registers with a runtime index, only with indexes embedded in the machine code of an instruction. (So register fetch could happen any time after decode, without any extra indirection as part of the ISA.) – Peter Cordes Feb 03 '20 at 15:05
  • @bobuhito Sure, but due to out of order execution, this extra latency usually does not matter too much. It sure beats three conditional jumps in a row (or one computed jump). – fuz Feb 03 '20 at 15:05
  • So to answer your question: no, I don't think this is possible. – fuz Feb 03 '20 at 15:06
  • You could runtime-generate an instruction that reads the appropriate register. However, flushing the caches to make this work will be very slow... – Erlkoenig Feb 03 '20 at 15:06
  • Self modifying code is much more expensive than merely flushing a line from the data cache; on modern x86 it flushes the whole pipeline (perf counter event `machine_clears.smc`). Fine if you JIT once and run many times, total garbage if you do it every time. Store-forwarding latency is only about 3 to 5 cycles on modern Intel, similar to L1d load-use latency of 4 or 5 cycles. What's the larger problem you're trying to solve where you need low *latency* array indexing? Can you solve it with AVX2 shuffles instead? That's 3c latency (plus more for `vmovd` of an index from integer regs) – Peter Cordes Feb 03 '20 at 15:11
  • @bobuhito Any progress on this one? I'd really like to improve my answer, but I cannot do so without knowing your specific application: I cannot solve your problem of indexing registers, but it should be possible to write a fast solution for your problem that doesn't need this. – fuz Feb 16 '20 at 23:06

1 Answer


Treating the CPU registers as an array is really not a common approach these days. The last architecture I know that allowed this was the PDP11 and it died out in the late 80s. Why don't you put your array into some memory location like any other array?

That said, you could use a computed jump. This also replaces a data dependency (indexed addressing mode) with a control dependency, so out-of-order exec doesn't have to wait for the index input to even be ready before it can start running code that uses the final `%rax`. Of course this assumes correct branch prediction, which is unlikely if the index changes often. A branch mispredict costs many cycles during which little useful work gets done, while the small latency of a load that hits in L1d cache can overlap with independent work very easily.

The throughput cost is higher than an array in memory: some address computations, one jump, one move and a ret, instead of just a mov or even a memory operand with an indexed addressing mode.

To inline this code, simply replace the `jmp *%rax` with a `call *%rax`, costing another uop. Or replace the `ret` instructions with a `jmp` to a label at the bottom and increase the stride of the jump table to 8 to account for the longer encoding.

    # select a register from r8...r15 according to the value in rdi
select:
    lea labels-4*8(%rip),%rax # rdi = 8 is the first jump table entry
    lea (%rax,%rdi,4),%rax    # pointer to the appropriate entry
    jmp *%rax                 # computed jump

    .align 4
labels:
    mov %r8, %rax
    ret

    .align 4
    mov %r9, %rax
    ret

    .align 4
    mov %r10, %rax
    ret

    .align 4
    mov %r11, %rax
    ret

    .align 4
    mov %r12, %rax
    ret

    .align 4
    mov %r13, %rax
    ret

    .align 4
    mov %r14, %rax
    ret

    .align 4
    mov %r15, %rax
    ret

While this is probably faster than three conditional jumps (depending on access pattern), it surely won't beat just using an array in memory.
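As a comment below notes, a C or C++ compiler will emit a comparable jump table from a dense switch, so the dispatch structure above can be sketched portably. This is only an illustration (the function name and parameters are mine, not from the answer), and the compiler is free to spill the values to memory rather than keep them in registers:

```c
/* Hypothetical C counterpart of the hand-written jump table above:
 * eight values that may live in registers, selected by a dense switch
 * that typically compiles to an indirect jump through a table. */
static long select8(unsigned i,
                    long v0, long v1, long v2, long v3,
                    long v4, long v5, long v6, long v7)
{
    switch (i & 7) {          /* mask keeps the index in 0..7 */
    case 0:  return v0;
    case 1:  return v1;
    case 2:  return v2;
    case 3:  return v3;
    case 4:  return v4;
    case 5:  return v5;
    case 6:  return v6;
    default: return v7;
    }
}
```

Whether this actually avoids a memory round trip depends entirely on the compiler's register allocation, so it is a sketch of the dispatch, not a guarantee.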

You may also be able to use code like this, assuming the index is in `%eax`. This works by copying the index bits into CF, SF, and PF and then using a bunch of ALU operations to distinguish them:

    imul $0x4100, %eax, %eax  # index bit 0 -> AH bit 0 (CF),
                              # bit 1 -> AH bit 7 (SF), bit 2 -> AH bit 2 (PF)
    sahf                      # load AH into the flags register

    # bit 0: pick the odd or even register of each pair
    mov %r8, %rax
    cmovc %r9, %rax           # rax = bit0 ? r9  : r8
    mov %r10, %rcx
    cmovc %r11, %rcx          # rcx = bit0 ? r11 : r10
    mov %r12, %rdx
    cmovc %r13, %rdx          # rdx = bit0 ? r13 : r12
    mov %r14, %rbx
    cmovc %r15, %rbx          # rbx = bit0 ? r15 : r14

    # bit 1: pick within each half
    cmovs %rcx, %rax          # rax = bit1 ? rcx : rax
    cmovs %rbx, %rdx          # rdx = bit1 ? rbx : rdx

    # bit 2: final selection
    cmovp %rdx, %rax          # rax = bit2 ? rdx : rax

The result ends up in %rax (note that the sequence clobbers %rbx, %rcx, and %rdx). Due to the high instruction-level parallelism and lack of branches in this code, it should perform better than the code above, unless the index is almost always the same.

(stolen from this answer).
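The same three-level selection can be sketched in C with ternaries on the index bits; a compiler may (but is not guaranteed to) turn these into a `cmov` chain. The function name and parameter names are illustrative, chosen to mirror the registers in the assembly above:

```c
/* Branchless 8-way select, mirroring the cmov tree above:
 * bit 0 picks within each pair, bit 1 within each half of four,
 * bit 2 makes the final pick. */
static long select8_branchless(unsigned i,
                               long r8,  long r9,  long r10, long r11,
                               long r12, long r13, long r14, long r15)
{
    /* bit 0: pick the odd or even register of each pair */
    long a = (i & 1) ? r9  : r8;   /* like cmovc into rax */
    long c = (i & 1) ? r11 : r10;  /* like cmovc into rcx */
    long d = (i & 1) ? r13 : r12;  /* like cmovc into rdx */
    long b = (i & 1) ? r15 : r14;  /* like cmovc into rbx */
    /* bit 1: pick within each half */
    a = (i & 2) ? c : a;           /* like cmovs */
    d = (i & 2) ? b : d;
    /* bit 2: final selection */
    return (i & 4) ? d : a;        /* like cmovp */
}
```

Whether the compiler emits `cmov` or branches here is its choice; inspecting the generated assembly is the only way to be sure.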

fuz
  • Good idea. I wish there were a C++ way which generated assembly like this. Also, I can't believe (yet?) that this can be slower than L1 cache for random access. – bobuhito Feb 03 '20 at 15:20
  • @fuz: you should probably incorporate your comment about ISAs not having indirect addressing for registers. – Peter Cordes Feb 03 '20 at 15:25
  • @bobuhito: Latency isn't everything. OoO exec hides latency very well most of the time, so it's not rare for the front-end to be a bottleneck for well-tuned code. Or more often cache misses and/or branch mispredicts. Branch misses could easily be killer for this code unless the index is very predictable (e.g. highly correlated with earlier branching). – Peter Cordes Feb 03 '20 at 15:28
  • @bobuhito It's going to be a bit slower. Assuming you inline this (by turning `jmp *%rax` into `call *%rax`) we have on SkylakeX a latency of 3+1+2+0+1 = 7 which is more than the latency of 2 for a simple load from memory. Note that the latency for my code is likely going to be even higher because the jump target of the computed jump cannot be predicted in general. – fuz Feb 03 '20 at 15:28
  • @bobuhito Note that your C++ compiler is likely to produce very similar assembly (though probably with an explicit jump table) if you use a dense switch statement. – fuz Feb 03 '20 at 15:32
  • @fuz: That's the latency before the back end can check the prediction for the `jmp`/`call *%rax`. **But control latency is decoupled from data latency by speculative execution.** The performance costs of this vs. array indexing are on different axes, nearly orthogonal, except in the mispredict case where latency to the index being ready (RDI input) does get coupled to data latency by having to detect a branch mispredict. (But the RIP-relative LEA is still independent). And of course the huge cost of discarding all later work done in the shadow of that mispredict. – Peter Cordes Feb 03 '20 at 15:33
  • @fuz I thought latency to load from memory was more like 4 clock cycles, but ok. I'd like to get this all down to 1 clock cycle with a different CPU, so let me leave this question open a while. – bobuhito Feb 03 '20 at 15:35
  • @bobuhito Agner's tables say so for Skylake X. Your mileage may vary. If you told us what algorithm you try to implement using this approach, we might be able to propose even better solutions that break out of the “using registers as arrays” idea. – fuz Feb 03 '20 at 15:37
  • @bobuhito: yes, L1d load-use latency is 4 cycles (best case) on SnB-family. Or 5 cycles outside of the [pointer-chasing special case](https://stackoverflow.com/questions/52351397/is-there-a-penalty-when-baseoffset-is-in-a-different-page-than-the-base) for simple addressing modes where the value came from another load. But the core doesn't stall for those 5 cycles!! Almost always there's *some* independent work to do. **The throughput of loads is 2 per cycle** (on modern x86, AMD and Intel). – Peter Cordes Feb 03 '20 at 15:47
  • @bobuhito: What kind of problem are you working on where using an 8-bit microcontroller like AVR to get 1-cycle indexing of registers (via their memory-mapped address) is a viable alternative to a modern x86-64 chip like Skylake or Ryzen? And BTW, AVR's `ld` instruction take 2 cycles on normal AVR, or maybe 1.5 on xmega when loading from internal SRAM? https://www.microchip.com/webdoc/avrassembler/avrassembler.wb_LDD_Z.html – Peter Cordes Feb 03 '20 at 15:54
  • @PeterCordes I never mentioned AVR so am confused by your question. Anyway, the underlying problem is that I want to keep the values in 8 registers always sorted while 2 of these values step from 0 to 255 (with the other 6 values being constants between 0 and 255). How would you step this as quickly as possible? – bobuhito Feb 03 '20 at 16:35
  • @bobuhito Hm... I'd use a state machine for that. What do you plan to do with the other values that do not increment? Clearly, this is not the whole picture. Perhaps post some pseudo code of the entire operation you want to implement (and not just the tiny part you want to use indirect addressing for). – fuz Feb 03 '20 at 16:37
  • @bobuhito User erlkoenig mentioned AVR. I suppose Peter just got confused there. – fuz Feb 03 '20 at 16:37
  • @fuz "State machine" is a bit too abstract for me to work with - Let's just stick with your Skylake X. All 8 values are used in a calculation which is very fast (3 clock cycles if values are always sorted in the fixed registers). I'm then summing the "calculations" to get an average (over the 2 values which step). – bobuhito Feb 03 '20 at 16:43
  • @bobuhito Your problem description too is a bit too abstract for me to work with. Best would be if you could add a self-contained piece of code to your question that demonstrates what you are trying to achieve. I can then try and rewrite it into fast assembly. If all you want to do is performing some simple calculations, consider if you actually need to perform all 256 iterations or if you can use some sum formula instead. – fuz Feb 03 '20 at 16:48
  • @bobuhito: You said "I'd like to get this all down to 1 clock cycle with a different CPU". So I assumed you were talking about the other ISAs that can do this in 1 instruction, like AVR. Of course, some CPUs can index L1d cache in 1 cycle, like early MIPS or various low clock speed modern chips like some ARM, I think. Especially ones without virtual memory. (When the clock speed is slower, you don't need to break things up into as many pipeline stages. But performance in seconds is what really matters.) – Peter Cordes Feb 03 '20 at 16:51
  • @bobuhito: If 4 or 5 cycle latency for a single uop operation on Skylake is too slow for you (or 0 extra front-end uops if you use it as a memory source for another ALU instruction), it's unlikely that any other CPU can solve your overall problem any faster in nanoseconds. Generally Skylake cores are the fastest single-threaded performance around when you take clock speed into account, not just instructions (or real work) per clock. (Ryzen will be similar for this problem, and has a wider front-end.) – Peter Cordes Feb 03 '20 at 16:55
  • @bobuhito: For incrementing values you don't have to bottleneck on latency, either L1d load-use or store-forwarding. A loop that embeds the keep-sorted logic could maybe avoid actually reloading a lot of the time. And there's definitely going to be some instruction-level parallelism to hide latency when you do reload. So total focus on latency from index to array element makes zero sense. You do know the difference between latency and throughput, right? And that you can't just add up costs for different instructions to get a total time on modern CPUs. – Peter Cordes Feb 03 '20 at 16:57