Treating the CPU registers as an array is really not a common approach these days. The last architecture I know of that allowed this was the PDP-11, and it died out in the late 1980s. Why don't you put your array into some memory location like any other array?
That said, you could use a computed jump. This also replaces a data dependency (indexed addressing mode) with a control dependency, so out-of-order execution doesn't have to wait for the index input to even be ready before it can start running code that uses the final RAX. Of course this assumes correct branch prediction, which is unlikely if the index changes often. A branch mispredict wastes many cycles in which little work gets done, whereas the small latency of a load that hits in L1d cache overlaps very easily with independent work.
The throughput cost is higher than for an array in memory: some address computations, one jump, one move, and a ret, instead of just a mov, or even just a memory operand with an indexed addressing mode.
To inline this code, simply replace the jmp *%rax with a call *%rax, costing another uop. Or replace the ret instructions with a jmp to a label at the bottom and increase the stride of the jump table to 8 (scaling the index by 8 in the lea accordingly) to account for the longer encoding.
# select a register from r8...r15 according to the value in rdi
select:
        lea     labels-4*8(%rip), %rax  # rdi = 8 is the first jump table entry
        lea     (%rax,%rdi,4), %rax     # pointer to the appropriate entry
        jmp     *%rax                   # computed jump

        .align 4
labels:
        mov     %r8, %rax
        ret
        .align 4
        mov     %r9, %rax
        ret
        .align 4
        mov     %r10, %rax
        ret
        .align 4
        mov     %r11, %rax
        ret
        .align 4
        mov     %r12, %rax
        ret
        .align 4
        mov     %r13, %rax
        ret
        .align 4
        mov     %r14, %rax
        ret
        .align 4
        mov     %r15, %rax
        ret
While this is probably faster than three conditional jumps (depending on access pattern), it surely won't beat just using an array.
You may also be able to use code like this, assuming the index is in eax. This works by copying the index bits into CF, SF, and PF and then using a bunch of ALU operations to distinguish them:
        imul    $0x4100, %eax, %eax     # index bit 0 -> AH bit 0, bit 1 -> AH bit 7, bit 2 -> AH bit 2
        sahf                            # load AH into the flags: CF, SF, PF

        # bit 0
        mov     %r8, %rax
        cmovc   %r9, %rax
        mov     %r10, %rcx
        cmovc   %r11, %rcx
        mov     %r12, %rdx
        cmovc   %r13, %rdx
        mov     %r14, %rbx
        cmovc   %r15, %rbx

        # bit 1
        cmovs   %rcx, %rax
        cmovs   %rbx, %rdx

        # bit 2
        cmovp   %rdx, %rax
The result ends up in %rax (note that %rbx, %rcx, and %rdx are clobbered). Due to the high instruction-level parallelism and lack of branches in this code, it should perform better than the jump table above, unless the index is almost always the same.
(stolen from this answer).