4

Suppose I have a short array v of say 8 int64_t. I have an algorithm that needs to access different elements of that array, which are not compile-time constants, e.g. something like v[(i + j)/2] += ... in which i and j are variables not subject to any kind of constant propagation.

Ordinarily I’d keep the array I memory, calculate the array index, load the array from memory in that position, and then store the result.

But suppose that, for valid reasons which I won’t go into, I want to keep the full array in registers -- the array is size-limited and fits the register bank.

If I were just reading from, and not writing to, the array, I could use (in ARMv8 NEON) the TBL instruction to perform table lookups. But what about the case of writing?

All I can think of is self-modifying code, encoding the array index directly into the instructions and executing it. I know this carries performance penalties when first running, but it might even work if the same code were executed over and over again.

Other than that, any ideas? Is it even possible? I reviewed the parts relevant to the instruction set and encoding of the ARMv8 architecture reference manual, and so far I’m inclined to say no, but maybe someone knows an obscure instruction or addressing mode that would help here.

swineone
  • 2,296
  • 1
  • 18
  • 32
  • 3
    There might be a way to write to the array using a `TBL` instruction with a custom-crafted index vector, but it's going to be a lot slower than just going through memory. Eight 64-bit integers can be loaded at once with `LD4`, so overwriting one array element in memory and then reloading the whole thing in one go might be good enough if writes only happen occassionally. – fuz Jan 27 '23 at 11:59
  • 1
    Very few ISAs have a way to do run-time-variable indexing (indirect access) of register-numbers. AVR microcontrollers do memory-map their architectural registers, along with a few other very early or low-end-microcontroller type ISAs. Within a single register, shuffle instructions like TBL can be effective for reading, and maybe broadcast + a blend for writing an element if there isn't an insert instruction. – Peter Cordes Jan 28 '23 at 00:29

1 Answers1

3

If you want to access x8, then there is no other way than to have an instruction that encodes x8 as a source register. So outside of emitting instructions at runtime, the only index-based solution I can come up with is to have a stub for each register, and branch based on the index like a switch-case. Assuming your array spans x8 through x15:

.p2align 2 // maybe change this to align to cacheline size?
read_reg:
    adr x2, 1f
    add x2, x0, lsl 3
    br x2
1:
    mov x0, x8
    ret
    mov x0, x9
    ret
    mov x0, x10
    ret
    mov x0, x11
    ret
    mov x0, x12
    ret
    mov x0, x13
    ret
    mov x0, x14
    ret
    mov x0, x15
    ret

Writing would work the same way. This of course has the chance of messing up branch predictions. One other "hack" I can think of to not use branches at all is to combine csel with a direct move to the nzcv system register:

.p2align 2
read_reg:
    lsl x0, x0, 28
    msr nzcv, x0
    csel x4, x8, x9, vc // bit 0
    csel x5, x10, x11, vc
    csel x6, x12, x13, vc
    csel x7, x14, x15, vc
    csel x4, x4, x5, lo // bit 1
    csel x5, x6, x7, lo
    csel x0, x4, x5, ne // bit 2
    ret

This could be extended to a maximum of 16 registers. I'm not too certain on the performance constraints of the msr or whether it requires an isb on some architectures though - on an Apple M1 at least, it doesn't. And the case for writing wouldn't be as compact, as you need at the very least 8 instructions to target each register. :/

Siguza
  • 21,155
  • 6
  • 52
  • 89
  • 2
    Neat trick with `lsl`/`msr`. If it's slow, at worst three `tst` instructions could take their place. Or start with `tst` on the low bit, then `lsls` to get one bit into C (shifted out) and another into N (MSB). – Peter Cordes Jan 28 '23 at 00:36
  • 1
    On sufficiently recent ARM64 chips, you can use an `RMIF` instruction for the first two. Your jump table approach is very similar to [this one](https://stackoverflow.com/a/60042077/417501). – fuz Jan 28 '23 at 02:55