
On Cortex-A processors (AArch64 mode), is there a rule of thumb for optimizing for speed? For example, is it always better to read from memory than to take a branch?

Consider the simplest conversion to a hexadecimal string as an example:

convert:
    . . .
    cmp x9, 9
    b.le . + 8
    add x9, x9, 0x07
    add x9, x9, 0x30
    strb w9, [x10, -1]!
    . . .
    b convert

vs

convert:
    . . .
    ldrb w9, [x11, x9]    ; x11 - ptr to alphabet string: "0123456789ABCDEF"
    strb w9, [x10, -1]!
    . . .
    b convert

Thanks in advance for any tips.

Alexander Zhak
  • In this particular example, you can use `csel` and have the best of both worlds - no memory read and no branch. – Nate Eldredge Jan 30 '21 at 19:19
  • In general, I think the answer would depend heavily on whether this was called in a tight enough loop that you could rely on your lookup table staying in L1 cache, and also on how accurately the conditional branch could be predicted. – Nate Eldredge Jan 30 '21 at 19:27
  • A tiny lookup table (less than 1 cache line) like that is generally fine, especially when it saves an unpredictable branch. Even saving a couple instructions like add/csel can be worth it if hot in L1d. Although note that for this specific case, you'd likely be better off using SIMD instead of a scalar loop, and thus using branchless techniques like a masked add, like in my x86 SSE2 version: [How to convert a binary integer number to a hex string?](https://stackoverflow.com/q/53823756). Or using `tbl` for SIMD byte lookups, like the SSSE3 pshufb version; I think AdvSIMD / NEON can do that. – Peter Cordes Jan 30 '21 at 21:28
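
For comparison, the three scalar options mentioned in the question and comments (conditional add, table lookup, branchless select) can be sketched in C roughly as below. This is only an illustration: the function names are hypothetical and not from the original post, and whether a compiler actually turns the select variant into `csel` (or the first variant into a branch) depends on the compiler and optimization flags, so the generated code is worth checking.

```c
#include <stdint.h>

/* All names below are hypothetical, chosen for illustration only. */

/* Variant 1: conditional add, mirroring the cmp / b.le / add sequence. */
static inline char hex_digit_branch(unsigned nibble) {
    unsigned c = nibble;
    if (c > 9)
        c += 0x07;              /* skip the gap between '9' and 'A' */
    return (char)(c + 0x30);    /* add ASCII '0' */
}

/* Variant 2: lookup in a 16-byte alphabet (fits in one cache line). */
static inline char hex_digit_table(unsigned nibble) {
    static const char alphabet[] = "0123456789ABCDEF";
    return alphabet[nibble & 0xF];
}

/* Variant 3: compute both candidates and select one; a compiler
   targeting AArch64 may (not guaranteed) emit csel for this. */
static inline char hex_digit_select(unsigned nibble) {
    unsigned as_digit  = nibble + 0x30;   /* '0'..'9' */
    unsigned as_letter = nibble + 0x37;   /* 'A'..'F' for 10..15 */
    return (char)(nibble <= 9 ? as_digit : as_letter);
}

/* Writes 16 hex digits plus a terminating NUL into buf, filling from the
   end of the buffer backwards, like the pre-indexed strb in the question. */
static void to_hex64(uint64_t value, char buf[17]) {
    char *p = buf + 16;
    *p = '\0';
    for (int i = 0; i < 16; i++) {
        *--p = hex_digit_select((unsigned)(value & 0xF));
        value >>= 4;
    }
}
```

All three helpers produce the same character for inputs 0 through 15, so they can be swapped inside the loop and benchmarked on the target core; as the comments note, the result is likely to depend on branch predictability and on whether the table stays hot in L1d.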

0 Answers