
When evaluating two alternatives for solving a problem by comparing clock-cycles and latencies, if both come out roughly the same, is there a better way to decide which to use?

Example - Converting to Hexstring

An example I was looking at involves converting an integer value to hex for output. There are two basic approaches:

  1. use a simple lookup of the hex-digit from a string "0123456789abcdef" using ldr (roughly 3 clock-cycles); or
  2. compare the remainder with 10 and either add '0' or 'W' which involves a cmp and then two conditional adds (e.g. addlo and addhs) which at roughly 1 clock-cycle each is again about 3 clock-cycles.

(using the rough latencies from the link in the answer to Instruction execution latencies for A53 -- there apparently isn't a good A53-specific latency reference)
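For concreteness, here is a minimal C sketch of what the two alternatives compute (the `hex_lut`/`hex_add` names are just for illustration); both map a digit 0-15 to its ASCII hex character:

```c
#include <assert.h>

static const char hexdigits[] = "0123456789abcdef";

/* alternative 1: table lookup (what the ldrb does) */
static char hex_lut(unsigned d) {
    return hexdigits[d & 0xF];
}

/* alternative 2: compare and conditionally add (cmp / addlo / addhs) */
static char hex_add(unsigned d) {
    d &= 0xF;
    return d < 10 ? d + '0' : d + 'W';   /* 'W' == 'a' - 10 */
}
```

Both produce identical output for all 16 digit values; the question is purely about how the two instruction sequences behave on the core.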

Example - Hex Convert Loop Alternatives

The following is code for a Cortex-A53 (Raspberry Pi 3B):

    hexdigits:  .asciz "0123456789abcdef"
    ...
        ldr     r8, hexdigitadr       /* load address for hexdigits */
    ...

    hexcvtloop:
        cmp     r6, 0                 /* separation of digits done? */
        beq     hexcopy               /* copy tmp string to address */
        udiv    r0, r6, r7            /* divide by base, quotient in r0 */
        mls     r2, r0, r7, r6        /* mod (remainder) in r2 */
        mov     r6, r0                /* quotient to value */
        
        /* alternative 1 - ASCII hexdigit lookup */
        ldrb    r2, [r8, r2]          /* hexdigit lookup */
        
        /* alternative 2 - add to obtain ASCII hexdigit */
        cmp     r2, 10                /* compare digit to 10 */
        addlo   r2, r2, '0'           /* convert to digits '0'-'9' */
        addhs   r2, r2, 'W'           /* convert to 'a'-'f' */
        
        strb    r2, [r5], 1           /* store in tmp string */
        b       hexcvtloop

I understand the reference's stated clock-cycles do not account for other factors: interrupts, memory speed, cache misses, etc.

If my rough estimate of about 3 clock-cycles for either approach (the hex-digit lookup with ldr, or the cmp, addlo, addhs sequence adding to the remainder) is fair, is there another consideration that would decide between the two, or is it basically personal preference at that point?

(I'm not overly concerned with getting a cortex-a53 specific answer, but am more interested in if there are other ARM general metrics I would look to next -- or if it's just "up to you" at this point)

Peter Cordes
David C. Rankin
  • Hopefully this loop doesn't run very often; most int->string conversions will use base 10 or 16, which are worth checking for specifically if you care about performance, so you can avoid `udiv`. (And for base 16 you can even generate digits in MSD-first order, instead of only LSD-first with div / mod, so you don't have to reverse later; see x86 [How to convert a binary integer number to a hex string?](https://stackoverflow.com/q/53823756). Like SSSE3, NEON also has a byte shuffle you could use to do all the hex digits in parallel). – Peter Cordes Oct 04 '21 at 05:37
  • But if we take this as just an example, without considering algorithmic optimizations like that, one consideration is trading off code-size vs. a data lookup table needing to stay hot in cache. Also, style note, `'W'` is a weird way to express `-10 + 'a'`. It's the right number (I assume) but the reason why has little to do with it representing capital W, so the semantic meaning is all wrong. – Peter Cordes Oct 04 '21 at 05:39
  • BTW, you want `ldr r8, =hexdigitadr` to get the address in a reg. Or a manual `add` with `PC` if you put the table itself nearby, instead of the address in a literal pool. (And like I alluded to earlier, this generates digits in reverse order, but you're storing them forwards in memory.) – Peter Cordes Oct 04 '21 at 05:44
  • Ah yes, the `'W'` is just a pull from the ASCII chart, and yes, it is `-10 + 'a'`. This arose when looking at several ways to approach this. The intent was "what do I look at if checking the latencies come up within one or two core clock-cycles -- or does that really matter at that point?" (for something non-critical). This loop is just an example loop and won't run more often than a student can type an integer in. – David C. Rankin Oct 04 '21 at 05:45
  • My point with how often it runs was that when tuning, it's important to consider whether a block is hot or not. If it's not, don't spend much time looking at it, or optimize for size (usually overall size including data), if speed is roughly equal. That means don't use the LUT, unless you can share it with an efficient loop that's specific to hex. – Peter Cordes Oct 04 '21 at 05:47
  • On an in-order CPU like A53, latency isn't really separate from throughput, but the amount of software-pipelining you can use to hide it can matter, since the HW won't do it for you. Load and immediate store is not great. – Peter Cordes Oct 04 '21 at 05:49
  • Now that is a consideration I was looking for: the size, and the LUT contributing to the size compared with what can be calculated. There wouldn't be anything else sharing the string, so it would be specific to this routine. It would also cost a fixed two bytes of storage. While not much space, with a myriad of functions on an embedded device, if each took an additional 2 bytes for a convenient lookup, that would add up. – David C. Rankin Oct 04 '21 at 05:53
  • Besides overall size, there's cache footprint if called at all frequently. ldr is slower if it doesn't hit in L1d cache, much slower on high-end cores. (And data vs. code cache pressure. And iTLB vs. dTLB, assuming ARM doesn't have a unified first-level TLB). – Peter Cordes Oct 04 '21 at 06:26
  • Peter, that answers my question. If you want to do a quick paragraph answer I can select from what you provided in the comments, we can close this one out. For the A53 specifically, your point about software pipelining and the load-and-immediate-store not being a great choice was helpful. The power-of-two optimizations for hex conversion and the link to your x86 answer were likewise great. The writeup there, with multiple options for x86, captures a lot of the tradeoffs. – David C. Rankin Oct 04 '21 at 06:34
  • Sure, took a few minutes to re-edit those comments into an answer, now that I know that's the kind of answer you were looking for. – Peter Cordes Oct 04 '21 at 07:10

1 Answer


Hopefully this general-case loop using udiv doesn't run very often, with special-case blocks handling base 10 and 16 (see footnote 1).

It's important to consider whether a block is hot or not. If it's not, don't spend much time looking at it, or optimize for size (usually overall size including data) in ways that don't lose badly on speed. That means don't use the LUT. (Except that if you have a specialized int->hex loop, it could share the same LUT, e.g. loaded as one 16-byte SIMD vector.)

On an in-order CPU like A53, per-instruction latency isn't really separate from throughput, but the amount of software-pipelining you can use to hide it can matter, since the HW won't do it for you. Especially for load latency: Load and then store the result in the next instruction is something to avoid, since it'll probably stall for about the load-use latency.

You always want to find some work you can do between a load and the first use, if you care about your code ever running on in-order cores. That might mean rearranging the loop. For example, you can probably put the store after preparing the next digit. In this slow general case, that's a udiv which should be plenty slow to hide load-use latency. (High-performance in-order ARM cores still do have memory-level parallelism and scoreboard loads or something.)
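As a rough C-level sketch of that rearrangement (illustrative names; a compiler may reschedule this anyway, so in asm you'd do the scheduling by hand), the idea is to compute the next digit, including its table load, before storing the current one, so the load and the dependent store are never back-to-back:

```c
#include <assert.h>

static const char hexdigits[] = "0123456789abcdef";

/* Software-pipelined digit loop: the next digit's lookup starts
   before the current digit is stored, giving an in-order core
   independent work between a load and the store that uses it.
   Emits digits LSD-first into buf and returns the count, like the
   asm loop in the question (so the string still needs reversing). */
static unsigned cvt(unsigned val, unsigned base, char *buf) {
    unsigned n = 0;
    char digit = hexdigits[val % base];    /* first digit's load */
    val /= base;
    while (val) {
        char next = hexdigits[val % base]; /* start the next digit... */
        buf[n++] = digit;                  /* ...before storing this one */
        digit = next;
        val /= base;
    }
    buf[n++] = digit;                      /* final pipelined digit */
    return n;
}
```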

There can be a tradeoff between code cache footprint vs. data cache footprint, too, depending on function alignment of surrounding functions. If this runs infrequently, it will likely miss in data cache when it does run, and then the ldr will be a lot more expensive than normal, especially on a high-end CPU. (High clock frequency means DRAM or even L2 latency is many CPU clocks).

Also, if there's no other "hot" data in the whole page where the LUT would have been, that's a dTLB entry that doesn't need to stay hot, or doesn't need to evict another one.


Footnote 1: Most int->string conversions will use base 10 or 16, which are worth checking for specifically if you care about performance, so you can avoid udiv. (Use a multiplicative inverse for base 10.)
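For example, a minimal sketch of the multiplicative-inverse trick for unsigned division by 10 (`div10` is an illustrative name; the magic constant is the standard one compilers emit for `n / 10u`):

```c
#include <assert.h>
#include <stdint.h>

/* Unsigned divide-by-10 via multiplicative inverse:
   0xCCCCCCCD = ceil(2^35 / 10), so a 64-bit multiply followed by a
   right shift of 35 gives the exact quotient for every 32-bit input.
   On ARM this is one umull plus a shift instead of a udiv. */
static uint32_t div10(uint32_t n) {
    return (uint32_t)(((uint64_t)n * 0xCCCCCCCDu) >> 35);
}
```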

Base 16 is very special because it's a power of 2, and a whole number of 4-bit groups fills a 32-bit integer. You can even generate digits in MSD-first order, because each digit depends only on its own bits, not all higher bits, instead of only LSD-first as with div / mod, where you have to store digits in reverse order. See the x86 asm answer [How to convert a binary integer number to a hex string?](https://stackoverflow.com/q/53823756). Like SSSE3, ARM NEON also has a byte shuffle you could use to do all the hex digits in parallel, adapting one of the strategies from the x86 SIMD parts of my linked answer.
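A minimal C sketch of the MSD-first idea (`u32_to_hex` is an illustrative name): each output digit is one 4-bit field of the input, extracted from the top nibble down, so there's no divide and no reversal afterwards:

```c
#include <assert.h>
#include <string.h>

static const char hexdigits[] = "0123456789abcdef";

/* MSD-first hex conversion of a 32-bit value: walk the nibbles from
   bit 28 down to bit 0, indexing the LUT with each 4-bit field.
   Always writes 8 digits (leading zeros kept) plus a NUL. */
static void u32_to_hex(unsigned val, char *out) {
    for (int shift = 28; shift >= 0; shift -= 4)
        *out++ = hexdigits[(val >> shift) & 0xF];
    *out = '\0';
}
```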

Code review

Style: 'W' is a weird way to express -10 + 'a'. It's the right number, but the reason why has little to do with it representing capital W, so the semantic meaning is all wrong.

Also, you want ldr r8, =hexdigitadr to get the address in a reg, not load from that label address. Or a manual add with PC if you put the table itself nearby, instead of the address in a literal pool.

And like I alluded to earlier, this generates digits in reverse order, but you're storing them forwards in memory, so they'd need reversing.

Peter Cordes