Using LDRD in GNU C inline asm? What constraints to use?

Question

TL;DR: I'm playing around with extended asm and burned my fingers. Do my constraints make sense?

As I am playing around with memory, I wanted to test reading some memory manually on an ARM CPU (Cortex-A9).

(Disclaimer: Learning purpose here, of course I agree that relying on an optimizer is 99.999% of the time the right thing to do but I would really like to understand why everything explodes here).

On the hardware in question:

The CPU-to-memory bus is 64 bits wide, so I'm trying to use the ldrd instruction to load two 32-bit words at once.
The data in memory is 128-bit aligned, so let's use the ldrd instruction twice.

My problem is that the generated assembly (or the failed attempt to generate it) does not make sense, and this is independent of:

Compiler (tested with GCC and clang)
Optimization level (tested with -O0 -Og -O2 -O3)
Cross / native (tested with arm-linux-gnueabihf-gcc and native gcc)

Here is a minimal example demonstrating the issue:

#include <stdint.h>


// custom structure: represent 128 bits
typedef struct __attribute__ ((packed)) u128
{
  uint32_t a;
  uint32_t b;
  uint32_t c;
  uint32_t d;
} u128;



int main(void)
{
  uint32_t *ptr = (uint32_t*) 0xdeadbeef; // For test purpose, just a random location in memory
  u128 words;

  // 1st read: 64 bits
  asm volatile inline (
    "ldrd %[high_32b], %[low_32b], [%[addr]], #8"
    : [high_32b] "=X" (words.a), [low_32b] "=X" (words.b)
    : [addr] "r" (ptr));

  // 2nd read: 64 bits
  asm volatile inline (
    "ldrd %[high_32b], %[low_32b], [%[addr]], #8"
    : [high_32b] "=X" (words.c), [low_32b] "=X" (words.d)
    : [addr] "r" (ptr));

  return 0;
}

GCC

arm-linux-gnueabihf-gcc -Wall -Wextra -O3 -g -ggdb broken_asm.c -o broken_asm
/tmp/ccIaxiTz.s: Assembler messages:
/tmp/ccIaxiTz.s:51: Warning: base register written back, and overlaps one of transfer registers

disassembly (radare2 -A -c 's sym.main; pdf' broken_asm)

│ 0x000003da f3e80221 ldrd r2, r1, [r3], 8
| 0x000003de f3e80232 ldrd r3, r2, [r3], 8 ; broken_asm.c:27 asm volatile inline (

So, yes indeed, the warning makes sense: The ldrd r3, r2, [r3], 8 seems broken

(expected: sources != destination. For instance: ldrd r3, r2, [r4], 8)

Clang

clang -mtune=cortex-a9 --target=arm-linux-gnueabihf -isystem /usr/arm-linux-gnueabihf/include -Wall -Wextra -O3 -g -ggdb broken_asm.c -o broken_asm

broken_asm.c:22:5: error: Rt must be even-numbered
    "ldrd %[high_32b], %[low_32b], [%[addr]], #8"
    ^
:1:11: note: instantiated into assembly here
    ldrd r1, r2, [r0], #8
    ^
broken_asm.c:28:5: error: base register needs to be different from destination registers
    "ldrd %[high_32b], %[low_32b], [%[addr]], #8"
    ^
:1:11: note: instantiated into assembly here
    ldrd r0, r1, [r0], #8
    ^
2 errors generated.

So, let's read some error messages:

base register needs to be different from destination registers

OK, this is comparable to the issue with GCC (and yes, it feels more like an error than a warning).

error: Rt must be even-numbered

Wait, what? ldrd r1, r2 ... Indeed, the first operand must be an even-numbered register and the second one the following odd-numbered register.

From the ARM Instructions Reference:

Rt: The first destination register. For an ARM instruction, must be even-numbered and not R14.

Rt2: The second destination register. For an ARM instruction, must be <R(t+1)>.

I am pretty sure I'm doing something wrong in my extended asm (as it's nearly the only effective code here, it's not so hard to guess).

Here is my constraints understanding so far:

Output:

The registers in which I would like the output are, as far as I understand, write-only.

‘=’ identifies an operand which is only written

I started with "g" as a constraint (same effect) but opted for "X" to give the mighty compiler more freedom:

'X' Any operand whatsoever is allowed.

Input:

I'm using "r" as I would like both ldrd instructions to read from the same register. I also tried with "X" but got the same issue.

'r' A register operand is allowed provided that it is in a general register.

Some notes as this post is too short :/

Host: Linux (Debian)
Target: Zynq 7000 (PS side: Cortex A9)
Clang --version: Debian clang version 11.0.1-2
cross gcc: arm-linux-gnueabihf-gcc (Debian 10.2.1-6) 10.2.1 20210110
native gcc: gcc (Debian 10.2.1-6) 10.2.1 20210110
Tweaking a binary to manually set registers in op-codes seems to work as intended

So, I genuinely have no idea what I'm doing wrong here. Any pointer welcomed.

*"relying on an optimizer is 99.999% of the time the right thing to do".* Whoever said that to you, don't take it as a fact. It really depends on the task you're given. Sometimes, you don't have to care about optimization at all. Sometimes, you can't depend on anything to optimize for you. A lot of people can't do better than an automated optimizer, so they should rely on the optimizer *99.999% of the time*. — xiver77, Jul 13 '22 at 09:09
The first two operands of `ldrd` have to be registers, so using `=r` is correct. Using `=X` means anything (memory, vector register, etc.) can be an operand. — xiver77, Jul 13 '22 at 09:21
@xiver77: `"=r"` isn't sufficient, though. `ldrd` requires a pair of adjacent registers whose number differs only in the low bit. e.g. `r0, r1` or `r4, r5`. (Unlike ARMv8 `ldp` which can take an arbitrary pair of registers.) — Peter Cordes, Jul 13 '22 at 09:27
This is also using a read-only input constraint (not `"+r"`) for the pointer, but a write-back addressing mode. So it's modifying that input without telling the compiler about it. Also missing [How can I indicate that the memory \*pointed\* to by an inline ASM argument may be used?](https://stackoverflow.com/q/56432259) — Peter Cordes, Jul 13 '22 at 09:29
https://gcc.gnu.org/onlinedocs/gcc/Machine-Constraints.html doesn't list an ARM constraint for a register pair that satisfies the `ldrd` requirement. Searching in the GCC source's https://github.com/gcc-mirror/gcc/blob/master/gcc/config/arm/constraints.md doesn't immediately find anything relevant, so IDK how GCC itself goes about picking registers in order to make use of `ldrd` when generating asm from C, or if there is a way to do that for inline asm constraints. — Peter Cordes, Jul 13 '22 at 09:33
Interestingly, [stp aarch64 instruction must be used with "non-contiguous pair of registers"](https://stackoverflow.com/q/25195861) says `ldrd` in Thumb2 doesn't have the same requirements on the register pair. So if you can build for `-mthumb`, that could let you just use `"=r"` outputs. — Peter Cordes, Jul 13 '22 at 09:36
@PeterCordes Is [this](https://godbolt.org/z/Pnnrf7jb4) a valid approach with the pair requirements? — xiver77, Jul 13 '22 at 09:42
@xiver77: Yes, if you take away GCC's freedom to do register allocation (e.g. picking call-preserved ones if it wants to inline into a loop with function calls), you can nail down the choices to valid ones. You forgot a `"memory"` clobber, though, or a dummy memory source operand. (And note that it chooses to vectorize the store to memory of the return-value struct using NEON registers, so it would have been more efficient to let it load using a 16-byte NEON load. Unless maybe the values were stored with separate narrow stores, and `ldrd` avoids a store forwarding stall.) — Peter Cordes, Jul 13 '22 at 09:49
Thank you both for pointing out the constraint problems. I actually checked some assembler generated by GCC (no warning; intended code behaviour; no hand-written ASM) which uses `ldrd` as such: `ldrd r0, r3, [ip]` or `ldrd r4, r8, [structure.map]`. Are these instructions correct? I also read that the registers must be "contiguous". I probably missed something but might be able to come with a minimal working example demonstrating it if it helps. — Zermingore, Jul 13 '22 at 09:51

score 4 · Accepted Answer · answered Jul 14 '22 at 00:41

GCC generally supports the same inline assembly features as armclang, though unfortunately the GCC manual does not document them. In the armclang docs, you can read:

If you use a 64-bit value as an operand to an inline assembly statement in A32 or 32-bit T32 instructions, and you use the r constraint code, then an even/odd pair of general purpose registers is allocated to hold it. This register allocation is not guaranteed for the l or h constraints.

Using the r constraint code enables the use of instructions like LDREXD/STREXD, which require an even/odd register pair. You can reference the registers holding the most and least significant halves of the value with the Q and R template modifiers.

So loading two registers with ldrd could look like:

#include <stdint.h>

uint64_t get_pair(void *ptr) {
    uint64_t result;
    asm("ldrd %Q[pair], %R[pair], [%[addr]]"
        : [pair] "=r" (result)
        : [addr] "r" (ptr)
        : "memory");
    return result;
}

Try on godbolt

If you want to extract the two halves separately, you can follow the asm block with something like

uint32_t lo, hi;
lo = result; // conversion truncates
hi = result >> 32;

With optimizations enabled, you can be confident that the compiler will just store the high-half register and not actually execute a shift. This is a common idiom that compilers recognize.

There are a couple other issues with the code in your question:

You are using the writeback post-increment addressing mode which modifies your address register, but you do not inform the compiler of that. You would need to make your addr an input-output operand: list it with the outputs and use the +r constraint. But keep in mind that this is pointless unless you actually use the updated value later in the code; if not, then just use a non-writeback addressing mode.
By default the compiler assumes your asm statement does not read any memory, so memory writes could be reordered past it. asm volatile does not prevent this; it only prevents the compiler from omitting the asm entirely when it thinks its outputs are unused.

A memory clobber as in my example above is the simplest and crudest way to do this; it tells the compiler that the asm statement may read or write arbitrary parts of memory, and so no other memory accesses may be reordered past it. Better is an m input operand, with a variable whose type has the size that is to be read; I won't bother with it here, but see How can I indicate that the memory *pointed* to by an inline ASM argument may be used? for more information.
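The two points above can be sketched as follows. This is only an illustration under assumptions: the helper names `load64_postinc` and `load64_dummy_mem` are made up here, the data is assumed to be word-aligned, and the `#else` branches are stand-ins added so that the sketch also builds on non-ARM hosts. The first helper keeps the writeback addressing mode but declares the pointer as a `+r` operand; the second drops writeback and replaces the "memory" clobber with an unreferenced `m` input covering the 8 bytes that are read.

```c
#include <stdint.h>

/* Hypothetical helper: loads p[0] and p[1] with one ldrd using post-indexed
   addressing, then reports the advanced pointer back through *pp. */
static inline uint64_t load64_postinc(const uint32_t **pp)
{
    const uint32_t *p = *pp;
    uint64_t val;
#if defined(__arm__)
    /* "+r": the instruction writes the incremented address back into the
       base register, so the pointer is a read-write operand. Two output
       operands never share a register, so the base cannot overlap the pair. */
    __asm__("ldrd %Q[val], %R[val], [%[addr]], #8"
            : [val] "=r" (val), [addr] "+r" (p)
            :
            : "memory");
#else
    /* Stand-in for non-ARM hosts (an addition for this sketch only). */
    val = (uint64_t)p[0] | ((uint64_t)p[1] << 32);
    p += 2;
#endif
    *pp = p;
    return val;
}

/* Hypothetical helper: the same load without writeback. The unnamed "m"
   operand is never used in the template; it only tells the compiler that
   the two words starting at p are read by this statement. */
static inline uint64_t load64_dummy_mem(const uint32_t *p)
{
    uint64_t val;
#if defined(__arm__)
    __asm__("ldrd %Q[val], %R[val], [%[addr]]"
            : [val] "=r" (val)
            : [addr] "r" (p),
              "m" (*(const uint32_t (*)[2]) p));
#else
    /* Stand-in for non-ARM hosts (an addition for this sketch only). */
    val = (uint64_t)p[0] | ((uint64_t)p[1] << 32);
#endif
    return val;
}
```

In both helpers the word at the lower address ends up in the low half of the returned value, so the halves can be split with a truncating conversion and a shift by 32 as shown earlier.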

Rather than having your reads in two separate asm statements and using the writeback mode to change the pointer between them, I would put them in a single asm statement and skip the writeback addressing altogether. Here's my attempt (try on godbolt):

  uint64_t hi64, lo64; 
  asm inline("ldrd %Q[lo], %R[lo], [%[addr]] \n\t"
             "ldrd %Q[hi], %R[hi], [%[addr], #8]"
             : [lo] "=&r" (lo64), [hi] "=r" (hi64)
             : [addr] "r" (ptr)
             : "memory");
  words.a = lo64;
  words.b = lo64 >> 32;
  words.c = hi64;
  words.d = hi64 >> 32;

Note the "earlyclobber" & modifier on the lo operand, indicating that it is written before all the inputs are read. Without this, the compiler might use one of the lo registers for addr, in which case it would be overwritten by the first ldrd instruction, and the second one would break. However, we did not use & on hi; since addr is not needed after hi is written, it is okay if they use the same register.

Cool, so there is support for register-pairs in asm constraints. Re: your last example: if you're not using write-back, you can use a memory source operand, like `[mem] "m" (*ptr)`. Of course, that dereference needs to actually be valid in pure C (alignment / aliasing rules apply, so you might need an `__attribute__((may_alias))` typedef). Then you might as well break up the `ldrd`s into two asm statements, letting the compiler choose the addressing mode both times, and letting it maybe reuse some registers (e.g. by storing or doing math after the first load, before the second.) — Peter Cordes, Jul 14 '22 at 02:52
@PeterCordes: One thing I've never been clear about with the `m` constraint on ARM is which addressing modes it may use, given that different instructions permit different options. If operand 0 is `m`, could `%0` expand to, say, `[r1, #2048]`? That would be valid for `ldr %1, %0` but not for `ldrd %Q1, %R1, %0`, and definitely not for `lda %1, %0`. Or is it always the least-common-denominator `[rN]`? Then loading adjacent doublewords in separate asm statements will actually require an `add r1, r1, #8` instruction in between. — Nate Eldredge, Jul 14 '22 at 04:28
Ah, right, some instructions support fewer addressing modes or a smaller range. GCC handles that with machine-specific constraints like `Uv` (*A memory reference suitable for VFP load/store insns (reg+constant offset)*) or `Uq` (*A memory reference suitable for the ARMv4 ldrsb instruction.*) Unfortunately https://gcc.gnu.org/onlinedocs/gcc/Machine-Constraints.html doesn't mention one for ldrd, but if you're already looking in the source or clang docs, maybe you'd find one. So basically it's a similar design to `"i"` allowing any constant integer vs. "J" (-4095..4095) or "I" (8-bit rotated). — Peter Cordes, Jul 14 '22 at 04:35
GCC documents AArch64 constraints: `Q` (*A memory address which uses a single base register with no offset*) and `Ump` (*A memory address suitable for a load/store pair instruction in SI, DI, SF and DF modes*). And there's the generic `"o"` constraint for "offsetable", i.e. it has room for a non-zero constant displacement. — Peter Cordes, Jul 14 '22 at 04:38
@PeterCordes; It seems that for `m`, GCC decides based on operand type. So with `unsigned char *p;`, we have `"m" (*(int32_t *)(p + 2048))` expanding to `[r0, #2048]`, but change `int32_t` to `int64_t` and you get plain `[r1]` with `add r1, r0, #2048` preceding. https://godbolt.org/z/boz9xahr9 Thus in fact `m` with a 64-bit variable probably does always have an expansion safe for `ldrd / strd`. — Nate Eldredge, Jul 14 '22 at 04:42
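The idea from the last few comments (letting the compiler pick the addressing mode through an `m` operand of 64-bit type) might look roughly like the sketch below. It rests on the observation above that GCC seems to choose an `ldrd`-compatible address for a 64-bit `m` operand, which is not a documented guarantee. The helper name `load64_mem_operand` is hypothetical, a little-endian target is assumed, and the `#else` branch is a stand-in added so that the sketch also builds on non-ARM hosts.

```c
#include <stdint.h>

/* Hypothetical helper: loads *p with ldrd, letting the compiler expand the
   memory operand to an address such as [rN] or [rN, #imm]. */
static inline uint64_t load64_mem_operand(const uint64_t *p)
{
#if defined(__arm__)
    uint64_t val;
    /* Assumption: for a 64-bit "m" operand the compiler only picks
       addressing modes that ldrd can encode (observed, not documented).
       The "m" input also tells the compiler which memory is read, so no
       "memory" clobber is needed. */
    __asm__("ldrd %Q[val], %R[val], %[mem]"
            : [val] "=r" (val)
            : [mem] "m" (*p));
    return val;
#else
    /* Stand-in for non-ARM hosts (an addition for this sketch only). */
    return *p;
#endif
}
```

Taking a `const uint64_t *` parameter sidesteps the aliasing concern mentioned above; passing a pointer to data of another type would still need something like a `may_alias` typedef.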
