ARMv8 floating point output inline assembly

Question

For adding two integers, I write:

int sum;
asm volatile("add %0, x3, x4" : "=r"(sum) : :);

How can I do this with two floats? I tried:

float sum;
asm volatile("fadd %0, s3, s4" : "=r"(sum) : :);

But it gives me an error:

Error: operand 1 should be a SIMD vector register -- `fadd x0,s3,s4'

Any ideas?

score 2 · Answer 1 · answered Jan 03 '19 at 23:30


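Because registers can have multiple names in AArch64 (v0, b0, h0, s0, d0 all refer to the same register), it is necessary to add an operand modifier to the template string (%s0 to print the s name, %d0 to print the d name), together with the "w" constraint for an FP/SIMD register: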

On Godbolt

float foo()
{
    float sum;
    asm volatile("fadd %s0, s3, s4" : "=w"(sum) : :);
    return sum;
}

double dsum()
{
    double sum;
    asm volatile("fadd %d0, d3, d4" : "=w"(sum) : :);
    return sum;
}

Will produce:

foo:
        fadd s0, s3, s4 // sum
        ret     
dsum:
        fadd d0, d3, d4 // sum
        ret
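
The functions above still read whatever happens to be in s3/s4 or d3/d4. As a minimal sketch (not part of the original answer), the same modifiers can be applied to input operands as well, assuming GCC or Clang extended asm. The names add_f32 and add_f64 are hypothetical, and the plain-C branch exists only so the file also builds on non-AArch64 hosts:

```c
/* Hypothetical helper: single-precision add through inline asm on AArch64. */
float add_f32(float a, float b)
{
    float sum;
#if defined(__aarch64__)
    /* "w" selects an FP/SIMD register; %s prints it under its s name. */
    __asm__("fadd %s0, %s1, %s2" : "=w"(sum) : "w"(a), "w"(b));
#else
    sum = a + b; /* fallback so the sketch compiles on other targets */
#endif
    return sum;
}

/* Hypothetical helper: double-precision variant using the %d modifier. */
double add_f64(double a, double b)
{
    double sum;
#if defined(__aarch64__)
    /* Same "w" constraint; %d prints the register under its d name. */
    __asm__("fadd %d0, %d1, %d2" : "=w"(sum) : "w"(a), "w"(b));
#else
    sum = a + b; /* fallback so the sketch compiles on other targets */
#endif
    return sum;
}
```

With inputs declared this way the compiler chooses the registers itself, so no fixed register names appear in the template.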

James Greenhalgh


Nice, I figured there was probably a modifier while writing my answer, but I didn't see it in the GCC manual. Is this documented anywhere? I only know of the modifiers for x86 registers being in the gcc manual, at the bottom of the Extended asm section: https://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html#x86-Operand-Modifiers – Peter Cordes Jan 04 '19 at 02:49
@PeterCordes The problem is that the gcc people are reluctant to document these things. If they're documented, then they're not allowed to change them (which they probably don't do much anyway). One could argue that there are already enough people using them that changing them would cause wide-spread consternation, but that has not yet proven to be a good enough argument to overcome the inertia (but feel free to try!). x86 got doc'ed cuz I was changing the asm docs and added them, and no one felt strongly enough about it to argue with me. I only did x86 cuz that's what I know. Sorry. – David Wohlferd Jan 04 '19 at 03:24
As an aside to James and OP: Be aware that since these modifiers AREN'T doc'ed, using them is unsupported. As with any undocumented feature, gcc can change them at any time. They probably *won't*, but they can. – David Wohlferd Jan 04 '19 at 03:26
If you wanted to document the subset of modifiers that we shouldn’t change, I’ll approve the patch on list. Some (these, %w0 for printing the w rather than x name) should just be documented and fixed. Especially where behavior is consistent between GCC and Clang. To my great shame, the best documentation we have is at https://github.com/gcc-mirror/gcc/blob/master/gcc/config/aarch64/aarch64.c#L7259 – James Greenhalgh Jan 04 '19 at 13:34
@PeterCordes - Is there any way you can take James up on his offer here? I'd love to help, but I know almost zero about arm, so selecting the subset is beyond what I can offer. I could help with the texinfo, but that's probably the least challenging part of this. – David Wohlferd Jan 04 '19 at 23:40
@DavidWohlferd: I don't really know ARM / AArch64 that well, but probably well enough to do this if I get around to it. I understand `x` vs `w` integer regs, and I think all the iterations of ARM FPUs have used the same FP register names if they support them at all. I know that q0 = d1:d0 = s3:s2:s1:s0 in 32-bit mode (like x86's AX=AH:AL), but in 64-bit mode each `s0..31` is the low 32 of separate 64 / 128-bit registers with no partial-register shenanigans. – Peter Cordes Jan 05 '19 at 00:03
@PeterCordes - If you produced a list of recommendations and emailed it to James (pretty sure that's his email in gcc's MAINTAINERS list), he might be willing to give you some feedback. In addition to James' link showing the complete list of gcc's modifiers, there's also [this](http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.100067_0610_00_en/yfm1517569442821.html) for armclang. Someone thought at least those specific ones were important enough to doc. Whether any of the rest are useful for anything beyond 'instruction generation' for port maintainers, I couldn't say. – David Wohlferd Jan 05 '19 at 02:36
Sorry to be passing the buck! That is exactly what has stopped me doing this myself, trying to figure out what the minimum set we should document should be. Some are clearly useless (m, M can’t be triggered from assembly, they print based on RTL codes), some I think programmers will use in a different way (constants you’d probably write in the asm text rather than use N), some are clearly useful in contexts like this question (w, x, c, b, h, s, d). – James Greenhalgh Jan 05 '19 at 16:40
Like you, I'd rather do it myself than dump it on Peter. But Peter talks to (way) more users than I do (plus he speaks arm), so hopefully he'll have a better feel for what people might actually need. Based on the links above, I'm thinking we include wxcbhsd + cn to start with, since that gets us parity with the armclang docs. And I'm with you that most of the constant stuff (epPx) can probably be omitted. I'm pretty dubious about CDNk too, but since I can't rule them out, that leaves 12 from your list (CDHNSRkALGyz). Ideally we keep the number small, since adding later is easier than removing. – David Wohlferd Jan 05 '19 at 23:50
Ah, I missed your link to the Arm Compiler documentation. There's no reason we couldn't pick the same set as a starting place. – James Greenhalgh Jan 07 '19 at 18:08
@PeterCordes - I tried sending an email to the Canadian domain that bears your name. Was that not right? I'd like to have your input here, but if you don't have the time/interest, James and I will do our best without you. – David Wohlferd Jan 14 '19 at 20:00
@DavidWohlferd: yeah, I saw your mail, I haven't gotten around to it yet because of other distractions. I'll see if I can get to it soon. – Peter Cordes Jan 14 '19 at 20:04

Peter Cordes · Answer 2 · 2018-12-31T00:25:54.543

"=r" is the constraint for GP integer registers.

The GCC manual claims that "=w" is the constraint for an FP / SIMD register on AArch64. But if you try that with a plain %0, you get v0, not s0, which won't assemble. I don't know a workaround here; you should probably report on the gcc bugzilla that the constraint documented in the manual doesn't work for scalar FP.

On Godbolt I tried this source:

float foo()
{
    float sum;
#ifdef __aarch64__
    asm volatile("fadd %0, s3, s4" : "=w"(sum) : :);   // AArch64
#else
    asm volatile("fadds %0, s3, s4" : "=t"(sum) : :);  // ARM32
#endif
    return sum;
}

double dsum()
{
    double sum;
#ifdef __aarch64__
    asm volatile("fadd %0, d3, d4" : "=w"(sum) : :);   // AArch64
#else
    asm volatile("faddd %0, d3, d4" : "=w"(sum) : :);  // ARM32
#endif
    return sum;
}

clang7.0 (with its built-in assembler) requires the asm to be actually valid. But for gcc we're only compiling to asm, and Godbolt doesn't have a "binary mode" for non-x86.

# AArch64 gcc 8.2  -xc -O3 -fverbose-asm -Wall
# INVALID ASM, errors if you try to actually assemble it.
foo:
    fadd v0, s3, s4 // sum
    ret     
dsum:
    fadd v0, d3, d4 // sum
    ret

clang produces the same asm, and its built-in assembler errors with:

<source>:5:18: error: invalid operand for instruction
    asm volatile("fadd %0, s3, s4" : "=w"(sum) : :);
                 ^
<inline asm>:1:11: note: instantiated into assembly here
        fadd v0, s3, s4
             ^

On 32-bit ARM, "=t" for single works, but "=w" (which the manual says you should use for double-precision) also gives you s0 with gcc. It works with clang, though. You have to use -mfloat-abi=hard and a -mcpu= option that selects a CPU with an FPU, e.g. -mcpu=cortex-a15.

# clang7.0 -xc -O3 -Wall --target=arm -mcpu=cortex-a15 -mfloat-abi=hard
# valid asm for ARM 32
foo:
        vadd.f32        s0, s3, s4
        bx      lr
dsum:
        vadd.f64        d0, d3, d4
        bx      lr

But gcc fails:

# ARM gcc 8.2  -xc -O3 -fverbose-asm -Wall -mfloat-abi=hard -mcpu=cortex-a15
foo:
        fadds s0, s3, s4        @ sum
        bx      lr  @
dsum:
        faddd s0, d3, d4        @ sum    @@@ INVALID
        bx      lr  @

So you can use =t for single just fine with gcc, but for double presumably you need a %something0 modifier to print the register name as d0 instead of s0, with a "=w" output.

Obviously these asm statements would only be useful for anything beyond learning the syntax if you add constraints to specify the input operands as well, instead of reading whatever happened to be sitting in s3 and s4.
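
As a rough sketch of what that looks like for the 32-bit ARM single-precision case, the "t" constraint can be used for the inputs too. This assumes GCC or Clang targeting a core with VFP. The function name add_f32_arm32 is hypothetical, the __ARM_FP check is my assumption about how to detect single-precision hardware FP (the ACLE feature macro), and the plain-C branch is only there so the file builds on other targets:

```c
/* Hypothetical helper: single-precision add on 32-bit ARM with VFP. */
float add_f32_arm32(float a, float b)
{
    float sum;
#if defined(__arm__) && defined(__ARM_FP) && (__ARM_FP & 4)
    /* "t" asks for an s register for the output and for both inputs,
       so the template never names a fixed register such as s3 or s4. */
    __asm__("vadd.f32 %0, %1, %2" : "=t"(sum) : "t"(a), "t"(b));
#else
    sum = a + b; /* fallback so the sketch compiles on other targets */
#endif
    return sum;
}
```

vadd.f32 is the unified-syntax spelling of the older fadds mnemonic used above.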

See also https://stackoverflow.com/tags/inline-assembly/info

@DavidWohlferd: I meant that without input constraints, this toy inline asm is only useful for learning the syntax, not doing anything useful. To go beyond that, you need input constraints like `"w" (input1)`. — Peter Cordes, Dec 30 '18 at 23:52
@DavidWohlferd: I think a valid parsing of my sentence is that change is required for the statements to be useful for anything other than learning/testing the syntax. In my last edit I changed the wording of the rest of the sentence to try to make that clearer, but feel free to edit if you're still convinced it's confusing. Probably you aren't the only one that parsed it differently from how I intended. — Peter Cordes, Dec 31 '18 at 05:20
Opened a bug for the 32-bit problem at: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89482 — Ciro Santilli, Feb 24 '19 at 10:15

Ciro Santilli · Accepted Answer · 2019-02-28T08:58:22.010

ARMv7 double: %P modifier

GCC devs informed me of the correct undocumented modifier for ARMv7 doubles at https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89482#c4 Maybe I should stop being lazy and grep GCC some day:

main.c

#include <assert.h>

int main(void) {
    double my_double = 1.5;
    __asm__ (
        "vmov.f64 d0, 1.0;"
        "vadd.f64 %P[my_double], %P[my_double], d0;"
        : [my_double] "+w" (my_double)
        :
        : "d0"
    );
    assert(my_double == 2.5);
}

Compile and run:

sudo apt-get install qemu-user gcc-arm-linux-gnueabihf
arm-linux-gnueabihf-gcc -O3 -std=c99 -ggdb3 -march=armv7-a -marm \
  -pedantic -Wall -Wextra -o main.out main.c
qemu-arm -L /usr/arm-linux-gnueabihf main.out

Disassembly contains:

   0x00010320 <+4>:     08 7b b7 ee     vmov.f64        d7, #120        ; 0x3fc00000  1.5
   0x00010324 <+8>:     00 0b b7 ee     vmov.f64        d0, #112        ; 0x3f800000  1.0
   0x00010328 <+12>:    00 7b 37 ee     vadd.f64        d7, d7, d0

Tested in Ubuntu 16.04, GCC 5.4.0, QEMU 2.5.0.
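
The same %P modifier should also apply to input operands. The following is a minimal sketch, not something I ran under the setup above: add_f64_arm32 is a hypothetical name, the __ARM_FP check is an assumption about how to detect double-precision hardware FP, and the plain-C branch only exists so the file builds on non-ARM hosts.

```c
/* Hypothetical helper: double-precision add on 32-bit ARM with VFP. */
double add_f64_arm32(double a, double b)
{
    double sum;
#if defined(__arm__) && defined(__ARM_FP) && (__ARM_FP & 8)
    /* "w" selects a VFP register for a 64-bit value; %P prints it as dN. */
    __asm__("vadd.f64 %P[sum], %P[a], %P[b]"
            : [sum] "=w"(sum)
            : [a] "w"(a), [b] "w"(b));
#else
    sum = a + b; /* fallback so the sketch compiles on other targets */
#endif
    return sum;
}
```

Because both inputs are real operands, there is no need for a scratch register or a "d0" clobber as in main.c.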

Source code definition point
