I want to execute the assembler instruction vrhadd in C++ using __asm__. I need to map the following statement to ARM-specific instructions (using vrhadd), since the given codebase was developed on x86-64 and I'm using ARM64.

__asm__("vpavgb %[a], %[b], %[c]" : [c] "=x" (res) : [a] "x" (a), [b] "x" (b));

Here a, b and c are 256-bit SIMD registers. Compiling this line on my system fails with:

error: couldn't allocate output register for constraint 'x'

I guess this is because x, as the constraint of the operands, stands for a 256-bit vector operand in an AVX register on x86. On ARM it represents a 32, 64, or 128-bit floating-point/SIMD register in the ranges s0-s15, d0-d7, or q0-q3, respectively.

Since I could not find one, I was wondering whether there is a direct equivalent of the x86 constraint x on ARM64.

Peter Cordes
terdev
  • On x86, an `x` constraint is either XMM (128-bit), YMM (256-bit), or ZMM (512-bit), depending on the type of the variable you use with it: https://gcc.gnu.org/onlinedocs/gcc/Machine-Constraints.html. Or from [In GNU C inline asm, what are the size-override modifiers for xmm/ymm/zmm for a single operand?](https://stackoverflow.com/q/34459803) - clang used to(?) only accept `"x"` for 128-bit vectors? Not sure when `"Yt"` and so on are needed. Anyway, I don't expect any of that to be the same for ARM. – Peter Cordes Mar 22 '22 at 00:37
  • Maybe [GCC What's the right inline assembly constraint to operate with ARM VFP instructions?](https://stackoverflow.com/q/12610375) . And I think [gcc arm inline assembler %e0 and %f0 operand modifiers for 16-byte NEON operands?](https://stackoverflow.com/q/51476786) is showing a working example. – Peter Cordes Mar 22 '22 at 00:41
  • Erm... you do realize that the NEON vector unit is only 128 bits wide? So it makes sense that there wouldn't be any constraint for a 256-bit vector - there is no register that could hold it, and no way for the machine to process it. You're going to have to split up your 256-bit vectors into two 128-bit chunks and operate on them separately. (Or is your question about the proper constraint for a 128-bit vector as the closest "equivalent"?) – Nate Eldredge Mar 22 '22 at 07:03
  • On the other hand, why use inline asm here at all? ARM/ARM64 SIMD has perfectly good [intrinsics](https://developer.arm.com/architectures/instruction-sets/intrinsics/vrhaddq_u8); that would avoid the whole issue as well as enabling better optimization. [Obligatory link to DontUseInlineAsm](https://gcc.gnu.org/wiki/DontUseInlineAsm) – Nate Eldredge Mar 22 '22 at 07:10
  • @NateEldredge Uhhhhh.... Sorry, I don't agree. I could show you many examples proving the opposite. Especially compilers for `aarch32` generate FUBAR machine codes when doing permutations (`vtrn`, `vzip`, and `vuzp`). But you are right on the other hand: inline assembly is only good for short inline functions. – Jake 'Alquimista' LEE Mar 26 '22 at 03:46
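
The suggestions in the comments (split the 256-bit operation into two 128-bit halves, and prefer intrinsics) could look roughly like the sketch below. It is not from the original code base. The function name `avg_round_u8x32` and the `USE_INLINE_ASM` macro are made up for illustration, and the data is assumed to live in plain byte arrays rather than in a 256-bit vector type. On AArch64 the rounding halving add is spelled `urhadd` (`vrhadd` is the 32-bit ARM mnemonic), and the constraint for a SIMD/FP register in GCC and Clang is `w`, which is probably the closest counterpart to x86's `x`. A scalar fallback is included so the sketch also builds on other targets.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

#if defined(__aarch64__)
#include <arm_neon.h>
#endif

// Hypothetical helper: byte-wise rounding average of two 32-byte (256-bit)
// blocks, out[i] = (a[i] + b[i] + 1) >> 1, which is what vpavgb computes.
inline void avg_round_u8x32(const std::uint8_t* a, const std::uint8_t* b,
                            std::uint8_t* out) {
    assert(a != nullptr && b != nullptr && out != nullptr);
#if defined(__aarch64__)
    // NEON registers are 128 bits wide, so handle two 16-byte halves.
    for (std::size_t off = 0; off < 32; off += 16) {
        uint8x16_t va = vld1q_u8(a + off);
        uint8x16_t vb = vld1q_u8(b + off);
#if defined(USE_INLINE_ASM)  // hypothetical switch, for illustration only
        // Inline-asm variant: "w" requests a SIMD/FP register on AArch64.
        uint8x16_t vr;
        __asm__("urhadd %[r].16b, %[x].16b, %[y].16b"
                : [r] "=w"(vr)
                : [x] "w"(va), [y] "w"(vb));
#else
        // Intrinsic variant: lets the compiler schedule and optimize freely.
        uint8x16_t vr = vrhaddq_u8(va, vb);
#endif
        vst1q_u8(out + off, vr);
    }
#else
    // Portable scalar fallback for non-AArch64 targets.
    for (std::size_t i = 0; i < 32; ++i) {
        out[i] = static_cast<std::uint8_t>((a[i] + b[i] + 1) >> 1);
    }
#endif
}
```

Calling `avg_round_u8x32(a, b, out)` with three 32-byte buffers fills `out` with the rounded averages; for instance, input bytes 1 and 2 give 2, because the sum is rounded up before halving.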

0 Answers