How do I convert 32-bit NEON assembly to 64-bit?

Question

I am trying to use MSFA (Google's music synth) on 64-bit iOS devices, and it has four NEON assembly source files for DSP operations that are apparently written for 32-bit architectures. I was initially told that it would be better to rewrite these as NEON intrinsics so that the code would be architecture-agnostic. However, after reading some articles (such as http://hilbert-space.de/?p=22), it appears that pure hand-written assembly is still the ideal.

My question is, is it trivial to convert this to 64-bit? If so, how would I get started doing this?

The .s files are:

https://github.com/google/music-synthesizer-for-android/blob/master/cpp/src/neon_fir.s

https://github.com/google/music-synthesizer-for-android/blob/master/cpp/src/neon_fm_kernel.s

https://github.com/google/music-synthesizer-for-android/blob/master/cpp/src/neon_iir.s

https://github.com/google/music-synthesizer-for-android/blob/master/cpp/src/neon_ladder.s

See http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.den0024a/ch07s03.html and the other pages in that part of ARM's docs for some stuff about the changes to asm syntax for NEON, and the change to the register file. (thirty-two 128b NEON registers, instead of the big regs being composed of pairs of smaller regs.) See also [discussion here](http://stackoverflow.com/questions/38035351/rgba-to-abgr-inline-arm-neon-asm-for-ios-xcode/38040651?noredirect=1#comment63616468_38040651) about vector operand syntax. — Peter Cordes, Jul 01 '16 at 16:17

Peter Cordes · Accepted Answer · 2022-12-03T11:54:26.063

TL;DR: use intrinsics

It's not a bad idea to check the asm output to make sure it's not dumb, but using intrinsics lets compilers do constant-propagation, and schedule / software-pipeline for in-order cores.
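As a rough illustration of what "use intrinsics" can look like, here is a minimal sketch of a multiply-accumulate loop. The function name `scale_accumulate` and its signature are made up for this example and are not taken from the MSFA sources. The NEON path uses only `vld1q_f32`, `vmlaq_n_f32` and `vst1q_f32`, which are available for both AArch32 and AArch64, and there is a plain C fallback so the same file still builds on targets without NEON.

```c
#include <stddef.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical kernel (not from MSFA): out[i] += in[i] * gain. */
void scale_accumulate(float *out, const float *in, float gain, size_t n)
{
    size_t i = 0;
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    /* Same source for AArch32 and AArch64: four floats per iteration. */
    for (; i + 4 <= n; i += 4) {
        float32x4_t acc = vld1q_f32(out + i);
        float32x4_t x   = vld1q_f32(in + i);
        acc = vmlaq_n_f32(acc, x, gain);   /* acc + x * gain, per lane */
        vst1q_f32(out + i, acc);
    }
#endif
    /* Scalar tail; this is also the whole loop when NEON is unavailable. */
    for (; i < n; ++i)
        out[i] += in[i] * gain;
}
```

Because this is ordinary C, the compiler is free to inline it, fold a constant `gain`, and schedule the loads and stores for the target core. It is still worth looking at the generated asm for the cores you care about.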

If you read the comment thread on that post from 2009 you linked, you'd see that the bad code from NEON intrinsics was a gcc bug fixed in 2011.

Compilers are quite good at handling intrinsics these days, and continually improving. Clang especially can do quite a lot, like use different shuffle instructions than what you wrote with intrinsics.

At least they are for x86; compilers for ARM still sometimes struggle with intrinsics, especially when trying to access the two 8-byte halves of a 16-byte vector, as you often want to in 32-bit ARM code for horizontal operations. See the questions "ARM NEON intrinsics convert D (64-bit) register to low half of Q (128-bit) register, leaving upper half undefined" and "NEON intrinsic for sum of two subparts of a Q register". Jake Lee reports that as recently as 2018, some clang versions made a total mess of it, but GCC 6.x was not as bad.

This might not be as much of a problem with AArch64.
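To make the "two halves" issue concrete, here is a hedged sketch of a horizontal sum of four floats. The helper name `sum4` is invented for illustration. On 32-bit ARM it splits the Q register into its D halves with `vget_low_f32` / `vget_high_f32`, which is the pattern compilers have sometimes handled poorly. On AArch64 the across-vector intrinsic `vaddvq_f32` avoids the split entirely. The last branch is a plain C fallback for non-NEON targets.

```c
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical helper: returns p[0] + p[1] + p[2] + p[3]. */
float sum4(const float *p)
{
#if defined(__aarch64__) && (defined(__ARM_NEON) || defined(__ARM_NEON__))
    /* AArch64 has an across-vector add, so no D-half juggling is needed. */
    return vaddvq_f32(vld1q_f32(p));
#elif defined(__ARM_NEON) || defined(__ARM_NEON__)
    /* AArch32: add the low and high D halves, then a pairwise add. */
    float32x4_t v = vld1q_f32(p);
    float32x2_t s = vadd_f32(vget_low_f32(v), vget_high_f32(v));
    s = vpadd_f32(s, s);
    return vget_lane_f32(s, 0);
#else
    return (p[0] + p[1]) + (p[2] + p[3]);
#endif
}
```

The three branches add the lanes in different orders, so results can differ in the last bit for inputs that are not exactly representable.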

asm-level differences:

I'm not at all an expert on this, but one of the major NEON changes is that AArch64 has thirty-two 128-bit NEON registers (v0 - v31), instead of sixteen q registers that each alias a pair of d registers (q0 is d0 and d1, and so on).

See also some official ARM documentation about the syntax for element size, where you write .16B on a register operand to indicate a vector of sixteen 1-byte elements. (As opposed to the old syntax, where a .8 suffix on the instruction meant each element was 8 bits.)
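The two syntaxes can be compared side by side on a single intrinsic. The sketch below uses a made-up helper, `add16_u8`, and the comment shows the instruction that `vaddq_u8` typically corresponds to in each instruction set. The register numbers are only illustrative, since the compiler does its own register allocation.

```c
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical helper: dst[i] = a[i] + b[i] (wrapping) for 16 bytes. */
void add16_u8(uint8_t *dst, const uint8_t *a, const uint8_t *b)
{
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    uint8x16_t va = vld1q_u8(a);
    uint8x16_t vb = vld1q_u8(b);
    /* AArch32: vadd.i8 q0, q0, q1              (q0 aliases d0 and d1)
     * AArch64: add     v0.16b, v0.16b, v1.16b  (size is on the operands) */
    vst1q_u8(dst, vaddq_u8(va, vb));
#else
    /* Plain C fallback for targets without NEON. */
    for (int i = 0; i < 16; ++i)
        dst[i] = (uint8_t)(a[i] + b[i]);
#endif
}
```

When porting hand-written asm, this kind of mnemonic and operand rewrite is the mechanical part; the register-file change matters wherever the 32-bit code addresses the d halves of a q register directly.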

Thanks. Do you have any suggestions on how I would go about converting these 4 files into intrinsics? — patrick, Jul 01 '16 at 17:35
@patrick: nothing specific beyond the obvious "make C functions", but you'll probably have to rewrite them by hand. I barely know NEON; I just picked up some bits of it from the same tags that I follow for x86 asm and SSE/AVX questions. — Peter Cordes, Jul 01 '16 at 17:47
The NEON intrinsics have a more-or-less 1:1 mapping with the equivalent AArch32 instructions, which should permit a relatively mechanical transformation of the vector parts. Unpicking the scalar stuff (loop counters, conditional parts, function prologue/epilogue boilerplate, etc) back to C++ code is probably the more involved aspect, but the plain C++ and SSE intrinsic versions ought to provide relatively useful points of reference. — Notlikethat, Jul 01 '16 at 22:40
