I have C code that uses NEON intrinsics and runs on a Raspberry Pi 4 (Cortex-A72). When I compile the code with the built-in GCC:

  1. In 32-bit Raspberry Pi OS (ARM, armv7l), if I run

    gcc -o test test.c -march=native -mcpu=native -mtune=native -O3 -g

    I get a run time of about 1,100,000 ns (the number changes with each test, but it is always more than 1 million).

  2. Still in 32-bit Raspberry Pi OS, if I run

    gcc -o test test.c -march=native -mcpu=native -mtune=native -mfpu=neon -O3 -g

    i.e., adding -mfpu=neon, I get a run time of about 12,000 ns (because the code was designed to use the parallelism provided by vaddq_u32, veorq_u32 and others).

  3. Now, in 64-bit Raspberry Pi OS (aarch64), if I run

    gcc -o test test.c -march=native -mcpu=native -mtune=native -O3 -g

    I get a run time of about 1,100,000 ns as well.

This is the problem: GCC on aarch64 does not support -mfpu, and SIMD+FP is supposedly on by default on aarch64, yet the performance is still bad.

Things that I've tried:

  1. Building with the built-in GCC using -march=armv8-a+simd -mcpu=cortex-a72+simd and similar -march options to enable SIMD+FP, but the code is still slow. (The list of options is here: https://gcc.gnu.org/onlinedocs/gcc/AArch64-Options.html)

  2. Using cross-compilers (like GCC-arm-none-elf and GCC-arm-none-eabi) from the Arm developer site. Some of their GCC versions have the -mfpu option, but either I get too many errors or the performance is the same. (https://developer.arm.com/tools-and-software/open-source-software/developer-tools/gnu-toolchain/downloads)

Is there any option in the built-in aarch64 GCC that forces the use of SIMD+FP, as -mfpu=neon does in the 32-bit built-in GCC?

Thank you.

liwuen
  • So you know that SIMD is enabled by default for AArch64, so your question is not clear. "The performance is still slow" is way too vague. You might want to present your benchmark and ask about the performance instead of what you asked. Related/duplicate: https://stackoverflow.com/questions/29851128/gcc-arm64-aarch64-unrecognized-command-line-option-mfpu-neon – Eugene Sh. May 20 '22 at 13:54
  • [This answer](https://stackoverflow.com/a/29891469) suggests that `-ftree-vectorize` could help, but you didn't provide a [mcve], so only you can test the effects. – Siguza May 20 '22 at 16:03
  • Neon is mandatory on `aarch64`, and therefore, there is no option to enable it. You shouldn't expect any meaningful performance gain though. Auto-vectorization is largely utterly useless on ARM. – Jake 'Alquimista' LEE May 20 '22 at 17:16
  • Your variables (arrays) might also need additional alignment to be vectorized properly. – Goswin von Brederlow May 20 '22 at 23:38
  • @Jake'Alquimista'LEE I beg to disagree. I've seen gcc and clang auto-vectorize large parts of some code I work with, with meaningful performance gains. If you're willing to look at the generated output using e.g. Compiler Explorer or objdump, and rewrite it to be more compiler-friendly (I've seen a case where simply adding the `restrict` keyword to function parameters was enough) you can get even better results. Sure an expert can often write better code by hand, but this may not be necessary/worthwhile. – swineone May 23 '22 at 15:00
