I have C code that uses NEON intrinsics and is meant to run on a Raspberry Pi 4 (Cortex-A72). I compile it with the GCC that ships with the OS.
On Raspberry Pi OS 32-bit (ARM, armv7l), if I run
gcc -o test test.c -march=native -mcpu=native -mtune=native -O3 -g
the code takes about 1,100,000 ns (the number changes from run to run, but it is always more than 1 million).
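For context on how a nanosecond figure like this can be obtained, here is a minimal sketch of a timing helper. It is an assumption, not the actual benchmark: it assumes a POSIX system with clock_gettime, and the names now_ns and elapsed_ns are hypothetical.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 199309L   /* expose clock_gettime under strict -std= modes */
#endif
#include <stdint.h>
#include <time.h>

/* Difference between two timespec values, in nanoseconds. */
int64_t elapsed_ns(struct timespec start, struct timespec end)
{
    return (int64_t)(end.tv_sec - start.tv_sec) * 1000000000LL
         + (int64_t)(end.tv_nsec - start.tv_nsec);
}

/* Current monotonic time in nanoseconds (hypothetical helper). */
int64_t now_ns(void)
{
    struct timespec zero = {0, 0};
    struct timespec t;
    clock_gettime(CLOCK_MONOTONIC, &t);
    return elapsed_ns(zero, t);
}
```

Calling now_ns() before and after the code under test and subtracting gives one sample. Because the number varies between runs, repeating the measurement and keeping the minimum or the median makes the comparison between builds more reliable.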
Still on Raspberry Pi OS 32-bit, if I run
gcc -o test test.c -march=native -mcpu=native -mtune=native -mfpu=neon -O3 -g
i.e., adding -mfpu=neon, the code takes about 12,000 ns, because it was designed to use the parallelism provided by vaddq_u32, veorq_u32 and others.
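To illustrate the kind of code involved, here is a minimal sketch of a kernel built on vaddq_u32 and veorq_u32. It is not the code from test.c: the function name add_xor_u32, the key array and the operation itself are hypothetical. The NEON path is only compiled when the compiler defines __ARM_NEON, and a scalar loop handles the tail (or everything, when NEON is not enabled).

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Hypothetical kernel: dst[i] = (a[i] + b[i]) ^ key[i] for n elements. */
void add_xor_u32(uint32_t *dst, const uint32_t *a, const uint32_t *b,
                 const uint32_t *key, size_t n)
{
    size_t i = 0;
#if defined(__ARM_NEON)
    /* Process four 32-bit lanes per iteration. */
    for (; i + 4 <= n; i += 4) {
        uint32x4_t va = vld1q_u32(a + i);
        uint32x4_t vb = vld1q_u32(b + i);
        uint32x4_t vk = vld1q_u32(key + i);
        vst1q_u32(dst + i, veorq_u32(vaddq_u32(va, vb), vk));
    }
#endif
    /* Scalar tail, and full fallback when NEON is not enabled. */
    for (; i < n; i++)
        dst[i] = (a[i] + b[i]) ^ key[i];
}
```

Both paths compute the same result, so timing this function with and without the NEON path is one way to see whether the intrinsics are actually being used in a given build.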
Now, on Raspberry Pi OS 64-bit (aarch64), if I run
gcc -o test test.c -march=native -mcpu=native -mtune=native -O3 -g
the code takes about 1,100,000 ns as well.
And here is the problem: GCC on aarch64 does not support -mfpu, and SIMD+FP is supposedly on by default on aarch64, yet the performance is still bad.
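One way to check what the compiler actually enabled for a given set of flags is to look at the predefined macros. The sketch below assumes the ACLE macro __ARM_NEON and the usual GCC target macros __aarch64__ and __arm__; the function name simd_target is hypothetical.

```c
/* Report, at compile time, which SIMD target the current flags select. */
const char *simd_target(void)
{
#if defined(__aarch64__) && defined(__ARM_NEON)
    return "aarch64, Advanced SIMD enabled";
#elif defined(__aarch64__)
    return "aarch64, Advanced SIMD disabled";
#elif defined(__ARM_NEON)
    return "32-bit ARM, NEON enabled";
#elif defined(__arm__)
    return "32-bit ARM, NEON not enabled";
#else
    return "not an ARM target";
#endif
}
```

Printing the returned string from each build shows whether the intrinsic path is available. If the aarch64 build reports that Advanced SIMD is enabled and the code is still slow, the cause is probably somewhere other than a missing -mfpu equivalent.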
Things that I've tried:
Building with the built-in GCC using -march=armv8-a+simd, -mcpu=cortex-a72+simd and similar -march options to enable SIMD+FP, but the code is still slow. (The list of options is here: https://gcc.gnu.org/onlinedocs/gcc/AArch64-Options.html)
Using cross-compilers (such as GCC-arm-none-elf and GCC-arm-none-eabi) from the ARM developer site. Some of their GCC versions have the -mfpu option, but either I get too many errors or the performance is the same. (https://developer.arm.com/tools-and-software/open-source-software/developer-tools/gnu-toolchain/downloads)
Is there any option in the built-in aarch64 GCC that forces the use of SIMD+FP, the way -mfpu does for the built-in 32-bit GCC?
Thank you.