0

I was trying to make my older code run faster as I discovered, that RPi 2 processor supports NEON instructions. So I wrote this code:

__asm__ __volatile__(
  "vld1.8 {%%d2, %%d3}, [%1];"
  "vld1.8 {%%d4, %%d5}, [%2];"
  "vaba.u8 %%q0, %%q1, %%q2;"
  "vst1.64 %%d0, [%0];"
  : "=r" (address_sad_intermediary)
  : "r" (address_big_pic), "r" (address_small_pic)
  :
);

Then in C the main sad variable is summed with sad_intermediary.

The main goal is to compute the sum of absolute differences, so I load 16 B from big_pic into q1 register, 16 B from small_pic into q2 register, calculate the SAD into q0, then load the lower 8 B from q0 into the intermediary variable. The problem is, that the resulting sad is zero.

I use GCC 4.9.2 with -std=c99 -pthread -O3 -lm -Wall -march=armv7-a -mfpu=neon-vfpv4 -mfloat-abi=hard options.

Do you see any problems with the code? Thanks.

1 Answers1

2

You never load anything into q0, so the vaba is adding the absolute difference to an uninitialised register. It also looks like you're not declaring which registers you're modifying.

But I don't know if that's the cause of your problem because I'm not too handy with inline assembly. You probably shouldn't be using inline assembly for something like this, though. If you use intrinsics then the compiler has greater ability to optimise the code. Something like this:

#include <arm_neon.h>

...
uint8x8_t s = vld1_u8(address_sad_intermediary);
s = vaba_u8(s, vld1_u8(address_big_pic), vld1_u8(address_small_pic));
vst1_u8(address_sad_intermediary, s);

(note that this code only works with eight bytes, because you only save eight bytes in your code)

Community
  • 1
  • 1
sh1
  • 4,324
  • 17
  • 30
  • But why am I not loading nothing to the q0? In this [document, p58](https://people.xiph.org/~tterribe/daala/neon_tutorial.pdf) you can see, that the result is saved into the first register, in my case q0. I didn't wanted to use intrinsics, because I read that the performance is not ideal with them. – Fehér Zoltán May 01 '16 at 09:54
  • Performance can be worse with inline assembly. Intrinsics performance has been historically terrible, but gcc-6.1 is available now, and that and modern Clang both do a reasonable job. As long as the code is simple they shouldn't mess up, and they'll handle pipeline scheduling without you having to think about it. – sh1 May 01 '16 at 17:20
  • `vaba` reads from q0 and adds to that the absolute difference of q1 and q2. You have to have something in q0 before you can perform the operation, otherwise you'll get a nonsense result. – sh1 May 01 '16 at 17:22
  • I tried the intrinsics, and after looking at generated ASM code, which looked OK, I went that way. The main problem was, that I didn't realize the `vaba` generate vector variable not a scalar. – Fehér Zoltán May 02 '16 at 10:25