I was trying to make my older code run faster after I discovered that the RPi 2 processor supports NEON instructions. So I wrote this code:
__asm__ __volatile__(
"vld1.8 {%%d2, %%d3}, [%1];"
"vld1.8 {%%d4, %%d5}, [%2];"
"vaba.u8 %%q0, %%q1, %%q2;"
"vst1.64 %%d0, [%0];"
: "=r" (address_sad_intermediary)
: "r" (address_big_pic), "r" (address_small_pic)
:
);
Then, in C, sad_intermediary is added to the main sad variable.
The main goal is to compute the sum of absolute differences (SAD), so I load 16 B from big_pic into the q1 register and 16 B from small_pic into the q2 register, calculate the SAD into q0, then store the lower 8 B of q0 into the intermediary variable. The problem is that the resulting sad is zero.
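For reference, the result the 16-byte step is meant to produce can be sketched in plain C. This is only a scalar baseline to compare the NEON output against, not the code in question; the function name sad16_scalar is a hypothetical helper introduced here for illustration.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical scalar reference (not part of the original code):
 * sum of absolute differences over 16 consecutive bytes. */
static unsigned sad16_scalar(const uint8_t *big_pic, const uint8_t *small_pic)
{
    unsigned sad = 0;
    for (int i = 0; i < 16; ++i) {
        /* Widen to int before subtracting so the difference cannot wrap. */
        sad += (unsigned)abs((int)big_pic[i] - (int)small_pic[i]);
    }
    return sad;
}
```

Calling sad16_scalar(address_big_pic, address_small_pic) on a few known inputs and comparing it with the value produced by the assembly block may help narrow down whether the loads, the vaba step, or the store is at fault.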
I use GCC 4.9.2 with -std=c99 -pthread -O3 -lm -Wall -march=armv7-a -mfpu=neon-vfpv4 -mfloat-abi=hard options.
Do you see any problems with the code? Thanks.