
I wrote a 16x4 SAD (sum of absolute differences) function and an ARM NEON-optimized version of it in inline assembly. My problem is that I am only getting a 2x speedup (with -O3 enabled), while I would expect at least 6x. Can anyone explain what is happening internally?

// Plain C++ reference: sum of absolute differences over a 16x4 block.
#include <cstdlib>  // abs()

static unsigned int f_sad_16x4 ( const unsigned char* a, const unsigned char* b, const unsigned int uiStrideOrg, const unsigned int uiStrideCur )
{
  unsigned int sad = 0;
  for (int i = 0; i < 4; i++)
  {
      for (int j = 0; j < 16; j++)
      {
          sad += abs(static_cast<int>(a[i*uiStrideOrg+j]) - static_cast<int>(b[i*uiStrideCur+j]));
      }
  }
  return sad;
}

static unsigned int f_sad_16x4_neon(const unsigned char* a, const unsigned char* b, const unsigned int uiStrideOrg, const unsigned int uiStrideCur )
{
    unsigned short auiSum[8];
    unsigned short* puiSum = auiSum;

    __asm__ volatile(
    /* Load 4 rows each of piOrg and piCur, interleaving the loads with
       the absolute-difference and pairwise-widening arithmetic. */
    "vld1.8 {q0},[%[piOrg]],%[iStrideOrg]   \n\t"
    "vld1.8 {q4},[%[piCur]],%[iStrideCur]   \n\t"
    "vld1.8 {q1},[%[piOrg]],%[iStrideOrg]   \n\t"
    "vabd.u8 q8,  q0, q4                    \n\t"
    "vld1.8 {q5},[%[piCur]],%[iStrideCur]   \n\t"
    "vld1.8 {q2},[%[piOrg]],%[iStrideOrg]   \n\t"
    "vabd.u8 q9,  q1, q5                    \n\t"
    "vld1.8 {q6},[%[piCur]],%[iStrideCur]   \n\t"
    "vld1.8 {q3},[%[piOrg]],%[iStrideOrg]   \n\t"
    "vabd.u8 q10, q2, q6                    \n\t"
    "vld1.8 {q7},[%[piCur]],%[iStrideCur]   \n\t"
    "vpaddl.u8 q12, q8                      \n\t"
    "vabd.u8 q11, q3, q7                    \n\t"
    "vpaddl.u8 q13, q9                      \n\t"
    "vpaddl.u8 q14, q10                     \n\t"
    "vadd.u16 q8, q12, q13                  \n\t"
    "vpaddl.u8 q15, q11                     \n\t"
    "vadd.u16 q9, q14, q15                  \n\t"
    "vadd.u16 q0, q8, q9                    \n\t"
    "vst1.16 {q0}, [%[puiSum]]              \n\t"
    :[piOrg]        "+r"    (a),
     [piCur]        "+r"    (b),
     [puiSum]       "+r"    (puiSum)
    :[iStrideCur]   "r"     (uiStrideCur),
     [iStrideOrg]   "r"     (uiStrideOrg)
    :"q0","q1","q2","q3","q4","q5","q6","q7","q8","q9","q10","q11","q12","q13","q14","q15",
     "memory"  /* the block stores to auiSum through puiSum */
    );

    unsigned int uiSum = auiSum[0] + auiSum[1] + auiSum[2] + auiSum[3] + auiSum[4] + auiSum[5] + auiSum[6] + auiSum[7];

    return uiSum;
}
user3249055
  • Doing such a small amount of work in a function can mean that function call overheads swamp any optimisation gains – if you're calling this code in a loop then consider either (a) re-factoring so that the optimised code is in the loop rather than in a separate function, and moving any set-up/tear-down stuff out of the loop, or (b) re-writing the function using intrinsics and making it inline – that way the compiler can get rid of any function preamble/postamble code, re-schedule the instructions, and probably even do a better job of allocating registers. – Paul R Sep 30 '14 at 07:07
  • Did you check the assembly code that the compiler generates out of the first function? Maybe it is using the neon unit already... – Christoph Freundl Sep 30 '14 at 07:31
  • It is not using NEON instructions in the plain C version – user3249055 Sep 30 '14 at 07:58
  • Here's the generated assembly code : https://onedrive.live.com/?cid=8715c900af878467&id=8715C900AF878467%21627 – user3249055 Sep 30 '14 at 08:11
  • Nearly half of that code is loads and stores – are you expecting to find a threefold increase in memory bandwidth out of nowhere? ;) – Notlikethat Sep 30 '14 at 09:31

1 Answer

This code performs poorly because, in addition to the 20 NEON instructions in your inline assembler block, the compiler has to emit 23 scalar integer instructions around it.

The simplest part to fix is this line:

unsigned int uiSum = auiSum[0] + auiSum[1] + auiSum[2] + auiSum[3] + auiSum[4] + auiSum[5] + auiSum[6] + auiSum[7];

This final reduction step can be performed on the NEON unit instead, e.g.:

VADDL.S16   q0, d0, d1     // 32 bit lanes in q0
VPADDL.S32  q0, q0         // 64 bit lanes in q0
VADD.I64    d0, d0, d1     // one 64 bit result in d0

You can then retrieve the result with a single move:

VMOV %n, %Hn, d0           // retrieve 64 bit result

In the above, you need to set n to the number of the output operand that is bound to the result variable in the inline asm's output list.
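
For illustration, a minimal sketch of that wiring, assuming the result is held in a 64-bit variable so that it occupies an ARM register pair (ullSum is a hypothetical name; with GCC, %[sum] prints the low register of the pair and %H[sum] the high one):

unsigned long long ullSum;
__asm__ volatile(
    /* ... SAD and reduction as above, leaving one 64-bit sum in d0 ... */
    "vmov %[sum], %H[sum], d0               \n\t"  /* d0 -> ARM register pair */
    : [sum] "=r" (ullSum)
    : /* input operands as before */
    : "d0" /* plus the other clobbered NEON registers */
);
unsigned int uiSum = (unsigned int)ullSum;

This removes both the store to auiSum and the scalar loads and adds of the original reduction.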

The other problem is that the register allocation is suboptimal. The registers d8 to d15 (q4 to q7) are callee-saved: any function which uses them must preserve their values, so the compiler emits save/restore code around your asm block. You can rewrite your function to reuse registers so that it avoids d8-d15 entirely.
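
As a sketch of what that rewrite might look like (keeping your operand names; the VABD results overwrite their inputs, and VPADAL folds all four rows into one accumulator, so only q0-q3 and q8-q11 are touched):

"vld1.8 {q0},[%[piOrg]],%[iStrideOrg]   \n\t"
"vld1.8 {q8},[%[piCur]],%[iStrideCur]   \n\t"
"vld1.8 {q1},[%[piOrg]],%[iStrideOrg]   \n\t"
"vld1.8 {q9},[%[piCur]],%[iStrideCur]   \n\t"
"vabd.u8 q0, q0, q8                     \n\t"  /* row 0 |a-b|, inputs reused */
"vld1.8 {q2},[%[piOrg]],%[iStrideOrg]   \n\t"
"vld1.8 {q10},[%[piCur]],%[iStrideCur]  \n\t"
"vabd.u8 q1, q1, q9                     \n\t"
"vld1.8 {q3},[%[piOrg]],%[iStrideOrg]   \n\t"
"vld1.8 {q11},[%[piCur]],%[iStrideCur]  \n\t"
"vabd.u8 q2, q2, q10                    \n\t"
"vabd.u8 q3, q3, q11                    \n\t"
"vpaddl.u8 q0, q0                       \n\t"  /* widen row 0 sums to u16 */
"vpadal.u8 q0, q1                       \n\t"  /* pairwise add rows 1-3 into q0 */
"vpadal.u8 q0, q2                       \n\t"
"vpadal.u8 q0, q3                       \n\t"

With the clobber list reduced to q0-q3 and q8-q11, the compiler no longer needs to save and restore d8-d15.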

This function would benefit from the use of NEON intrinsics. That would avoid the need to worry about register allocation, and would also make your code portable to AArch64.
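
For example, a minimal intrinsics sketch of the same 16x4 SAD (untested, and f_sad_16x4_intrin is just an illustrative name; the bounds assume the fixed 16x4 block of the original):

#include <arm_neon.h>

static inline unsigned int f_sad_16x4_intrin(const unsigned char* a,
                                             const unsigned char* b,
                                             unsigned int uiStrideOrg,
                                             unsigned int uiStrideCur)
{
    uint16x8_t acc = vdupq_n_u16(0);
    for (int i = 0; i < 4; i++)
    {
        uint8x16_t rowA = vld1q_u8(a + i * uiStrideOrg);
        uint8x16_t rowB = vld1q_u8(b + i * uiStrideCur);
        /* widening absolute-difference-and-accumulate, low and high halves */
        acc = vabal_u8(acc, vget_low_u8(rowA), vget_low_u8(rowB));
        acc = vabal_u8(acc, vget_high_u8(rowA), vget_high_u8(rowB));
    }
    /* horizontal reduction of the eight u16 partial sums */
    uint32x4_t s32 = vpaddlq_u16(acc);
    uint64x2_t s64 = vpaddlq_u32(s32);
    return (unsigned int)(vgetq_lane_u64(s64, 0) + vgetq_lane_u64(s64, 1));
}

The compiler is then free to schedule the loads and allocate registers itself, and the same source compiles unchanged for AArch64.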

Charles Baylis
  • I didn't understand the part where you say d8 to d15 must be preserved. Why is that so? – user3249055 Sep 30 '14 at 10:52
  • It is a requirement of the ARM ABI. The compiler assumes that a function call will not change the values in d8-d15, so it will store variables in those registers. Therefore, a function which uses those registers must restore the original values before it returns to prevent the calling function from getting incorrect results. – Charles Baylis Sep 30 '14 at 12:51
  • If I am not wrong, the latency of inter-pipeline data transfers is too high and hence must be avoided. In that sense, would it be wise to use the VMOV %n, %Hn, d0 instruction at the end? – user3249055 Oct 01 '14 at 04:59
  • If you transfer data from NEON registers to ARM registers, there is an unavoidable latency which results from the pipeline, regardless of the instruction sequence used. On some micro-architectures you may be able to reduce the penalty with a `VST1, …, LDR` sequence that has independent instructions in between, but `VST1, LDR` as an adjacent pair will almost certainly perform worse than `VMOV` – Charles Baylis Oct 01 '14 at 09:16
  • @CharlesBaylis The same penalty of 12 cycles applies as well when both cores access the same memory. – Jake 'Alquimista' LEE Oct 01 '14 at 10:35
  • @CharlesBaylis: I made the corrections to my code that you suggested. Although the gain has improved, it is still not up to the expected mark. What else can I do to improve it further? PS: the optimised version is being compared against the plain C version compiled with -O3. – user3249055 Oct 05 '14 at 12:57
  • @user3249055 Can you put your new function (in compilable form) online somewhere? – Charles Baylis Oct 07 '14 at 16:51