Qualcomm Hexagon: Optimize MAC-loop

Question

I would like to DSP-optimize a simple multiply-accumulate for-loop for the QC Hexagon. From the manual, it's not perfectly clear to me how to do that, both for the vector version and the non-vector version.

Assume my loop length is a multiple of 4 (e.g., 64), so I want to unroll the loop by a factor of 4. How would I do that? I can use either C intrinsics or asm code, but I don't understand how to do the 4x memory load in the first place.

Here is what my loop could look like in C:

Word32 sum = 0;
Word16 *pointer1; Word16 *pointer2;

for (i=0; i<64; i++)
{
    sum += pointer1[i]*pointer2[i];
}

Any suggestions?

Are you asking how to do simple scalar unrolling, with `pointer2[i+0]` .. `pointer2[i+3]` in the loop body? You might want to use four separate `sum0` .. `sum3` accumulators to encourage the compiler in that direction, in case it doesn't do that for you even with integer accumulators. (And of course the same thing with vector accumulators, if Hexagon has a widening integer multiply. Or do you want to zero-extend a 16x16 multiply and then sum it into a 32-bit accumulator? IDK whether Word16 is narrower than `int`; if so, the operands of `*` will implicitly promote to `int`.) - Peter Cordes, Sep 08 '21 at 08:36
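A minimal sketch of the scalar unrolling described in the comment above, with four independent accumulators. It assumes `Word16` and `Word32` are 16-bit and 32-bit signed integers (they are typedef'd here only so the snippet compiles on any host), and `dot_unrolled4` is a made-up name for illustration. Whether the compiler turns this into packed loads or parallel MACs on Hexagon is not guaranteed.

```c
#include <stdint.h>

/* Assumption: Word16/Word32 are 16- and 32-bit signed integers. */
typedef int16_t Word16;
typedef int32_t Word32;

/* Hypothetical helper: dot product unrolled by 4.
   n must be a multiple of 4 (e.g., 64 as in the question). */
Word32 dot_unrolled4(const Word16 *pointer1, const Word16 *pointer2, int n)
{
    /* Four separate accumulators remove the dependency chain
       between consecutive multiply-accumulates. */
    Word32 sum0 = 0, sum1 = 0, sum2 = 0, sum3 = 0;

    for (int i = 0; i < n; i += 4) {
        /* Word16 operands promote to int before the multiply. */
        sum0 += pointer1[i + 0] * pointer2[i + 0];
        sum1 += pointer1[i + 1] * pointer2[i + 1];
        sum2 += pointer1[i + 2] * pointer2[i + 2];
        sum3 += pointer1[i + 3] * pointer2[i + 3];
    }

    /* Combine the partial sums once, after the loop. */
    return sum0 + sum1 + sum2 + sum3;
}
```

Calling `dot_unrolled4(pointer1, pointer2, 64)` would replace the original loop; the result matches the simple loop as long as the 32-bit sums do not overflow.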

score 0 · Answer 1 · answered Sep 16 '21 at 01:38

Here is a FIR filter implementation that demonstrates how to use Q6_P_vrmpyhacc_PP, the multiply halfword/accumulate. This instruction is described as 'big mac' in the Programmer's Reference Manual (PRM).

This instruction is in the scalar core so it does not require the HVX vector coprocessor.

void FIR08(short_8B_align Input[],
           short_8B_align Coeff[],
           short_8B_align Output[],
           int unused, int ntaps,
           int nsamples)
{
  Word64 * vInput = (Word64*)Input;
  Word64 * vCoeff = (Word64*)Coeff;
  Word64 *__restrict vOutput = (Word64*)Output;
  int i, j;
  Word64 sum0, sum1, sum2, sum3;

  for (i = 0; i < nsamples/4; i++)
  {
      sum0 = sum1 = sum2 = sum3 = 0;
      for (j = 0; j < ntaps/4; j++)
      {
          Word64 vIn1 = vInput[i+j];
          Word64 vIn2 = vInput[i+j+1];
          Word64 curCoeff = vCoeff[j];
          Word64 curIn;

          curIn = vIn1;
          sum0 = Q6_P_vrmpyhacc_PP(sum0, curIn, curCoeff);

          curIn = Q6_P_valignb_PPI(vIn2, vIn1, 2);
          sum1 = Q6_P_vrmpyhacc_PP(sum1, curIn, curCoeff);

          curIn = Q6_P_valignb_PPI(vIn2, vIn1, 4);
          sum2 = Q6_P_vrmpyhacc_PP(sum2, curIn, curCoeff);

          curIn = Q6_P_valignb_PPI(vIn2, vIn1, 6);
          sum3 = Q6_P_vrmpyhacc_PP(sum3, curIn, curCoeff);
      }

      Word64 curOut = Q6_P_combine_RR(Q6_R_combine_RhRh(sum3, sum2), Q6_R_combine_RhRh(sum1, sum0));
      vOutput[i] = Q6_P_vasrh_PI(curOut, 2);
  }
}
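
For the plain dot product in the question, the same instruction could be applied as in the sketch below. To keep it runnable off-target, `vrmpyhacc_ref` is a hypothetical portable stand-in that models what `Q6_P_vrmpyhacc_PP` is understood to compute (the sum of four signed 16x16 halfword products added to a 64-bit accumulator); on Hexagon the call would be replaced by the intrinsic. The typedefs and the name `dot_vrmpyh` are assumptions for illustration.

```c
#include <stdint.h>
#include <string.h>

/* Assumption: Word16/Word64 are 16- and 64-bit signed integers. */
typedef int16_t Word16;
typedef int64_t Word64;

/* Hypothetical portable stand-in for Q6_P_vrmpyhacc_PP: multiplies the four
   signed halfwords of a and b lane by lane and adds all products to acc. */
static Word64 vrmpyhacc_ref(Word64 acc, Word64 a, Word64 b)
{
    for (int k = 0; k < 4; k++) {
        Word16 ah = (Word16)(uint16_t)((uint64_t)a >> (16 * k));
        Word16 bh = (Word16)(uint16_t)((uint64_t)b >> (16 * k));
        acc += (Word64)ah * bh;
    }
    return acc;
}

/* Dot product that consumes four halfwords per iteration.
   n must be a multiple of 4. */
Word64 dot_vrmpyh(const Word16 *pointer1, const Word16 *pointer2, int n)
{
    Word64 sum = 0;

    for (int i = 0; i < n; i += 4) {
        Word64 a, b;
        /* One 64-bit load fetches four 16-bit samples ("4x memory load"). */
        memcpy(&a, &pointer1[i], sizeof a);
        memcpy(&b, &pointer2[i], sizeof b);
        /* On Hexagon: sum = Q6_P_vrmpyhacc_PP(sum, a, b); */
        sum = vrmpyhacc_ref(sum, a, b);
    }
    return sum;
}
```

The `memcpy` loads avoid alignment and aliasing assumptions; if the buffers are known to be 8-byte aligned, as in the FIR example above, casting to `Word64 *` is the form that example uses.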
