I would like to optimize a simple multiply-accumulate for-loop for the Qualcomm Hexagon DSP. From the manual, it is not clear to me how to do that, either for the vector version or for the non-vector version.
Assume the loop length is a multiple of 4 (e.g., 64), so that I can unroll the loop by a factor of 4. How would I do that? I can use either C intrinsics or assembly code, but I don't understand how to do the 4x memory load in the first place.
Here is what my loop could look like in C:
Word32 sum = 0;
Word16 *pointer1; Word16 *pointer2;  /* each points to 64 samples */
int i;
for (i = 0; i < 64; i++)
{
    sum += pointer1[i] * pointer2[i];  /* 16x16-bit product, accumulated in 32 bits */
}
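To make the goal concrete, here is a rough portable-C sketch of the 4x-unrolled structure I have in mind. It is not Hexagon-specific: the typedefs are my own stand-ins for the SDK types, and the names `mac_scalar`, `mac_unrolled4`, and `lane` are made up for illustration. The idea is that each iteration fetches four 16-bit samples from each array with a single 64-bit load and then accumulates the four products. I assume each 64-bit fetch would become one doubleword load and the four products one vector multiply-accumulate on the DSP, but that mapping is exactly what I have not been able to confirm from the manual.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Assumed stand-ins for the SDK typedefs used in the question. */
typedef int16_t Word16;
typedef int32_t Word32;

/* Reference version: one multiply-accumulate per iteration. */
static Word32 mac_scalar(const Word16 *a, const Word16 *b, int n)
{
    Word32 sum = 0;
    for (int i = 0; i < n; i++)
        sum += a[i] * b[i];
    return sum;
}

/* Hypothetical helper: extract 16-bit lane k (0..3) from a 64-bit word. */
static Word16 lane(uint64_t packed, int k)
{
    return (Word16)(uint16_t)(packed >> (16 * k));
}

/* Unrolled by 4: one 64-bit load per array fetches four samples at once.
 * n must be a multiple of 4. Both arrays are unpacked the same way, so
 * the sum does not depend on the byte order of the machine. */
static Word32 mac_unrolled4(const Word16 *a, const Word16 *b, int n)
{
    Word32 sum = 0;
    assert(n % 4 == 0);
    for (int i = 0; i < n; i += 4) {
        uint64_t pa, pb;
        memcpy(&pa, &a[i], sizeof pa);  /* 4 x 16-bit samples from a */
        memcpy(&pb, &b[i], sizeof pb);  /* 4 x 16-bit samples from b */
        sum += lane(pa, 0) * lane(pb, 0);
        sum += lane(pa, 1) * lane(pb, 1);
        sum += lane(pa, 2) * lane(pb, 2);
        sum += lane(pa, 3) * lane(pb, 3);
    }
    return sum;
}
```

Calling `mac_unrolled4(pointer1, pointer2, 64)` should give the same result as the original loop. What I am missing is how to express the two 64-bit loads and the four-way multiply-accumulate with the actual Hexagon intrinsics or instructions.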
Any suggestions?