Optimize a weighted moving average

Question

Environment : STM32H7 and GCC
Working with a flow of data : 1 sample received from SPI every 250 us
I do a "triangle" weighted moving average with 256 samples, like this but middle sample is weighted 1 and it forms a triangle around it
My samples are stored in uint32_t val[256] circular buffer, it works with a uint8_t write_index
The samples are 24 bits, the max value of a sample is 0x00FFFFFF

uint8_t write_idx =0;
uint32_t val[256];
float coef[256];

void init(void)
{
  uint8_t counter=0;
  // I calculate my triangle coefs
  for(uint16_t c=0;c<256;c++) 
  {
    coef[c]=(c>127)?--counter:++counter;
    coef[c]/=128;
  }
}

void ACQ_Complete(void)
{
  uint32_t moy=0;
  // write_idx is meant to wrap
  val[write_idx++]= new_sample;
  // calc moving average (uint8_t)(c-write_idx) is meant to wrap
  for(uint16_t c=0;c<256;c++)
    moy += (uint32_t)(val[c]*coef[(uint8_t)(c-write_idx)]);
  moy/=128;
}

I have to do the calcs during a 250 us time span, but I measured with a debug GPIO pin that the "moy" part takes 252 us
Code is simulated here
Interesting fact : If I remove the (uint32_t) cast near the end it takes 274 us instead of 252 us

How can I get it done faster ?

I was thinking of using uint32 instead of float for coef (by multiply by 1000 for example) but my uint32 would overflow

You average looks like a circular convolution. In this case, a large gain can be expected by usihg a FFT. — Damien, Mar 18 '21 at 09:42
Other than that, does your benchmarking take hardware FPU vs software floating point in account? Do you actually _need_ floating point and why? — Lundin, Mar 18 '21 at 09:43
if I multiply my coef by 1000 to make them `uint16_t` instead of `float` my `uint32_t moy` would overflow — GabrielT, Mar 18 '21 at 09:46
@Damien Are you sure a FFT wouldn't be heavier than a moving average ? — GabrielT, Mar 18 '21 at 09:48
It depends if you want to calculate one average only (complexity O(n) then), or if you want to calculate this average for all the data (complexity O(n^2)). A FFT is O(n logn). Not clear for me what you want to do exactly. You should provide a complete code, with main(), including means to measure the time of the function. — Damien, Mar 18 '21 at 09:51
@Damien I want to calculate a moving average for a flow of data, I receive 1 sample from SPI every 250 us then I do my ACQ_Complete() and I send the data to the PC throw HID USB — GabrielT, Mar 18 '21 at 09:55
Also there's literally no difference between what you have and `moy += val[c] * coef[c-write_idx];`. The casts just added pointless bloat. If you want a specific type then cast _before_ you carry out an operation, not afterwards. — Lundin, Mar 18 '21 at 10:24
@Lundin Here are my settings, I removed the -l : `-mcpu=cortex-m7 -std=gnu11 -g3 -DUSE_HAL_DRIVER -DSTM32H743xx -DDEBUG -c -Wall -fstack-usage --specs=nano.specs -mfpu=fpv5-d16 -mfloat-abi=hard -mthumb` . I was at none optimisation I added -Ofast and now it's 62 us. Thank you very much, I forgot i was with none optimization ! — GabrielT, Mar 18 '21 at 10:39
`-O2` is generally what I'd recommend for ARM GCC. `-Ofast` drops standard compliance which might cause all manner of crazy stuff. — Lundin, Mar 18 '21 at 11:06
Is this not simply an FIR filter - with a triangular impulse response? The CMSIS has optimised functions for that (will use the intrinsic M7 DSP and SIMD instructions that the C compiler might not utilise). https://arm-software.github.io/CMSIS_5/DSP/html/group__FIR.html — Clifford, Mar 18 '21 at 21:18

Tom V · Accepted Answer · 2021-03-18T16:55:52.750

2

This should unquestionably be done in integer. It will be both faster and more accurate.

This processor can do 32x32+64=64 multiply accumulate in a single cycle!

Multiply all your coefficients by a power of 2 (not 1000 mentioned in the comments), and then shift down at the end rather than divide.

uint32_t coef[256];

uint64_t moy = 0;

for(unsigned int c = 0; c < 256; c++)
{
   moy += (val[c] * (uint64_t)coef[(c - write_idx) & 0xFFu]);
}

moy >>= N;

edited Mar 18 '21 at 16:55

answered Mar 18 '21 at 14:32

Tom V

4,827
2
5
22

Hi, I see two issues,`moy` will overflow because my samples are 24 bits and `c-write_idx` will not wrap correctly without a cast. I think when you write `moy/=128` gcc already makes it a shift (i can check disasm), so it will not improve this part – GabrielT Mar 18 '21 at 14:54
My apologies I changed the size of the accumulator in the code from 32 to 16 bits when I meant to change it from 32 to 64 as I wrote in the text. – Tom V Mar 18 '21 at 16:56

Optimize a weighted moving average

How can I get it done faster ?

1 Answers1