I'm trying to implement a lossless audio codec that processes data coming in at roughly 190 kHz and stores it to an SD card over SPI using DMA. I've found that the algorithm basically works, but it has certain bottlenecks that I can't seem to overcome. I was hoping to get some advice on how best to optimize the portion of the code that I found to be the slowest. I'm writing in C on a TI DSP with -O3 optimization.
/* Pack bfp_bits bits of buf_filt[i], MSB first, into encoded_data
   (16-bit words, filling from bit 0 upward). */
for (j = packet_to_write.bfp_bits; j > 0; j--)
{
    encoded_data[filled / 16] |= ((buf_filt[i] >> (j - 1)) & 1) << (filled % 16);
    filled++;
}
In this section of code, I take X bits from the original data and pack them into a buffer of encoded data. (I split the `filled++` out of the big expression, since reading and incrementing `filled` in the same statement is unsequenced and technically undefined behavior.) I've found that the loop is fairly costly, and once a block of data needs 8 or more bits per sample, this code is too slow for my application. Loop unrolling doesn't really work here, since each block of data can be represented by a different number of bits. The "filled" variable is a running bit counter that fills up Uint16 indices in the encoded_data buffer.
I'd like some help understanding where bottlenecks may come from in this snippet of code (and hopefully I can take those findings and apply that to other areas of the algo). The authors of the paper that I'm reading (whose algorithm I'm trying to replicate) noted that they used a mixture of C and assembly code, but I'm not sure how assembly would be useful in this case.
Finally, the code itself is functional and I have done some extensive testing on actual audio samples. It's just not fast enough for real-time!
Thanks!