Improving speed of bit copying in a lossless audio encoding algorithm (written in C)

Question

I'm trying to implement a lossless audio codec that will be able to process data coming in at roughly 190 kHz to then be stored to an SD card using SPI DMA. I've found that the algorithm basically works, but has certain bottlenecks that I can't seem to overcome. I was hoping to get some advice on how to best optimize a certain portion of the code that I found to be the "slowest". I'm writing in C on a TI DSP and am using -O3 optimization.

        for (j = packet_to_write.bfp_bits; j>0; j--)
        {
            encoded_data[(filled/16)] |= ((buf_filt[i] >> (j- 1)) & 1) << (filled++ % 16);

        }

In this section of code, I am taking X number of bits from the original data and fitting it into a buffer of encoded data. I've found that the loop is fairly costly and when I am working with a set of data represented by 8+ bits, then this code is too slow for my application. Loop unrolling doesn't really work here since each block of data can be represented by a different number of bits. The "filled" variable represents a bit counter filling up Uint16 indices in the encoded_data buffer.

I'd like some help understanding where bottlenecks may come from in this snippet of code (and hopefully I can take those findings and apply that to other areas of the algo). The authors of the paper that I'm reading (whose algorithm I'm trying to replicate) noted that they used a mixture of C and assembly code, but I'm not sure how assembly would be useful in this case.

Finally, the code itself is functional and I have done some extensive testing on actual audio samples. It's just not fast enough for real-time!

Thanks!

What throughput and latency does right shifting have on that DSP? Does it have any bitfield instructions that the compiler might not be taking advantage of, or that might be useful for a modification of the loop? Is the compiler correctly hoisting `filled` out of the loop, and keeping the `encoded_data[filled/16]` in a register across the inner loop instead of store/reload? — Peter Cordes, Aug 15 '21 at 10:24
This code might actually invoke *undefined behavior* due to modifying `filled++` while it is also being used earlier in the same sequence (I'm not 100% sure about this, but it looks very suspicious) — UnholySheep, Aug 15 '21 at 10:25
IIUC you are reversing the bits from the input and packing them in the output. Try using one of the fast bit-reverse methods such as a 256-byte table, then shift and mask X bits at a time. — stark, Aug 15 '21 at 11:36
I see `buf_filt[i]`. Where does the `i` come from? Is this a nested loop? To get a real answer, you'll have to provide more code. There may very well be ways to do your processing without nested loops. Also, what is your entire data path? What are your timeline requirements? You might want to consider just capturing the raw data on the DSP and doing this processing on something faster or when you have more time in the processing pipeline. — Andrew Henle, Aug 15 '21 at 11:52
@PeterCordes *keeping the `encoded_data[filled/16]` in a register across the inner loop* We don't have the complete code. The `filled++` could change the result of that calculation as the inner loop iterates. But taking your comment further, I suspect there's a good chance this inner loop could be eliminated entirely - the question implies that `encoded_data` is an array of `uint16_t` values, so each array element likely can be directly assigned in full, without having to loop through each bit. But without the full code, it's impossible to say. — Andrew Henle, Aug 15 '21 at 11:59
@UnholySheep I don't *think* this invokes UB, but it definitely would be more readable if that one long `encoded_data[(filled/16)] |= ((buf_filt[i] >> (j- 1)) & 1) << (filled++ % 16);` line were split over multiple lines, even if temporary variables need to be used. If the compiler is so crappy that it can't eliminate such temporaries when optimizing, **then** the operations can be crammed into one hard-to-read line. — Andrew Henle, Aug 15 '21 at 12:03
At a glance, it looks like you're reading some consecutive number of bits one bit at a time. You can probably use masking and shifting of whole sets of bits instead. This sort of thing could likely be optimized to align the two to make copying the middle cheaper. Related: https://stackoverflow.com/a/3535337 — Hasturkun, Aug 15 '21 at 12:06
@Andrew Henle: I'm almost certain this is UB; the entire statement is a single sequence (that is, the only sequence point is the end of the statement) and filled is both modified and used for a value computation. The standard considers those operations unsequenced and therefore UB. — sj95126, Aug 15 '21 at 13:59
@sj95126 And our little discussion is exactly why I almost ***never*** use the `++` operator embedded in a dense, confusing line of code like in this very question. Put `filter++;` on a separate line after the value is used and there's no confusion or possible UB. Because even if it's not UB now, some minor change in the future could result in UB. "B-b-b-but I was taught to cram as much code into one line as possible." "Not by anyone who's ever been called in at 3 AM to fix a bug caused by confusing code." — Andrew Henle, Aug 15 '21 at 14:19
"_kHz_" - do you mean _kbps_ (bits) or _ksps_ (samples) perhaps? Hz is an ambiguous term in this context. — Clifford, Aug 15 '21 at 16:50
You probably need to identify the "_bottlenecks_" before attempting to optimise specific code that may not be the issue. You mention writing to SD cards, they notoriously have variable latency - often at specific boundaries depending on the card implementation and performance - and individual cards vary greatly. The issue was the subject of one of my first questions on SO https://stackoverflow.com/questions/3307428/how-can-i-use-an-sd-card-for-logging-16-bit-data-at-48-ksamples-s which may be of interest. — Clifford, Aug 15 '21 at 16:59

Matt Timmermans · Answer 1 · 2021-08-15T15:05:21.847

You really need to change the representation that you use for the output data. Instead of just a target buffer and the number of bits written, expand this to:

//complete words that have been written
uint16_t *encoded_data;

//number of complete words that have been written
unsigned filled_words;

//bits waiting to be written to encoded_data, LSB first
uint32_t encoded_bits;

//number of bits in encoded_bits
unsinged filled_bits;

This uses a single 32-bit word to buffer bits until we have enough to write out a complete uint16_t. This greatly simplifies the shifting and masking, because you always have at least 16 free bits to write into.

Then you can write out n bits of any source word like this:

void write_bits(uint16_t bits, unsigned n) {
    uint32_t mask = ((uint32_t)0x0FFFF) >> (16-n);
    encoded_bits |= (bits&mask) << filled_bits;
    filled_bits += n;
    if (filled_bits >= 16) {
        encoded_data[filled_words++] = (uint16_t)encoded_bits;
        encoded_bits >>= 16;
        filled_bits -= 16;
    }
}

and instead of your loop, you just write

write_bits(buf_filt[i], packet_to_write.bfp_bits);

No one-bit-at-a-time operations are required.

Improving speed of bit copying in a lossless audio encoding algorithm (written in C)

1 Answers1