11

What is the fastest way to write a bitstream on x86/x86-64? (codeword <= 32bit)

By writing a bitstream I mean the process of concatenating variable bit-length symbols into a contiguous memory buffer.

Currently I've got a standard container with a 32-bit intermediate buffer to write to:

void write_bits(SomeContainer<unsigned int>& dst, unsigned int& buffer, unsigned int& bits_left_in_buffer, unsigned int codeword, short bits_to_write){
    if(bits_to_write < bits_left_in_buffer){
        // codeword fits entirely into the intermediate buffer
        buffer |= codeword << (32 - bits_left_in_buffer);
        bits_left_in_buffer -= bits_to_write;
    }else{
        // fill the intermediate buffer, flush it, and start over with the leftover bits
        unsigned int full_bits = bits_to_write - bits_left_in_buffer;
        unsigned int towrite = buffer | (codeword << (32 - bits_left_in_buffer));
        buffer = full_bits ? (codeword >> bits_left_in_buffer) : 0;
        dst.push_back(towrite);
        bits_left_in_buffer = 32 - full_bits;
    }
}

Does anyone know of any nice optimizations, fast instructions or other info that may be of use?

Cheers,

Magnus
  • Could you explain in words what the aim of this code is? (i.e., what do you mean by "write a bitstream"?) – Oliver Charlesworth Apr 18 '11 at 14:51
  • @Oli Charlesworth, a bitstream is a series of integers of varying bit-length without padding. Typically the integers have been encoded so that their length is implicit, for example huffman coding, and so a reader can extract the original values without any other information. – ergosys Apr 18 '11 at 15:36
  • (1) Are these bitstrings variable length? Is there any predictability in the lengths of those bitstrings? (2) Have you tried using 64-bit integers on x86-64? – rwong Apr 18 '11 at 15:37
  • [High-speed software implementation of Huffman coding](http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=672291) Kawahara, M.; Yi-Jen Chiu; Berger, T. – rwong Apr 18 '11 at 15:39
  • Here is a possibly faster implementation: https://stackoverflow.com/questions/17320643/is-it-possible-to-do-memcpy-in-bits-instead-of-bytes/71347247#71347247 – Andry Mar 04 '22 at 06:35

4 Answers

6

I once wrote a quite fast implementation, but it has several limitations: it only works on 32-bit x86, both when you write and when you read the bitstream, and I don't check for buffer limits here; I allocated a larger buffer and checked it from time to time in the calling code.

unsigned char* membuff;
unsigned bit_pos; // current BIT position in the buffer, so its max size is 512 MB

// masks with the lowest N bits set, for N = 0..17
static const unsigned int BIT_MASKS[18] = {
    0x00000, 0x00001, 0x00003, 0x00007, 0x0000F, 0x0001F,
    0x0003F, 0x0007F, 0x000FF, 0x001FF, 0x003FF, 0x007FF,
    0x00FFF, 0x01FFF, 0x03FFF, 0x07FFF, 0x0FFFF, 0x1FFFF
};

// input bit buffer: the byte address is rounded down so that it's even, so the DWORD
// read from that address will surely have at least 17 bits available from the current position
inline unsigned int get_bits(unsigned int bit_cnt){ // bit_cnt MUST be in range 0..17
    unsigned int byte_offset = bit_pos >> 3;
    byte_offset &= ~1u;  // round down to an even byte address
    unsigned int bits = *(unsigned int*)(membuff + byte_offset);
    bits >>= bit_pos & 0xF;
    bit_pos += bit_cnt;
    return bits & BIT_MASKS[bit_cnt];
}

// output buffer: the whole destination must be memset to 0 beforehand
inline void put_bits(unsigned int val, unsigned int bit_cnt){ // bit_cnt MUST be in range 0..17
    unsigned int byte_offset = bit_pos >> 3;
    byte_offset &= ~1u;  // round down to an even byte address
    *(unsigned int*)(membuff + byte_offset) |= val << (bit_pos & 0xF);
    bit_pos += bit_cnt;
}
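
A quick usage sketch, assuming the definitions above, a zero-initialised and over-allocated buffer, and bit counts kept within the 0..17 limit:

#include <cstdio>

int main(){
    static unsigned char storage[1024 + 4] = {0};  // zeroed, with slack for the DWORD-wide writes
    membuff = storage;
    bit_pos = 0;
    put_bits(0x5, 3);       // write the 3-bit value 101b
    put_bits(0x1FFFF, 17);  // write 17 one-bits
    bit_pos = 0;            // rewind and read the symbols back
    std::printf("%u %u\n", get_bits(3), get_bits(17));  // expected output: 5 131071
    return 0;
}
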
ruslik
  • Looks like both get_bits and put_bits have the limitation of no more than 17 bits at a time. – Mark Ransom Apr 19 '11 at 22:24
  • @Mark Yes. Or 25 bits, without rounding. But 17 was enough for me, so I preferred even addresses. – ruslik Apr 19 '11 at 22:27
  • You save an instruction by not rounding, so wouldn't it be faster? I suspect with L1 cache there's no penalty for unaligned memory operations. – Mark Ransom Apr 19 '11 at 22:32
  • On most hardware there is still a _potential_ penalty for unaligned access, but the trend in x86 has been towards a smaller penalty. Much current hardware has no penalty for (non-SIMD) unaligned reads as long as they don't cross a cache line, and a small penalty if they do cross a cache line (but you may not observe it if you aren't load-limited). – BeeOnRope Nov 23 '16 at 20:40
4

It's hard to answer in general because it depends on many factors such as the distribution of bit-sizes you are reading, the call pattern in the client code and the hardware and compiler. In general, the two possible approaches for reading (writing) from a bitstream are:

  1. Using a 32-bit or 64-bit buffer and conditionally reading (writing) from the underlying array when you need more bits. That's the approach your write_bits method takes.
  2. Unconditionally reading (writing) from the underlying array on every bitstream read (write) and then shifting and masking the resultant values.
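
For concreteness, a rough sketch of what a style-(2) writer can look like (assumptions: a little-endian target, a zero-initialized output buffer with a few bytes of writable slack past the end, and codes of at most 25 bits so one unaligned 32-bit word always covers the write):

#include <cstddef>
#include <cstdint>
#include <cstring>

// Approach (2): every write does an unconditional unaligned load/or/store on the
// underlying buffer - no data-dependent branches, at the cost of touching memory
// on every call.
struct BitWriter {
    uint8_t* data;        // zero-initialized, with >= 3 bytes of slack after the last code
    size_t bit_pos = 0;   // absolute bit position in the output

    void write(uint32_t code, unsigned bit_cnt) {                 // bit_cnt in 1..25
        uint32_t word;
        std::memcpy(&word, data + (bit_pos >> 3), sizeof word);   // unaligned load
        word |= code << (bit_pos & 7);                            // shift by the in-byte offset
        std::memcpy(data + (bit_pos >> 3), &word, sizeof word);   // unaligned store
        bit_pos += bit_cnt;
    }
};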

The primary advantages of (1) include:

  • It reads from the underlying buffer only the minimum required number of times, in an aligned fashion.
  • The fast path (no array read) is somewhat faster since it doesn't have to do the read and associated addressing math.
  • The method is likely to inline better since it doesn't have reads - if you have several consecutive read_bits calls, for example, the compiler can potentially combine a lot of the logic and produce some really fast code.

The primary advantage of (2) is that it is completely predictable - it contains no unpredictable branches.

Just because there is only one advantage for (2) doesn't mean it's worse: that advantage can easily overwhelm everything else.

In particular, you can analyze the likely branching behavior of your algorithm based on two factors:

  • How often will the bitstream need to read from the underlying buffer?
  • How predictable is the number of calls before a read is needed?

For example, if you are reading 1 bit 50% of the time and 2 bits 50% of the time, you will do 64 / 1.5 = ~42 reads (if you can use a 64-bit buffer) before requiring an underlying read. This favors method (1) since reads of the underlying array are infrequent, even if mis-predicted. On the other hand, if you are usually reading 20+ bits, you will read from the underlying array every few calls. This is likely to favor approach (2), unless the pattern of underlying reads is very predictable. For example, if you always read between 22 and 30 bits, you'll perhaps always take exactly three calls to exhaust the buffer and read the underlying¹ array. So the branch will be well-predicted and (1) will stay fast.

Similarly, it depends on how you call these methods, and how the compiler can inline and simplify the code. Especially if you ever call the methods repeatedly with a compile-time constant size, a lot of simplification is possible. Little to no such simplification is available when the codeword size is only known at run time.

Finally, you may be able to get increased performance by offering a more complex API. This mostly applies to implementation option (1). For example, you can offer an `ensure_available(unsigned size)` call which ensures that at least size bits (usually limited to the buffer size) are available to read. Then you can read up to that number of bits using unchecked calls that don't check the buffer size. This can help you reduce mis-predictions by forcing the buffer fills onto a predictable schedule, and it lets you write simpler unchecked methods.
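
A rough sketch of how that split API could look on the reading side (the names and the 64-bit buffer layout are purely illustrative):

#include <cstddef>
#include <cstdint>

// Illustrative reader split into one checked refill and several unchecked reads.
struct SplitBitReader {
    const uint8_t* src;   // input bytes, with readable slack at the end
    size_t byte_pos = 0;
    uint64_t buf = 0;     // bit buffer, least-significant bits are the next to be read
    unsigned avail = 0;   // number of valid bits currently in buf

    // Checked call: guarantee at least 'size' buffered bits (size <= 56 here).
    void ensure_available(unsigned size) {
        while (avail < size) {                          // the only branch that touches memory
            buf |= uint64_t(src[byte_pos++]) << avail;  // byte-wise refill
            avail += 8;
        }
    }

    // Unchecked call: the caller has already guaranteed enough bits (bit_cnt in 1..31).
    uint32_t read_unchecked(unsigned bit_cnt) {
        uint32_t value = uint32_t(buf) & ((1u << bit_cnt) - 1u);
        buf >>= bit_cnt;
        avail -= bit_cnt;
        return value;
    }
};

// Typical call pattern: one refill on a predictable schedule, then a burst of unchecked reads:
//   r.ensure_available(32);
//   a = r.read_unchecked(13); b = r.read_unchecked(7); c = r.read_unchecked(12);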


¹ This depends on exactly how your "read from underlying" routine is written, as there are a few options here: some always fill to 64 bits, some fill to between 57 and 64 bits (i.e., read an integral number of bytes), and some may fill to between 32 and 64 bits (like your example, which reads 32-bit chunks).

BeeOnRope
  • Nice answer. Symbol sizes in my case range between 1 and 32 bits, based on entropy-coded residuals which roughly follow an inverse exponential distribution. What I found was that the suggested style from above seemed to have a performance benefit across a large set of inputs. – Magnus Nov 01 '17 at 00:20
2

You'll probably have to wait until 2013 to get hold of real HW, but the "Haswell" new instructions will bring proper vectorised shifts (i.e. the ability to shift each vector element by a different amount specified in another vector) to x86/AVX. Not sure of the details (plenty of time to figure them out), but that will surely enable a massive performance improvement in bitstream construction code.
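
Those per-element shifts did ship with Haswell as AVX2's VPSLLVD/VPSRLVD. As a minimal illustration of just the variable shift itself (not a complete bitstream writer; the values are arbitrary):

#include <immintrin.h>   // AVX2 (compile with -mavx2 or equivalent)
#include <cstdio>

int main() {
    // Eight 32-bit codes and eight per-lane shift amounts.
    __m256i codes  = _mm256_setr_epi32(1, 2, 3, 4, 5, 6, 7, 8);
    __m256i shifts = _mm256_setr_epi32(0, 1, 2, 3, 4, 5, 6, 7);

    // VPSLLVD: shift each 32-bit lane left by the count in the matching lane.
    __m256i shifted = _mm256_sllv_epi32(codes, shifts);

    alignas(32) unsigned out[8];
    _mm256_store_si256(reinterpret_cast<__m256i*>(out), shifted);
    for (unsigned v : out) std::printf("%u ", v);   // prints: 1 4 12 32 80 192 448 1024
    return 0;
}

The remaining difficulty, as the comments below point out, is merging the shifted lanes into one contiguous stream; the variable shift alone doesn't solve that.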

timday
  • Unless I am mistaken, it doesn't help as much as you'd think. If you start with calls to a `put_bits` method, how do you vectorize this? You can shove the bits passed in each call into a vector (that's already expensive though), but then how do you go from there to a bitstream? You can do all the vectorized shifts, but then you'd need to horizontally or-fold the vector together. Overall it is not very obvious to me how you'd vectorize this even with variable shifts. FWIW if you *could* vectorize it well, you could probably do it without HSW using multiplications to emulate shifts. – BeeOnRope Nov 23 '16 at 20:22
  • For the generic problem it won't help much. However, using vectorized shifts and masked scattered writes you can make quite a neat n-way parallel bitstream. For large regions of non-aliased symbol sizes you can achieve quite reasonable compactness of SIMD-lane-interleaved output. – Magnus Nov 01 '17 at 00:05
1

I don't have the time to write it for you (and I'm not too sure your sample is actually complete enough to do so), but if you must, I can think of:

  • using translation tables for the various input/output bit-shift offsets. This optimization would make sense for fixed units of n bits (with n sufficiently large (8 bits?) to expect performance gains). In essence, you'd be able to do

    destloc &= (lookuptable[bits_left_in_buffer][input_offset][codeword]);
    

Disclaimer: this is very sloppy pseudo-code; I just hope it conveys my idea of a lookup table to prevent bit-shift arithmetic. A more concrete sketch follows this list.

  • writing it in assembly (I know i386 has XLAT, but then again, a good compiler might already use something like that); also, XLAT seems limited to 8 bits and the AL register, so it's not really versatile
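
For the first bullet, a more concrete (still purely illustrative) sketch for fixed 8-bit units: precompute every byte value pre-shifted to every bit offset, so the inner loop does a table load and two ORs instead of computing shifts. The table name and layout are invented for the example, and the output buffer is assumed to be zeroed.

#include <cstddef>
#include <cstdint>

// shifted_byte[off][b] == uint16_t(b) << off, for bit offsets 0..7 - about 4 KB,
// so it should sit comfortably in L1.
static uint16_t shifted_byte[8][256];

static void init_table() {
    for (unsigned off = 0; off < 8; ++off)
        for (unsigned b = 0; b < 256; ++b)
            shifted_byte[off][b] = static_cast<uint16_t>(b << off);
}

// Append one whole byte at an arbitrary bit position using the table.
static void put_byte(uint8_t* dst, size_t& bit_pos, uint8_t value) {
    uint16_t v = shifted_byte[bit_pos & 7][value];  // table lookup replaces the variable shift
    size_t byte = bit_pos >> 3;
    dst[byte]     |= static_cast<uint8_t>(v);       // low half of the shifted byte
    dst[byte + 1] |= static_cast<uint8_t>(v >> 8);  // spill into the following byte
    bit_pos += 8;
}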

Update

Warning: be sure to use a profiler and test your optimization for correctness and speed. Using a lookup table can result in poorer performance because of locality of reference. So you might need to pin the bit-streaming thread to a single core (set thread affinity) to get the benefits, and you might have to adapt the lookup table size to the processor's L2 cache.

Also, have a look at SIMD (SSE4) or GPU (CUDA) instruction sets if you know you'll have certain features at your disposal.

sehe
  • added references to libs for specific instruction sets – sehe Apr 18 '11 at 15:06
  • SSE is useless for bitstream stuff on account of not having a properly vectorized shift (it can only shift 128-bit values by whole bytes, and can only shift packed types by the same amount across the entire register) – Magnus Apr 18 '11 at 15:23
  • Lookup tables are interesting. In my experience "in-place" processing tends to be quicker than look-up tables (for low-level operations) on modern machines, so I'm not convinced. I'll be sure to give it a try though to see how it performs. – Magnus Apr 18 '11 at 15:26
  • @Magnus: SSE is useless _unless_ you are able to combine it with the lookup table idea to prevent (low-distance) bit-shifting in the first place. Also, I do not have assumptions on the amount of bits to be streamed at once. – sehe Apr 18 '11 at 16:05
  • Also, just gleaned [this useful link](http://gcc.gnu.org/projects/tree-ssa/vectorization.html#using) from a [recent answer on SO](http://stackoverflow.com/questions/5713776/optimizing-this-algorithm) – sehe Apr 19 '11 at 10:33
  • Yes, it's fair to say I didn't mention the range of symbol sizes - they're small, 1-32 bits, in this case. For bit sizes in the 100s and 1000s, I'd expect the use of SSE/AVX to be quite important. In that case the logic to deal with the uneven tail of bits may be less important compared to the movement of the data itself. – Magnus Nov 01 '17 at 00:32