
I'm using a BitStream in my code that has a read_bit() function. This function is called very often (more than a billion times for a single stream). This is what the BitStream struct looks like:

typedef struct BitStream {
    unsigned char* data;
    unsigned int size;
    unsigned int currentByte;
    unsigned char buffer;
    unsigned char bitsInBuffer;
} BitStream;

And the read_bit() function is defined as follows:

unsigned char bitstream_read_bit(BitStream* stream, unsigned long long bitPos) {
    unsigned int byte = bitPos / 8;
    unsigned char byteVal = stream->data[byte];
    unsigned char mask = 128 >> (bitPos & 7);
    if (mask & byteVal) {
        return 1;
    } else {
        return 0;
    }
}

Now, I found out through trial and error that the line `unsigned char mask = 128 >> (bitPos & 7);` is very slow. Is there some way I can speed up the check of a bit? I've already tried an array that indexes the 8 possible masks, but it isn't faster (I suspect due to the memory access).

EDIT: I tried a lot of the answers over the past week and ran a lot of benchmarks, but there wasn't much performance improvement. I eventually managed to get a 10-second improvement by reversing the order of the bits in the bitstream. So instead of using the mask `128 >> (bitPos & 7)`, I use the function:

unsigned char bitstream_read_bit_2(BitStream* stream, const unsigned long long bitPos) {
    unsigned int byte = (unsigned int) (bitPos / 8);
    unsigned char byteVal = stream->data[byte];
    unsigned char mod = bitPos & 7;
    return (byteVal & (1 << mod)) >> mod;
}

I have obviously also changed the corresponding write-function.
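The write function itself isn't shown in the question; a minimal sketch of what the LSB-first counterpart might look like (the struct is the asker's, but the function body and name are assumptions):

```c
typedef struct BitStream {      /* same layout as in the question */
    unsigned char* data;
    unsigned int size;
    unsigned int currentByte;
    unsigned char buffer;
    unsigned char bitsInBuffer;
} BitStream;

/* Hypothetical LSB-first write counterpart to bitstream_read_bit_2:
 * clear the target bit, then OR in the new value. */
void bitstream_write_bit_2(BitStream* stream, unsigned long long bitPos,
                           unsigned char bit) {
    unsigned int byte = (unsigned int)(bitPos / 8);
    unsigned char mod = (unsigned char)(bitPos & 7);
    stream->data[byte] = (unsigned char)((stream->data[byte] & ~(1u << mod))
                                       | ((unsigned)(bit & 1u) << mod));
}
```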

  • How slow is it at the moment? How "slow" (but faster than current) is acceptable? How much memory can you dedicate for this? Can you include the disassembly of the current implementation? – Amit Oct 22 '16 at 16:40
  • The particular line uses about 10s from in total 28s. It should at least be possible to make it work in 5s (or less). I can dedicate quite a bit of memory for this (at least 10MB). I will post the disassembly soon. Thank you in advance –  Oct 22 '16 at 16:42
  • Replace the `128 >> (bitPos & 7)` by a static mask array. – J. Piquard Oct 22 '16 at 16:43
  • that saves time also: `return ((mask & byteVal)!=0);`. And maybe `bitPos / 8` => `bitPos >> 3`. Which optimization options are you using? – Jean-François Fabre Oct 22 '16 at 16:47
  • I'm using the Windows Compiler with -O2 optimization level (= Maximize speed) @J.Piquard I've tried the static mask array, but it's just as slow. I've placed it simply at the top of the BitStream.c file, or should I declare it somewhere else? –  Oct 22 '16 at 16:48
  • What's the full scenario? How often do you spend these 28 seconds, and why is it important to reduce that to 23? If you're calling the function 1e+9 times, you probably do this sequentially - you should use this for your advantage. – Amit Oct 22 '16 at 16:50
  • could you use `unsigned long bitPos` ? I don't see the point of the `long long` here unless you work with a buffer > 500Megs. – Jean-François Fabre Oct 22 '16 at 16:51
  • Instead of a mask array, a lengthy `switch` might be more effective – Amit Oct 22 '16 at 16:51
  • I tried the switch, but it has no effect. I've changed the unsigned long long to unsigned long because I indeed do not need a 500MB buffer :) Thank you. But there's not a lot of improvement for the moment –  Oct 22 '16 at 16:59
  • Did you try writing it as a macro rather than a function to be called? The function call is probably more costly than your little mask computation. Also take care about @Amit's first remark... – Jean-Baptiste Yunès Oct 22 '16 at 17:03
  • Try these macros http://c-faq.com/misc/bitsets.html . – Petr Skocik Oct 22 '16 at 17:19
  • Can you maybe read multiple bits in one call? E.g. unpack a whole byte into an 8 byte vector. Should be possible with a single multiplication I think. Optimizing this code is going to be very hard because all it does is super cheap. So maybe optimize higher-level. – usr Oct 22 '16 at 17:26
  • Since you're using division, can you use `div()` (or `ldiv()` or `lldiv()`) from `<stdlib.h>` to get the quotient and the remainder in one operation (`div_t r = div(bitPos, 8);` and then use `r.quot` in place of `byte` and `r.rem` in place of `mask`)? Does it alter the speed at all? – Jonathan Leffler Oct 22 '16 at 17:39
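usr's suggestion of unpacking a whole byte per call can indeed be done branch-free with a single multiplication. A sketch (LSB-first output; the constants are the usual SWAR bit-spreading masks, and the function name is made up for illustration):

```c
#include <stdint.h>

/* Expand the 8 bits of b into 8 result bytes: byte i of the
 * result is bit i of b (each byte ends up as 0 or 1). */
static uint64_t unpack_byte(unsigned char b) {
    uint64_t x = b * 0x0101010101010101ULL;  /* replicate b into every byte   */
    x &= 0x8040201008040201ULL;              /* byte i keeps only bit i of b  */
    x += 0x7F7F7F7F7F7F7F7FULL;              /* bit 7 of byte i = bit i of b  */
    return (x >> 7) & 0x0101010101010101ULL; /* normalize each byte to 0 or 1 */
}
```

One multiply, two ANDs, an add, and a shift replace eight separate mask-and-test operations, which fits well with a loop that consumes the stream byte by byte.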

3 Answers


The obvious first improvement is to shift the loaded value rather than the mask:

unsigned char bitstream_read_bit(BitStream* stream, unsigned long long bitPos) {
    unsigned int byte = bitPos / 8;
    unsigned char byteVal = stream->data[byte];
    unsigned char maskVal = byteVal >> (bitPos & 7);
    return maskVal & 1;
}

This removes the need for a conditional (no `if`, `!`, or `?:`).

If you can modify the struct, I'd recommend accessing by larger units than bytes:

#include <stddef.h>
#include <limits.h>
#include <stdbool.h>

typedef struct WBitStream
{
  size_t *data;
  size_t size;
} WBitStream;

bool Wbitstream_read_bit(WBitStream* stream, size_t bitPos)
{
  size_t location = bitPos / (sizeof(size_t)*CHAR_BIT);
  size_t locval = stream->data[location];
  size_t maskval = locval >> (bitPos & (sizeof(size_t)*CHAR_BIT-1));
  return maskval & 1;
}

On some processors (notably the common x86), masking the shift amount is a no-op, since the processor's native shift instruction only considers the low bits of the shift amount anyway. At least gcc knows about this.
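Building on this, if the bits really are consumed strictly in order (as the comments suggest), the per-call division can disappear entirely by draining one word at a time. A sketch, with struct and field names invented here rather than taken from the asker's code:

```c
#include <stddef.h>
#include <limits.h>

typedef struct SeqBitStream {
    const size_t *data;  /* word-aligned bit data        */
    size_t word;         /* word currently being drained */
    size_t next;         /* index of the next word       */
    unsigned left;       /* bits remaining in `word`     */
} SeqBitStream;

/* Return the next bit, LSB-first within each word; refill from
 * data[] only once every sizeof(size_t)*CHAR_BIT calls. */
static unsigned seq_next_bit(SeqBitStream *s) {
    if (s->left == 0) {
        s->word = s->data[s->next++];
        s->left = sizeof(size_t) * CHAR_BIT;
    }
    unsigned bit = (unsigned)(s->word & 1u);
    s->word >>= 1;
    s->left--;
    return bit;
}
```

The hot path is then a shift, an AND, and a decrement, with no address computation per bit.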

EOF

I tested an optimized macro against your initial source code:

static unsigned char tMask[8] = { 128, 64, 32, 16, 8, 4, 2, 1 };

#define BITSTREAM_READ_BIT1(stream, bitPos) (((128 >> ((bitPos) & 7)) & (stream)->data[(bitPos) >> 3])!=0)
#define BITSTREAM_READ_BIT2(stream, bitPos) (((tMask[(bitPos) & 7]) & (stream)->data[(bitPos) >> 3])!=0)

Replacing the mask computation with a mask array doesn't improve performance. The main gap is between the function and the macro (the macro is about 6 times faster on my computer over 80,000,000 calls).

And a `static inline` function is not far behind the macro.
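For comparison, the `static inline` variant being measured might look like this (a sketch matching BITSTREAM_READ_BIT1's MSB-first order; the function name is invented, and modern compilers inline it just like the macro):

```c
/* MSB-first single-bit read, equivalent to the BITSTREAM_READ_BIT1 macro. */
static inline unsigned char bitstream_read_bit_inl(const unsigned char *data,
                                                   unsigned long long bitPos) {
    return (unsigned char)((data[bitPos >> 3] >> (7u - (bitPos & 7u))) & 1u);
}
```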

J. Piquard

Here's how I initially optimized your code:

unsigned char bitstream_read_bit(BitStream* stream, unsigned long long bitPos) 
{
    return !!(stream->data[(bitPos / 8)] & (128 >> (bitPos % 8)));
}

But the function call overhead itself is likely more instructions than the bit tweaking code inside it. So if you really want to optimize it even further, let's take advantage of inlining and just convert it to a macro:

#define bitstream_read_bit(stream, bitPos) (!!((stream)->data[((bitPos) / 8)] & (128 >> ((bitPos) % 8))))
selbie
  • It doesn't matter. The function call overhead is far more than the cost of having an inefficient bit-tweaking operation. But that doesn't mean we can't combine both of our solutions together. – selbie Oct 22 '16 at 17:56
  • Or take advantage of inlining by prefixing the function with `static inline`? – Jonathan Leffler Oct 22 '16 at 18:06
  • @Mike Last time I checked, gcc was able to optimize / and % with power-of-two intermediates. They should be no slower than the equivalent bit ops. – Petr Skocik Oct 22 '16 at 19:38