4

When making a bit array (also called a bit set) in C, what data type should I use? Should I use an array of int? unsigned int? size_t? uint8_t? uintmax_t?

For me, using signed integer types is a no-no, as noted by other answers here on SO about signed integer left and right shifts (I lost the link to the answer).

Now, should I use the smallest available unsigned integer type or the biggest? Which one has the best performance?

Some of my thoughts: the majority of the bit arrays here on SO use an array of char or uint8_t, but I can't see how that would be better than using uintmax_t (and I haven't seen any arguments for why it is the way to go, hence this question). When performing operations like union and intersection between two bit arrays, the loop iterates fewer times with a bigger unsigned integer.

Edit: judging by some answers, I think some people got confused about what I'm asking. Sorry about that. What I'm trying to do is make a bit array where each bit can be individually accessed or set to either 1 or 0. I know I could just use a bool array, but that is not space efficient. You could say that nowadays RAM is big enough and the gain of bit arrays over boolean arrays is minimal, but that's not the point here. What I want to know is: given a bit array where each bit can be modified or accessed by a bit_index (which is different from the array's index), what data type should my array be?
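
To make it concrete, here is a minimal sketch of what I mean (the helper names are just illustrative; word_t stands for whatever type this question ends up recommending):

    #include <limits.h>
    #include <stddef.h>

    typedef unsigned word_t;                      /* <-- which type should this be? */
    #define WORD_BITS (sizeof(word_t) * CHAR_BIT) /* bits per array element */

    /* Read the bit at bit_index, which is independent of the array index. */
    int bit_get(const word_t *arr, size_t bit_index)
    {
        return (arr[bit_index / WORD_BITS] >> (bit_index % WORD_BITS)) & 1u;
    }

    /* Set the bit at bit_index to 1. */
    void bit_set(word_t *arr, size_t bit_index)
    {
        arr[bit_index / WORD_BITS] |= (word_t)1 << (bit_index % WORD_BITS);
    }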

LeoVen
  • Understand the types. All of them represent integers (not only `int`). Using signed integers is a no-no, but with unsigned you're safe. `char` is simpler, as it has `CHAR_BIT` bits; with `uintmax_t` you would need to count `sizeof(uintmax_t) * CHAR_BIT` bits, which is a bit longer to write. Also, `uintmax_t` can generate suboptimal code. I guess `int` would/should be the fastest and most memory efficient, but I would go for `char` for readability and scalability. You are asking an opinion-based question about how to approach your implementation. – KamilCuk Dec 23 '18 at 21:39
  • @KamilCuk Why do you think it is opinion based? I'm legitimately trying to find out whether there are any performance differences when using different data types for a bit array. – LeoVen Dec 23 '18 at 21:48
  • `typedef unsigned char byte` it's a type in some libraries already anyway – Chris Rollins Dec 23 '18 at 21:51
  • It's actually quite trivial to make a bit array that doesn't waste any memory: you can access individual bits using bitwise operations, and you can create an interface that abstracts this away for you. It may be a little awkward in C compared to an OOP language, but you can still do it. – Chris Rollins Dec 23 '18 at 21:54

6 Answers

2

This depends upon how many bits you need to keep track of, the efficiency of accessing a single bit, and how much memory you want to consume in keeping track of all these bits.

There are so many ways of doing this that, without further details, it's hard to answer.

What I've seen is a simple array of uint32_t, which keeps it packed with decent access speeds. Then, when accessing a bit, let's say bit 128, this would be bit 0 of the uint32_t at index 4 of the array.
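
As a rough sketch of that layout (a hypothetical snippet, not any particular library's API):

    #include <stdint.h>

    /* 256 bits packed into eight 32-bit words. */
    uint32_t bits[8];

    int read_bit_128(void)
    {
        /* Bit n lives in word n / 32 at position n % 32,
           so bit 128 is bit 0 of bits[4]. */
        return (bits[128 / 32] >> (128 % 32)) & 1u;
    }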

fdk1342
2

I would personally use size_t. On most platforms (not all, but probably all of the ones you care about), it is the same size as your CPU registers, which means that for operations that scan the entire bit vector it achieves the maximum number of bits processed per loop iteration (e.g. finding set bits, counting bits). For such algorithms, CPU instructions like bsf (bit scan forward) and clz (count leading zeros) will significantly speed up your algorithm.

Just for context, the Linux kernel uses unsigned long for bit vectors, which AFAIK is the same size as size_t on all Linux ABIs, but not on Windows (at least not with MSVC), where long is 32 bits even on x64.
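
For instance, here is a sketch of the kind of whole-vector scan this enables (assumes GCC/Clang, where the builtin __builtin_ctzll typically compiles to a single bsf/tzcnt instruction; the helper name is mine):

    #include <limits.h>
    #include <stddef.h>

    #define WORD_BITS (sizeof(size_t) * CHAR_BIT)

    /* Return the index of the first set bit, or (size_t)-1 if none.
       Scans one full register-sized word per iteration. */
    size_t find_first_set(const size_t *bits, size_t nwords)
    {
        for (size_t w = 0; w < nwords; w++)
            if (bits[w] != 0)
                return w * WORD_BITS + (size_t)__builtin_ctzll(bits[w]);
        return (size_t)-1;
    }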

Andrew Sun
1

What data type to use in a bit array (?)
... where each bit can be individually accessed or set to either 1 or 0.
... could just use a bool array but that is not space efficient.

You cannot get all you want: trade-offs need to be made.


For an N-bit "array", there are various approaches:

Array of _Bool: _Bool ar1[N];

  • Pro: Easy to index: ar1[i]
  • Pro: Only ever 2 values.
  • Con: Space inefficient - perhaps even worse than unsigned char ar2[N];

Array of smallest type: unsigned char ar2[N];

  • Pro: Easy to index: ar2[i]
  • Pro: No trap values and no padding.
  • Con: Can encode values 0,1 and others.
  • Con: Space inefficient

Array of packed unsigned char: unsigned char ar3[(N+CHAR_BIT-1)/CHAR_BIT];

  • Pro: Space efficient.
  • Pro: No trap values and no padding.
  • Con: Needs helper code to index: (ar3[i/CHAR_BIT] >> (i%CHAR_BIT))%2
  • Con: May have a few extra "array" elements.

Array of packed unsigned: unsigned ar4[(N+UNSIGNED_BIT-1)/UNSIGNED_BIT];

  • Pro: Space efficient.
  • Pro: Likely fast/faster than ar3 given use of native unsigned type.
  • Con: Needs helper code to index: (ar4[i/UNSIGNED_BIT] >> (i%UNSIGNED_BIT))%2
  • Con: May have a few extra "array" elements.
  • ***: The pedantic concern that unsigned may have padding bits results in a more complex bit-width determination, as UNSIGNED_BIT needs to be based on UINT_MAX rather than CHAR_BIT.

Conclusion

IMO, use _Bool ar1[N]; until the space/speed is demonstrated to be a problem. In which case I'd move to unsigned ar4[(N+UNSIGNED_BIT-1)/UNSIGNED_BIT];
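
A sketch of the ar4 approach (this sidesteps the pedantic padding concern above by assuming unsigned has no padding bits, so UNSIGNED_BIT can be derived from CHAR_BIT):

    #include <limits.h>

    #define N 1000
    #define UNSIGNED_BIT (sizeof(unsigned) * CHAR_BIT) /* assumes no padding bits */

    unsigned ar4[(N + UNSIGNED_BIT - 1) / UNSIGNED_BIT];

    /* Helper code to index bit i: */
    int  bit_get(unsigned i)   { return (ar4[i / UNSIGNED_BIT] >> (i % UNSIGNED_BIT)) % 2; }
    void bit_set(unsigned i)   { ar4[i / UNSIGNED_BIT] |=   1u << (i % UNSIGNED_BIT);  }
    void bit_clear(unsigned i) { ar4[i / UNSIGNED_BIT] &= ~(1u << (i % UNSIGNED_BIT)); }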


For me, using integer types is a no-no as noted by other answers here in SO about signed integer left and right shifts

OP's concern is overstated here. The major shift issues arise with signed types. Use unsigned types instead.


use an array of char or uint8_t, but I can't see how that would be better than using uintmax_t.

Presumably OP is talking about a "packed" array of bits here.

  • Con for uintmax_t. It obliges the allocated size to be a multiple of the bit width of uintmax_t, versus the easier-to-match uint8_t. There is wasted memory either way, just less with uint8_t.

  • Con for uint8_t. It is not always available (this is exceptional).

  • Con for char. It may be signed.

  • Con for uint8_t. Presumably as slow or slower than unsigned.

  • Con for uintmax_t. Unless the platform natively supports that wide type, the emitted code may be slower than the alternatives.

  • Con for uintmax_t. Wide types are more likely to need multiple instructions than narrow types. Of course, this varies between platforms.

Ideally using the widest native type would be best - this is often unsigned.

IMO, unsigned is the superior choice for packing.

chux - Reinstate Monica
  • Sorry I forgot to add the "signed integer types". Got it fixed now. Thanks. – LeoVen Dec 23 '18 at 22:35
  • For a bit array of 1 bit, `unsigned char` will probably be one byte followed by 15 bytes of padding to align whatever is next in memory (16 bytes total), and `unsigned long long` will probably be 8 bytes followed by 8 bytes of padding to align whatever is next in memory (16 bytes total). It's extremely unlikely that smaller sizes will actually use less memory in practice. – Brendan Dec 24 '18 at 01:06
  • @Brendan A 16-byte alignment on variables and arrays? What processor does that? - certainly not the billions of new embedded processors per year where memory is used far more conservatively than desktops. Typically objects of like size can be found grouped together. Or instead of _arrays_ are you commenting on 16 byte alignment on memory _allocations_ - something not brought up by OP. – chux - Reinstate Monica Dec 24 '18 at 02:47
1

The best option would be to benchmark as many schemes as possible. Depending on how many bits you intend to store and how often you read and write them, it may make sense to store each bit as an unsigned char (or even in an unsigned int), but packing 16 of them into a 16+ bit unsigned int can be a good trade-off between efficiency and ease of access. unsigned int is a good choice, but I wouldn't recommend it unless you can guarantee your intended architecture does not use padding bits or any surprise trap values. Any modern architecture probably has uint32_t (defined in stdint.h), which is my recommendation if you can't trust unsigned int, because you know its exact size and the standard guarantees it has no padding bits. If you know you will be running your code on a 64-bit architecture, uint64_t would be a better choice. Remember to benchmark, if possible.

Be warned that the standard requires operands of types smaller than int to be implicitly converted (in the C abstract machine) to int (or unsigned int if the value won't fit in an int), with the result converted back to the original _Bool, char, or short on assignment. This can lead to surprises sometimes.
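
For instance, a minimal sketch of one such surprise (assuming a 32-bit int):

    #include <stdint.h>

    uint32_t widen(uint8_t b)
    {
        /* b is promoted to (signed) int before the shift, so with a
           32-bit int, plain b << 24 can shift a 1 into the sign bit:
           undefined behavior, even though b itself is unsigned.
           Casting first keeps the arithmetic unsigned. */
        return (uint32_t)b << 24;
    }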

Grant Sanders
1

In general, the most efficient size when dealing with individual bits will probably be unsigned int. The largest size and the register size can be inefficient (e.g. on 64-bit 80x86, 64-bit instructions need "REX prefixes" and this will cause pointless bloat for no benefit).

For dealing with the entire bitset (e.g. searching, counting), if performance matters in the first place, then size mostly doesn't matter. For example (for SSE2), you could pack sixteen 8-bit integers into a 128-bit register, or eight 16-bit integers into a 128-bit register, or four 32-bit integers into a 128-bit register, or two 64-bit integers into a 128-bit register; and for all of these cases you'd be doing 128-bit operations regardless of what size the individual integers are.
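
For example, the intersection of two bit arrays can be written with the SSE2 intrinsic _mm_and_si128, and the code looks the same no matter what element type the arrays were declared with (a sketch assuming an x86 target with SSE2, a bit count that is a multiple of 128, and 16-byte-aligned data):

    #include <emmintrin.h>  /* SSE2 intrinsics */
    #include <stddef.h>
    #include <stdint.h>

    /* dst = a & b over nbits bits, processed 128 bits at a time. */
    void bitset_and(uint64_t *dst, const uint64_t *a, const uint64_t *b,
                    size_t nbits)
    {
        for (size_t i = 0; i < nbits / 64; i += 2) {
            __m128i va = _mm_load_si128((const __m128i *)&a[i]);
            __m128i vb = _mm_load_si128((const __m128i *)&b[i]);
            _mm_store_si128((__m128i *)&dst[i], _mm_and_si128(va, vb));
        }
    }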

However, efficiency isn't the only important thing, and using "un-fixed-size integers" (like unsigned int) means that you need to pollute your code with macros/#defines that make it hard to read (in an "Oh darn, I need to break my concentration and go track down some random piece of noise buried in a header file somewhere just to see what FOO actually is" way), while a fixed-size integer type (e.g. uint32_t) avoids that. Specifically, I would use (and have used) uint32_t without caring so much about performance.

You could say that nowadays we have RAMs that are big enough and the gain of bit arrays over boolean arrays is minimal but that's not the point here.

You could say that RAM is huge and relatively slow and caches are small and relatively fast, and performance requires that cache misses be minimised (to improve the effectiveness of caches and reduce the use of relatively slow RAM) by packing the most data into the least space. ;)

Brendan
0

You are right. It is common to use char or unsigned char for bit arrays. The reasoning behind this is purely efficiency-related: char only reserves 1 byte (8 bits) of your memory, whereas int usually needs 4 bytes (32 bits; this varies depending on your system and compiler). You do the math: if you only need to store a single bit, which one is more efficient to use?

TheEngineer
  • I think the asker is talking about storing multiple bits per int, rather than just using the lowest bit. – Nick ODell Dec 23 '18 at 21:30
  • I think the asker is trying to make an array which stores single-bit data per element and is trying to figure out which data type to use for that array. But if you are right and I am wrong, I apologize. – TheEngineer Dec 23 '18 at 21:33