
Lately I have been working with large arrays that are full of boolean values. Currently, I store them in the .bss section with a .space directive, which allows me to create arrays of bytes. However, as I only need to store boolean values, I wish to read and write data bit-by-bit to and from an array.

Currently, the best way I can think of is to use a .space directive with 1/8th of the required storage and apply the ((1 << k) & n) formula to get and set individual bits, where k is the bit index and n is the data. However, this seems rather clunky, and I'd like to know if there is a more elegant solution (preferably in AT&T syntax). Thanks.
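
Roughly, the kind of thing I'm doing now looks like this (just a sketch; flags is a stand-in for my array label, and I'm assuming the bit index k is in ECX):

# test bit k (ECX) of the byte array at flags
   lea    flags(%rip), %rdi      # base address of the array
   mov    %ecx, %esi
   shr    $3, %esi               # byte index = k / 8
   movzbl (%rdi,%rsi), %edx      # n = the byte containing bit k
   and    $7, %ecx               # k % 8; the shift count has to be in CL
   mov    $1, %eax
   shl    %cl, %eax              # 1 << (k % 8)
   test   %eax, %edx             # ZF clear if ((1 << k) & n) is non-zero, i.e. the bit is set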

  • That sounds like the normal way to do it. I don't think there are any bit-array instructions. – Barmar Jul 20 '21 at 07:40
  • As a bit-array or as a byte array, depending on whether the smaller cache footprint is worth a couple extra instructions to compute the byte or dword index to load into a register for a `bt reg,reg`. Byte arrays are even better for writing (unless that causes more cache misses) because you don't need to read the surrounding bools, just store a whole byte. – Peter Cordes Jul 20 '21 at 07:40
  • Ok then, I guess for now my solution is the normal way. – Alice F Jul 20 '21 at 07:43
  • 3
    @Barmar: x86 actually *does* have bit-array instructions (e.g. `bts [rdi + rcx], eax` indexes memory at `rdi + rcx + eax/8` and tests+sets the bit at eax%8. But they're slow so you normally want to avoid them, except in the reg,reg forms after doing the same indexing manually. https://www.felixcloutier.com/x86/bt – Peter Cordes Jul 20 '21 at 07:43
  • @PeterCordes I think if you actually do need the bitstring semantics, the latency isn't really that bad. I mean it's still slower than doing it by hand, but do an extra 4 or 5 cycles per access really matter if L2 cache latency is part of the equation? – fuz Jul 20 '21 at 08:12
  • @fuz: It's 10 uops vs. maybe 4 or 5 for doing it manually, so if OoO exec can hide your L2 hit latency then you're losing front-end throughput (and costing extra OoO window size) by using `bt [mem], reg` or one of the RMW versions. – Peter Cordes Jul 20 '21 at 08:20
  • 1
    @PeterCordes Indeed you do. But it's really not extremely bad and might well be worth the simplified code as a beginner. – fuz Jul 20 '21 at 08:31
  • Where the bitstring instructions truly shine is atomic accesses (`lock bts [bitstring],rcx`, etc). – Brendan Jul 20 '21 at 09:47

1 Answer


For sparse bit sets (all true or all false with a few exceptions), you could store a set of indices, using any set data structure, including a hash table. You can of course manually implement any algorithm in asm, just like you could in C. There are probably some more specialized data structures suited for various purposes / use-cases.


For a "normal" array of bools, your two main options are

  • Unpacked 1 bool per byte, with value 0 / 1 like C bool arr[size]
    (in .bss or dynamically allocated, wherever you want to put it, same as any byte array).

    Takes 8x the space (and thus cache footprint) of a packed bit array, but very easy to use. Especially efficient for random-access, especially writing because you can store a byte without disturbing its neighbours. (Don't have to read/modify/write the containing byte or dword).

    Besides the larger cache footprint leading to more cache misses (if the array plus the rest of your data doesn't fit in whatever level of cache), the lower density is also bad for searching, popcounting, copying, or setting/clearing a range of elements.

    Instead of 0 / 1, you could allow 0 / non-0 if that would save instructions in code that writes the array. But that could cost instructions when reading if you want to compare two elements, or count true values or whatever. Note that most C/C++ ABIs strictly use 0 / 1 bytes for bool, and passing a bool holding a 2 to a C function could make it crash.

  • Packed 1 bool per bit, like C++ std::vector<bool>. (Except you can of course store it anywhere you want, unlike std::vector that always dynamically allocates).

    Howard Hinnant's article On vector<bool> discusses some of the things a bit-array is good at (with an appropriately optimized implementation), e.g. searching for a true can check a whole chunk at a time: 64 bits at a time with qword loads, or 256 bits at a time with AVX vptest. (Then tzcnt or bsf once you find a non-zero chunk, more or less the same as with byte elements: Efficiently find least significant set bit in a large array?). That's 8x faster than with a byte array (even assuming equal cache hits), except for some extra work if vectorizing with SIMD, to find the bit-within-element after finding the right byte or dword within a vector. By contrast, with a byte array you'd just vpslld $7, %ymm0, %ymm0 and vpmovmskb %ymm0, %eax / bsf %eax,%eax to convert the bytes to a bitmap and search it. (A scalar search sketch follows this list.)
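
As a concrete illustration of the qword-at-a-time search just mentioned, here's a rough, unoptimized scalar sketch (my own example, not from the article; it assumes the bit-array is qword-aligned, with its base in RDI and its length in qwords in RSI):

# return the index of the first set bit in RAX, or -1 if none
first_set_bit:
   xor    %ecx, %ecx             # qword index
.Lscan:
   cmp    %rsi, %rcx
   jae    .Lnotfound             # ran off the end of the array
   mov    (%rdi,%rcx,8), %rax    # load a 64-bit chunk
   test   %rax, %rax
   jnz    .Lfound                # any bit set in this chunk?
   inc    %rcx
   jmp    .Lscan
.Lfound:
   tzcnt  %rax, %rax             # bit index within the qword (BMI1; bsf works on older CPUs)
   shl    $6, %rcx               # qword index * 64
   add    %rcx, %rax             # overall bit index
   ret
.Lnotfound:
   mov    $-1, %rax
   ret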


x86 bit-array aka bitstring instructions: slow with mem operands

x86 does have bit-array instructions like bt (Bit Test) and bts (Bit Test and Set), also Reset (clear) and Complement (flip), but they're slow with a memory destination and a register bit-index; it's actually faster to manually index the right byte or dword and load it, then use bts %reg,%reg and store the result. Using bts assembly instruction with gcc compiler

# fast version:
# set the bit at index n (RSI) in bit-array at RDI
   mov  %esi, %edx           # save the original low bits of the index
   shr  $5, %rsi             # dword index = bit-index / 8 / 4
   mov  (%rdi, %rsi, 4), %eax   # load the dword containing the bit
   bts  %edx, %eax           # eax |= 1 << (n&31)   BTS reg,reg masks the bit-index like shifts
   mov  %eax, (%rdi, %rsi, 4)   # and store it back

This effectively splits the bit-index into a dword-index and a bit-within-dword index. The dword index is calculated explicitly with a shift (and turned back into a byte offset to an aligned dword with the scaled-index addressing mode). The bit-within-dword index is calculated implicitly as part of how bts %reg,%reg masks the count. For example, with n = 75 the dword index is 75 >> 5 = 2 and the bit within that dword is 75 & 31 = 11.

(If your bit-array is definitely smaller than 2^32 bits (512 MiB), you can save a byte of code size by using shr $5, %esi, discarding the high 32 bits of the bit-index.)

This leaves a copy of the old bit in CF, in case you care. bts reg,reg is single-uop on Intel, 2 uops on AMD, so definitely worth it vs. manually doing mov $1, %reg / shl / or.

That's only 5 uops on modern Intel CPUs (https://uops.info/ and https://agner.org/optimize/), vs. 10 uops for bts %rsi, (%rdi) which does exactly the same thing (but without needing any tmp registers).
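
If you do care about that old bit left in CF, a minimal sketch of consuming it right after the bts %edx, %eax above (the label name is just for illustration):

   setc  %cl                  # materialize the old bit as 0 / 1 in CL
   # or branch on it instead of materializing it:
   # jc   bit_was_already_set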

You'll notice I used only dword chunks, not like in C where you'd often see code using unsigned long or uint64_t chunks so searching can go fast. But in asm there's zero problem using different-sized accesses to the same memory, except for store-forwarding stalls if you do a narrow store and then a wide load. Narrower RMW operations are actually better: operations on nearby bits are less likely to touch the same chunk, so they don't create a false dependency on each other. If that's a major concern, you can even use byte accesses, except that bts and friends only go down to 16-bit word operand-size, so you'd have to manually and $7, %reg to extract the bit-within-byte from the bit-index.

e.g. like so for reading:

# byte chunks take more work:
   mov  %esi, %edx      # save the original low bits
   shr  $3, %rsi        # byte index = bit-index / 8
   movzbl (%rdi, %rsi), %eax   # load the byte containing the bit
   and  $7, %edx
   bt   %edx, %eax      # CF = eax & 1 << (n&7)
   # mov  %al, (%rdi, %rsi)   if you used BTS, BTR, or BTC 

Byte loads are best done with movzx (aka AT&T movzbl) to avoid writing a partial register.

16-bit access width is the best tradeoff

bts works for 16, 32, and 64-bit operand-size, with 16-bit being the narrowest size where we can still get masking of the count for free to get bit-within-word. movzx loads on most CPUs are just as cheap as 32-bit loads.
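
For completeness, here's a rough AT&T-syntax version of the word-sized access without BMI2 (same assumptions as the dword version above: bit index in RSI, array base in RDI):

# set the bit at index n (RSI) in bit-array at RDI, using 16-bit chunks
   mov    %esi, %edx              # save the low bits of the bit-index
   shr    $4, %rsi                # word index = bit-index / 16
   movzwl (%rdi, %rsi, 2), %eax   # load the word containing the bit
   bts    %dx, %ax                # ax |= 1 << (n&15)   BTS reg,reg masks the count mod 16
   mov    %ax, (%rdi, %rsi, 2)    # store the word back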

BMI2 shrx allows a copy-and-shift, but needs the count in another register. In a use-case like a Sieve of Eratosthenes where you're doing a lot of bit accesses, it's worth using one instruction outside the outer loop to save an instruction in the inner loop. (This is orthogonal to whether you use 32-bit or 16-bit chunks.)

; NASM syntax
; ahead of a loop
    mov     r13d, 4

; inside a loop
; set the bit at index n (ESI) in bit-array at RDI
    shrx    eax, esi, r13d      ; eax = j >> 4 = index of aligned word containing the bit.  Byte offsets would be worse, store-forwarding stalls as well as misalignment
    movzx   edx, word [rdi+rax*2]
    bts     dx, si              ; BTS reg,reg is fast and does the mod 16 part for free.
    mov     [rdi+rax*2], dx

That's from a Sieve of Eratosthenes I was playing around with in NASM for x86-64 Linux (https://godbolt.org/z/vcz969bPa for a basic bit-array including even numbers, https://godbolt.org/z/Ee8j1hz1x for a bitmap of only the odd numbers, thus an extra right-shift count. Surprisingly not faster for many cases, perhaps only at cutoff points between cache sizes. Neither makes any effort to set multiple bits at a time with SIMD for small primes, or for better locality). I may eventually post an answer on this code-review Q&A where I commented:

Using word operand-size in my sieve gave a significant speedup, like 20% maybe for some smaller problem sizes IIRC, especially for the more densely packed bitmap (omitting even numbers) where setting every 3rd bit would mean a big chain of about 32/3 = 10.6 store/reloads for every dword before moving on to the next. Using narrower chunks allows about twice as much memory-level parallelism. (Although for larger strides, every access was to a different dword anyway, but with a sieve, the smaller the stride the more total accesses with that stride.)


Thread-safe atomics

If you needed to atomically set a bit (e.g. in a multi-threaded program), you could lock bts %reg, mem, or you could generate 1<<(n&31) in a register and lock or %reg, mem if you don't care what the old value is. lock bts is slow and microcoded anyway, so if you need atomicity you should probably just use it instead of trying to avoid the crazy-CISC bit-array semantics with lock or.
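
For reference, a rough AT&T-syntax sketch of both options (assuming the bit index in RSI and the array base in RDI, as above):

# atomically set bit n (RSI) in the bit-array at RDI
   lock bts %rsi, (%rdi)          # simplest: old bit -> CF, but slow microcoded bit-string addressing

# or, if you don't need the old value, generate the mask and lock-or it in:
   mov    %esi, %ecx              # shift count has to be in CL
   mov    $1, %eax
   shl    %cl, %eax               # eax = 1 << (n & 31); SHL masks the count mod 32
   shr    $5, %rsi                # dword index = bit-index / 32
   lock or %eax, (%rdi, %rsi, 4)  # atomic RMW on the containing dword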

In the multi-threaded case, there's even more reason to consider using 1 bool per byte so you can use just plain movb $1, (%rdi, %rsi) (which is guaranteed atomic and doesn't disturb its neighbours: Can modern x86 hardware not store a single byte to memory?), rather than an atomic RMW.
