1

I'd like to use a compact & fast large bitfield in Python, ideally with no or few dependencies other than numpy. The operations I'd like would be roughly equivalent to:

bits = new_bitfield(3000000)  # 3 million bits
bits.set_bit(n, 1)
bits.set_bit(n, 0)
bits.get_bit(n)

I'd like the underlying storage for bits to be very condensed/low-overhead. (Ideally, that bits object will only take a hairs'-breadth over 366.21 Kibibytes.)

I'd like the gets and sets to be very fast, with a minimum of type-checking/type-coercing overhead - perhaps even especially fast when using Cython- or Numba- specific code (or their respective inlining options).

What's the best way to approach C speed/compactness, while retaining as Pythonic of an outward appearance as is possible?

gojomo
  • 52,260
  • 14
  • 86
  • 115
  • 1
    "What's the best way to approach C speed/compactness, while retaining as Pythonic of an outward appearance as is possible?" Write a C-extension – juanpa.arrivillaga Jul 08 '20 at 23:30
  • 1
    It depends what you want to do with them. eg., implementing it in Numba should be quite easy (just syntax changes to that https://stackoverflow.com/a/30590727/4045774). But this only really makes sense if you call this from a function which is also compiled. – max9111 Jul 09 '20 at 09:43

1 Answers1

1

It should not be too hard to roll it out by yourself, another option is to use existing implementations. For better or worse, std::vector<bool> is exactly what you want: it uses exactly 1 bit per value (thus bool-template parameter is somewhat misleading, as a bool is at least 1 byte long).

Using Cython it could look something like this (it is compiled as a c++-extension):

%%cython -+
from libcpp.vector cimport vector
from libcpp cimport bool

cdef class Bitset:
    cdef vector[bool] bset;
    
    def __cinit__(self, size_t size):
        self.bset.resize(size, False);
        
    cpdef void set_bit(self, size_t pos, bint val) except *:        
        # self.bset[pos] = val would not check out of range
        # self.bset.at(pos) = val doesn't work with Cython
        if pos < self.bset.size():
            self.bset[pos] = val;
        else:
            raise IndexError("out of range access")    
            
    cpdef bint get_bit(self, size_t pos):
        return self.bset.at(pos)

Which can be used as

mybitset = Bitset(10)
mybitset.set_bit(2, True)
mybitset.get_bit(1), mybitset.get_bit(2) #returns (False, True)

Also

mybitset.set_bit(11, True) #throws
mybitset.get_bit(12, True) #throws

throw and don't end in undefined behavior.

Obviously, to keep the overhead at minimum, the code which uses Bitset-cdef-class should be written in Cython as well thus, cdef-part of the interface can be used, without the need of conversion to Python-object, which is needed for the Python-part of the interface (which is demonstrated above).

ead
  • 32,758
  • 6
  • 90
  • 153