0

I'm writing a hash function to create hashes of some given size (e.g. 20 bits).

I have learnt how to write the hashes to files in a binary form (see my related question here), but now I would like to handle these hashes in Python (2.7) using the minimum memory allocation. Right now they are typed as int, so they are allocated 24 bytes each, which is huge for a 20 bits object.

How can I create a custom Python object of arbitrary size (e.g. in my case 3 bytes)?

martineau
  • 119,623
  • 25
  • 170
  • 301
jul
  • 36,404
  • 64
  • 191
  • 318

1 Answers1

3

You could do something like you want by packing the bits for each object into a packed array of bit (or boolean) values. There are a number of existing Python bitarray extension modules available. Implementing a higher level "array of fixed bit width integer values" with one is a relatively straight-forward process.

Here's an example based on one in pypi that's implemented in C for speed. You can also download an unofficial pre-built Windows version of it, created by Christoph Gohlke, from here.

Updated — Now works in Python 2.7 & 3.x.

from __future__ import print_function
# uses https://pypi.python.org/pypi/bitarray
from bitarray import bitarray as BitArray
try:
    from functools import reduce  # Python 3.
except:
    pass


class PackedIntArray(object):
    """ Packed array of unsigned fixed-bit-width integer values. """
    def __init__(self, array_size, item_bit_width, initializer=None):
        self.array_size = array_size
        self.item_bit_width = item_bit_width
        self.bitarray = BitArray(array_size * item_bit_width)
        if initializer is not None:
            try:
                iter(initializer)
            except TypeError:  # not iterable
                self.bitarray.setall(initializer) # set all to bool(initializer)
            else:
                for i in xrange(array_size):
                    self[i] = initializer[i]  # must be same length as array

    def __getitem__(self, index):
        offset = index * self.item_bit_width
        bits = self.bitarray[offset: offset+self.item_bit_width]
        return reduce(lambda x, y: (x << 1) | y, bits, 0)

    def __setitem__(self, index, value):
        bits = BitArray('{:0{}b}'.format(value, self.item_bit_width))
        offset = index * self.item_bit_width
        self.bitarray[offset: offset+self.item_bit_width] = bits

    def __len__(self):
        """ Return the number of items stored in the packed array.. """
        return self.array_size

    def length(self):
        """ Return the number of bits stored in the bitarray.. """
        return self.bitarray.length()

    def __repr__(self):
        return('PackedIntArray({}, {}, ('.format(self.array_size,
                                                    self.item_bit_width) +
               ', '.join((str(self[i]) for i in xrange(self.array_size))) +
               '))')


if __name__ == '__main__':
    from random import randrange

    # hash function configuration
    BW = 8, 8, 4  # bit widths of each integer
    HW = sum(BW)  # total hash bit width

    def myhash(a, b, c):
        return (((((a & (2**BW[0]-1)) << BW[1]) |
                    b & (2**BW[1]-1)) << BW[2]) |
                    c & (2**BW[2]-1))

    hashes = PackedIntArray(3, HW)

    print('hash bit width: {}'.format(HW))
    print('length of hashes array: {:,} bits'.format(hashes.length()))
    print()
    print('populate hashes array:')
    for i in range(len(hashes)):
        hashed = myhash(*(randrange(2**bit_width) for bit_width in BW))
        print('  hashes[{}] <- {:,} (0b{:0{}b})'.format(i, hashed, hashed, HW))
        hashes[i] = hashed
    print()
    print('contents of hashes array:')
    for i in range(len(hashes)):
        print(('  hashes[{}]: {:,} '
                '(0b{:0{}b})'.format(i, hashes[i], hashes[i], HW)))

Sample output:

hash bit width: 20
length of hashes array: 60 bits

populate hashes array:
  hashes[0] <- 297,035 (0b01001000100001001011)
  hashes[1] <- 749,558 (0b10110110111111110110)
  hashes[2] <- 690,468 (0b10101000100100100100)

contents of hashes array:
  hashes[0]: 297,035 (0b01001000100001001011)
  hashes[1]: 749,558 (0b10110110111111110110)
  hashes[2]: 690,468 (0b10101000100100100100)

Note: bitarray.bitarray objects also have methods to write and read their bits to and from files. These could be used to also provide similar functionality to the PackedIntArray class above.

martineau
  • 119,623
  • 25
  • 170
  • 301
  • Great, thank you. One detail: I guess the last argument of the myhash call should be randrange(16). – jul Apr 29 '15 at 05:59
  • Technically, yes, it should have been `randrange(16)`, but the third value passed to `myhash()` is masked to 4-bits, so it didn't cause a noticeable problem. Fixed it anyway and optimized `myhash()` function, as well. – martineau Apr 29 '15 at 09:00