Save list of numbers to (binary) file with defined bits per number

Question

I have a list/array of numbers that I want to save to a binary file. The crucial part is that the numbers should not be saved as a predefined data type. The number of bits per value is constant for all values in the list, but it does not correspond to a typical data type (e.g. byte or int).

import numpy as np

# create 10 random numbers in range 0-63
values = np.int32(np.round(np.random.random(10)*63))

# each value requires exactly 6 bits
# how to save this to a file?

# just for debug/information: bit string representation
bitstring = "".join(bin(x)[2:].zfill(6) for x in values)
print(bitstring)

In the real project, there are more than a million values I want to store with a given bit depth. I already tried the bitstring module, but appending each value to a BitArray takes a lot of time...

The `PackedIntArray` class in [this answer](https://stackoverflow.com/a/29907689/355230) of mine might be useful (although it's implemented using `BitArray`). Regardless, there's a fair amount of processing involved in doing what you want, especially for a long list of numbers, and there's no getting around it. `BitArray` is written in C, so it should be close to as fast as you're going to get without writing your own custom C extension. — martineau, Jul 27 '17 at 16:10

martineau · Accepted Answer · 2017-07-28T11:51:47.043

There may be some numpy-specific way that makes things easier, but here's a pure Python (2.x) way to do it. It first converts the list of values into a single integer, since Python supports int values of any length. Next it converts that int value into a string of bytes and writes it to the file.

Note: If you're sure all the values will fit within the bit-width specified, the array_to_int() function could be sped up slightly by changing the (value & mask) it's using to just value.

import random

def array_to_int(values, bitwidth):
    mask = 2**bitwidth - 1
    shift = bitwidth * (len(values)-1)
    integer = 0
    for value in values:
        integer |= (value & mask) << shift
        shift -= bitwidth
    return integer

# In Python 2.7 int and long don't have the "to_bytes" method found in Python 3.x,
# so here's one way to do the same thing.
def to_bytes(n, length):
    return ('%%0%dx' % (length << 1) % n).decode('hex')[-length:]

BITWIDTH = 6
#values = [random.randint(0, 2**BITWIDTH - 1) for _ in range(10)]
values = [0b000001 for _ in range(10)]  # create fixed pattern for debugging
values[9] = 0b011111  # make last one different so it can be spotted

# just for debug/information: bit string representation
bitstring = "".join(bin(x)[2:].zfill(BITWIDTH) for x in values)
print(bitstring)

bigint = array_to_int(values, BITWIDTH)
width = BITWIDTH * len(values)
print('{:0{width}b}'.format(bigint, width=width))  # show integer's value in binary

num_bytes = (width + 7) // 8  # round up to a whole number of 8-bit bytes
with open('data.bin', 'wb') as file:
    file.write(to_bytes(bigint, num_bytes))
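
On Python 3 the hand-written to_bytes() helper should not be needed, because int provides to_bytes() and from_bytes() itself. The following is a minimal sketch of the same big-integer idea together with a matching decoder. The names pack_values and unpack_values are made up for this illustration and are not part of the answer above. As in the original, the data bits are right-aligned, so any padding zeros sit at the front of the first byte, and the number of values has to be known when reading back.

```python
def pack_values(values, bitwidth):
    """Pack non-negative ints into bytes, bitwidth bits each, first value first."""
    mask = (1 << bitwidth) - 1
    integer = 0
    for value in values:
        # Make room for the next value and append it at the low end.
        integer = (integer << bitwidth) | (value & mask)
    num_bytes = (bitwidth * len(values) + 7) // 8  # round up to whole bytes
    return integer.to_bytes(num_bytes, 'big')


def unpack_values(data, bitwidth, count):
    """Inverse of pack_values(); count is the number of stored values."""
    mask = (1 << bitwidth) - 1
    integer = int.from_bytes(data, 'big')
    # The first value occupies the highest bits, the last value the lowest.
    return [(integer >> (bitwidth * (count - 1 - i))) & mask
            for i in range(count)]


values = [0b000001] * 9 + [0b011111]
packed = pack_values(values, 6)
restored = unpack_values(packed, 6, len(values))
```

The packed bytes can be written with file.write(packed) exactly as in the code above. This sketch has not been benchmarked for lists with millions of entries.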

I did not know that python int values can have any length. Saving the whole array in a single int is a really clever way. Thank you :-) — Alexander, Jul 28 '17 at 07:58
Alexander: Yes, in Python `int`s are variable length, like strings, and that feature can sometimes be useful in totally unexpected ways. Think of it generally as a way to allow a bit sequence of any length to be treated as a single number or quantity—which will put you in a position to wield yet another of the language's many secret sauces. `;-)` — martineau, Jul 28 '17 at 11:41
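
As for the numpy-specific route the answer alludes to, one possibility is to expand every value into its individual bits and let np.packbits() do the byte packing. This is only a sketch, assuming non-negative values that fit into bitwidth bits (with bitwidth at most 32); pack_bits and unpack_bits are hypothetical helper names. Unlike the big-integer version, the padding zeros end up at the end of the last byte rather than at the front, so the two file layouts are not interchangeable.

```python
import numpy as np


def pack_bits(values, bitwidth):
    """Pack an array of non-negative ints into bytes, bitwidth bits per value."""
    values = np.asarray(values, dtype=np.uint32)
    # shifts = [bitwidth-1, ..., 1, 0], so each row holds one value, MSB first.
    shifts = np.arange(bitwidth - 1, -1, -1, dtype=np.uint32)
    bits = ((values[:, None] >> shifts) & 1).astype(np.uint8)
    # packbits flattens the array and pads the final byte with zero bits.
    return np.packbits(bits).tobytes()


def unpack_bits(data, bitwidth, count):
    """Inverse of pack_bits(); count is the number of stored values."""
    bits = np.unpackbits(np.frombuffer(data, dtype=np.uint8))
    bits = bits[:count * bitwidth].reshape(count, bitwidth).astype(np.uint32)
    shifts = np.arange(bitwidth - 1, -1, -1, dtype=np.uint32)
    # Move every bit back to its position and add the rows up.
    return (bits << shifts).sum(axis=1)


values = np.array([1] * 9 + [31])
packed = pack_bits(values, 6)
restored = unpack_bits(packed, 6, len(values))
```

The temporary bit array needs bitwidth entries per value, which should still be manageable for a few million values.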

score 0 · Answer 2 · answered Jul 27 '17 at 16:09

Since you give an example with a string, I'll assume that's how you get the results. This means performance is probably never going to be great. If you can, try creating bytes directly instead of via a string.

Side note: I'm using Python 3 which might require you to make some changes for Python 2. I think this code should work directly in Python 2, but there are some changes around bytearrays and strings between 2 and 3, so make sure to check.

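byt = bytearray((len(bitstring) + 7) // 8)  # round up to whole bytes
for i, b in enumerate(bitstring):
    # character i of the string becomes bit i % 8 (least significant first) of byte i // 8
    byt[i // 8] |= (b == '1') << (i % 8)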

and for getting the bits back:

bitret = ''
for b in byt:
    for i in range(8):
        bitret += str((b >> i) & 1)

For millions of bits/bytes you'll want to convert this to a streaming method instead, as you'd need a lot of memory otherwise.
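
One way such a streaming variant could look is a generator that keeps only a small bit accumulator and hands out the packed bytes in chunks. This is a sketch under a few assumptions: it works on the integer values directly instead of on a bit string, it stores bits most significant first (the loop above stores them least significant first), and the name iter_packed_bytes and the chunk_size parameter are invented for the example.

```python
def iter_packed_bytes(values, bitwidth, chunk_size=4096):
    """Yield bytes chunks containing the values packed bitwidth bits each."""
    mask = (1 << bitwidth) - 1
    acc = 0     # bit accumulator, holds fewer than bitwidth + 8 bits
    nbits = 0   # number of valid bits currently in acc
    buf = bytearray()
    for value in values:
        acc = (acc << bitwidth) | (value & mask)
        nbits += bitwidth
        while nbits >= 8:
            # Emit the oldest 8 bits as one byte.
            nbits -= 8
            buf.append((acc >> nbits) & 0xFF)
        acc &= (1 << nbits) - 1  # keep only the bits not yet emitted
        if len(buf) >= chunk_size:
            yield bytes(buf)
            buf = bytearray()
    if nbits:
        # Pad the last partial byte with zero bits on the right.
        buf.append((acc << (8 - nbits)) & 0xFF)
    if buf:
        yield bytes(buf)


values = [0b000001] * 9 + [0b011111]
packed = b"".join(iter_packed_bytes(values, 6))
```

When writing to disk, each chunk can be passed to file.write() as it is produced, so the values can come from any iterable and the full byte string never has to exist in memory.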
