
I am trying to convert a bit string into a byte string, in Python 3.x. In each byte, bits are filled from high order to low order. The last byte is filled with zeros if necessary. The bit string is initially stored as a "collection" of booleans or integers (0 or 1), and I want to return a "collection" of integers in the range 0-255. By collection, I mean a list or a similar object, but not a character string: for example, the function below returns a generator.

So far, the fastest I am able to get is the following:

def bitsToBytes(a):
    s = i = 0
    for x in a:
        s += s + x
        i += 1
        if i == 8:
            yield s
            s = i = 0
    if i > 0:
        yield s << (8 - i)
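
For example, materialising the generator on one of the test cases given further below:

>>> list(bitsToBytes([1, 0, 0, 0, 0, 0, 0, 0, 1]))
[128, 128]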

I have tried several alternatives: using enumerate, building a list instead of a generator, and computing s with "(s << 1) | x" instead of the sum; everything seems to be a bit slower. Since this solution is also one of the shortest and simplest I have found, I am rather happy with it.

However, I would like to know if there is a faster solution. In particular, is there a library routine that would do the job much faster, preferably in the standard library?


Example input/output

[] -> []
[1] -> [128]
[1,1] -> [192]
[1,0,0,0,0,0,0,0,1] -> [128,128]

Here I show the examples with lists. Generators would be fine. However, strings would not, because it would then be necessary to convert back and forth between list-like data and strings.

  • @AdamSmith You are right, I really mean a list, though I think the usual wording is "bit string" (maybe I'm wrong!). I have updated the question to clarify this. –  Feb 06 '15 at 17:18
  • Why would [1,1] output [192] rather than [3], and [1, 0, 0, 0, 0, 0, 0, 0, 1] output [128, 128] rather than [1, 1]? – CodeMonkey Feb 06 '15 at 17:22
  • @CodeMonkey I fill bytes from high order bit to low order bit, with padding zeros added to the last byte (thus added in lower order bits). Hence, a 1 alone is understood as byte 0b10000000, two successive 1s as byte 0b11000000, and [1,0,0,0,0,0,0,0,1] is understood as bytes [0b10000000, 0b1000000]. By the way, it's the same as bit order in [PBM images](http://netpbm.sourceforge.net/doc/pbm.html). –  Feb 06 '15 at 17:26
  • Why is a bytestring not an acceptable result? Enumerating a bytestring such as `b'\x80\x80'` yields integers in Python 3, i.e., a `bytes` object is a sequence of bytes (Python `int`s in `range(0x100)`). – jfs Feb 06 '15 at 18:14
  • For the best performance, you could write a C extension in Cython: `a2b_bin(): b'1' -> b'\x80'` similar to how [`b2a_bin(): b'\x80' -> b'10000000'` is written](http://stackoverflow.com/questions/7396849/convert-binary-to-ascii-and-vice-versa-python#comment29803972_7396849) – jfs Feb 06 '15 at 18:28

3 Answers


The simplest tactic is to consume the bits in chunks of 8, catching the StopIteration to pad the last byte with zeros:

def getbytes(bits):
    done = False
    while not done:
        byte = 0
        got_bit = False              # did this chunk consume any real bits?
        for _ in range(8):
            try:
                bit = next(bits)
                got_bit = True
            except StopIteration:
                bit = 0              # pad the last byte with zeros
                done = True
            byte = (byte << 1) | bit
        if got_bit:                  # skip the spurious all-zero byte at the end
            yield byte

Usage:

lst = [1,0,0,0,0,0,0,0,1]
for b in getbytes(iter(lst)):
    print(b)

getbytes is a generator and accepts an iterator of bits, so it works fine with large and potentially infinite streams.
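
For instance, materialising the output for the question's last example:

>>> list(getbytes(iter([1, 0, 0, 0, 0, 0, 0, 0, 1])))
[128, 128]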

georg

Step 1: Pad with zeros at the end so the length is a multiple of 8

Step 2: Concatenate into a single string

Step 3: Slice off 8 bits at a time and parse each slice as an int into an array

Step 4: ???

Step 5: Profit

def bitsToBytes(a):
    a = a + [0] * (-len(a) % 8)  # pad with zeros at the end to reach a multiple of 8 bits
    s = ''.join(str(x) for x in a)  # concatenate all bits into one string
    returnInts = []
    for i in range(0, len(s), 8):
        returnInts.append(int(s[i:i+8], 2))  # parse 8 bits at a time as an int
    return returnInts
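
A quick check against the examples from the question:

>>> bitsToBytes([1, 0, 0, 0, 0, 0, 0, 0, 1])
[128, 128]
>>> bitsToBytes([])
[]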
CodeMonkey

Using the `grouper()` recipe from the itertools documentation:

from functools import reduce
from itertools import zip_longest

def grouper(iterable, n, fillvalue=None):
    "Collect data into fixed-length chunks or blocks"
    # grouper('ABCDEFG', 3, 'x') --> ABC DEF Gxx
    args = [iter(iterable)] * n
    return zip_longest(*args, fillvalue=fillvalue)

bytes = [reduce(lambda byte, bit: byte << 1 | bit, eight_bits)
         for eight_bits in grouper(bits, 8, fillvalue=0)]

Example

[] -> []
[1] -> [128]
[1, 1] -> [192]
[1, 0, 0, 0, 0, 0, 0, 0, 1] -> [128, 128]
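
For instance, assuming `bits` holds the last example as a list:

>>> bits = [1, 0, 0, 0, 0, 0, 0, 0, 1]
>>> [reduce(lambda byte, bit: byte << 1 | bit, eight_bits)
...  for eight_bits in grouper(bits, 8, fillvalue=0)]
[128, 128]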

If the input is a string, then a specialized solution might be faster:

>>> bits = '100000001'
>>> padded_bits = bits + '0' * (8 - len(bits) % 8)
>>> padded_bits
'1000000010000000'
>>> list(int(padded_bits, 2).to_bytes(len(padded_bits) // 8, 'big'))
[128, 128]

The last byte is zero if len(bits) % 8 == 0.
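
One way to avoid that extra all-zero byte would be to pad with `-len(bits) % 8` zeros instead, which is 0 when the length is already a multiple of 8:

>>> bits = '10000000'
>>> padded_bits = bits + '0' * (-len(bits) % 8)
>>> list(int(padded_bits, 2).to_bytes(len(padded_bits) // 8, 'big'))
[128]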

jfs
  • Thanks for this very interesting "grouper", and for int.to_bytes, which I didn't even know about! Actually, your first solution is a bit slower here, but your second is more than 30% faster than mine, with a one-liner. And you are right in your comment above: it's fine to return a byte string; they are very easy to use later for file I/O. Using a list as input, I do this: `int("".join("01"[x] for x in data) + "0"*k, 2).to_bytes(n, "big")`, where k and n are computed as in your example. –  Feb 06 '15 at 19:44
  • And even faster with `int("".join(map("01".__getitem__, data)) + "0"*k, 2).to_bytes(n, "big")` –  Feb 06 '15 at 19:48
  • @Jean-ClaudeArbaut: You could post an answer with time comparisons and the `''.join`-based solution. Note: the `grouper()`-based solution may convert an infinite bit stream into an infinite byte stream (just replace `[]` with `()` to turn the list comprehension into a generator expression) -- its purpose is to accept arbitrary *"collections"* of bits. Different solutions may be faster for different types of input (bit sequence, iterable, bytestring) and input sizes (small, large). – jfs Feb 06 '15 at 19:52
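
For completeness, here is a sketch of the `''.join`-based one-liner from the comments above, wrapped in a function. The name `bits_to_bytes` and the exact formulas for `k` and `n` are assumptions filled in for illustration, not taken verbatim from the comments:

def bits_to_bytes(data):
    # hypothetical wrapper around the one-liner from the comments above
    k = -len(data) % 8        # number of padding zeros (assumed formula)
    n = (len(data) + k) // 8  # number of output bytes (assumed formula)
    if n == 0:
        return b''            # int('', 2) would raise on empty input
    return int("".join(map("01".__getitem__, data)) + "0" * k, 2).to_bytes(n, "big")

>>> list(bits_to_bytes([1, 0, 0, 0, 0, 0, 0, 0, 1]))
[128, 128]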