1

I am having problem with how to convert huffman encoding string to binary python.

This question involves nothing of the huffman algorithm.

It is like this:

I can get an encoded huffman string, say 01010101010. Note, it is a string.

But now I want to save the string representation into real binary.

In the huffman encoded string, every 0 and 1 is a byte.

What I want is every 0 and 1 is a bit.

How can I do that in python?

Edit 1:

Please forgive I did not describe my problem clear enough.

Let me explain my current approach of writing to zeros and ones to binary.

Say, we can a code string s='010101010'.

  1. I use int to convert it to integer
  2. Then use unichr to convert it to string so that I can write it to file
  3. write the string to file in binary mode

Also to be noted, I need to read the file in order to decode the huffman code.

So my approach is,

  1. read the bytes from file
  2. restore them to int
  3. convert the int to their binary representation string.
  4. decode the string

And at step 2, the problem happens and I became clueless.

As some huffman string can be short(like, 10), while some can be long(010101010101001). This results in their different byte length in their int value( some short string may take just one byte,while long ones can take two or even more )

The following code illustrates my problem:

ss=['010101','10010101010'] 
# first one is short and takes only one byte in its int value
# second one is long and takes two bytes   

print 'write it to file'
with open('binary.bin','wb') as f:
    for s in ss:
        n=int(s,2)
        print n
        s=unichr(n)
        f.write(s)

print 'read it to file'
with open('binary.bin','rb') as f:
    for s in f.read(): 
        print ord(s)

I am reading one byte a time in the second with part, but this is actually not correct. Because string 10010101010 takes up two bytes.

So, when I read those bytes from the file, How many bytes should I read at once?

xiaohan2012
  • 9,870
  • 23
  • 67
  • 101
  • There is an early version of a Huffman coder I threw together at http://pastebin.com/QEK3WdbE which shows how I do it. – agf Aug 29 '11 at 02:51
  • 3
    http://packages.python.org/bitstring/functions.html#bitstring.pack – Marc B Aug 29 '11 at 02:51
  • What have you tried? This should be easy compared to the work you've already done... what isn't working? – Karl Knechtel Aug 29 '11 at 02:52
  • This is a duplicate of [convert binary representation string to binary in python](http://stackoverflow.com/questions/7213996/convert-binary-representation-string-to-binary-in-python) which you posted yesterday, where you accepted an answer. Why would you post it again? – agf Aug 29 '11 at 05:58
  • Please forgive, I did not articulate this one clearly. It is different from the previous post, actually. Please see the edited version of my question.@Karl Knechtel – xiaohan2012 Aug 29 '11 at 10:33

3 Answers3

2

There are two different "binary" representations in Python that you might want to use.

Bignum

One is a "bignum" or arbitrary-precision integer. This type is called long in Python 2.x and int in Python 3.x. As the name suggests, this representation is semantically an integer of arbitrary length, so it's useful if you plan to do arithmetic on the resulting number. To parse a string of binary digits, use

# Python 2
long(digit_str, 2)

or

# Python 3
int(digit_str, 2)

bitstring library

Alternatively, as Marc B suggested in the comments, use the bitstring library. Specifically, for conversion, use the bitstring.pack function.

For Huffman coding, using bitstring is probably preferable to storing data in a byte-string, since Huffman codes are generally not multiples of 8 bits; bitstring lets you manipulate bit-strings of arbitrary lengths. Disadvantage: bitstring is not part of the standard library.

Mechanical snail
  • 29,755
  • 14
  • 88
  • 113
2

One possible approach(using bitstring library) which makes some sense, but still contain incorrectness:

Use bitstring library(Thanks to Mechanical snail and Marc B )

For writing to file.

Steps:

  1. encode the plain text to binary representation string
  2. concatenate all those strings to form one longer one
  3. use bitstring.BitArray to convert to hex format
  4. write the hex string to a file

For reading:

  1. read the hex string from file
  2. convert it back to bit string using BitArray
  3. start decoding

Code:

ss=['01010100','10010101010','010101110101010101'] #encoded message


from bitstring import BitArray,BitStream
print 'write it to file'
with open('binary.bin','wb') as f:
    s=''.join(ss);
    b=BitArray(bin=s)                 
    f.write(b.tobytes())# thanks to Scott, tobytes() method is very useful

print 'read it to file'
b=BitArray(filename='binary.bin')
print b.bin
xiaohan2012
  • 9,870
  • 23
  • 67
  • 101
  • 1
    The problem you have here is that you are writing a hex *string* to your file, not binary data. That is you're writing the 12 byte string `0x0a92a95d55` to file instead of the 5 bytes `\n\x92\xa9]U`. You might as well just write `01010100...` to the file! I would advise using the `b.tobytes()` method to convert to bytes (which pads with zero bits up to the byte boundary). These extra bits at the end could presumably be ignored by the decoder as they won't decode to any symbol. It's also important to pad at the end rather than the beginning. – Scott Griffiths Aug 30 '11 at 15:02
  • 1
    So to write just use `f.write(BitArray(bin=''.join(ss)).tobytes())` and to read back you can either use `BitArray(filename='binary.bin')` or construct it from a read-only file object `BitArray(f)`. – Scott Griffiths Aug 30 '11 at 15:11
  • Excellent. Really thanks, @Scott Griffiths I think it solved my problem perfectly! – xiaohan2012 Aug 30 '11 at 23:23
  • @Scott Griffiths, the appending zeros actually caused some problems. It can be decoded to some symbol. At least in my case, it is so. – xiaohan2012 Aug 30 '11 at 23:50
  • 太感谢了,正在找这个。事实就是python没办法用内置的函数很好解决这个问题,居然有人觉得这问题很简单实在是无语 – laike9m Nov 21 '13 at 07:10
  • @ScottGriffiths I was having the same exact problem. It looks like your comment and package is the solution. Thank you. – Moondra Mar 15 '18 at 21:41
1

You have a string that you need to convert into a number. int takes an optional 'base' as an argument. So for the string in your example,

>>> int('01010101010', 2)
682

Once you have a number (not a string), it doesn't make sense to want "real" binary, since the number is the same, you can display it in any base. That means the binary 100 is the same number as the decimal 4, inside your program they're not different numbers. So, once you turn your string into a number you can fiddle with the bits in it.

Roshan Mathews
  • 5,788
  • 2
  • 26
  • 36
  • 2
    `'01'` and `'1'` are different codes. You need to encode the number of bits as well as the bit value. – agf Aug 29 '11 at 05:56
  • good point, I took "This question involves nothing of the huffman algorithm" at face value, also didn't notice the duplicate question from yesterday. – Roshan Mathews Aug 29 '11 at 06:45