I am having trouble converting a Huffman-encoded string to binary in Python.
This question does not involve the Huffman algorithm itself.
It is like this:
I can get an encoded Huffman string, say `01010101010`. Note that it is a string.
But now I want to save the string representation as real binary data.
In the Huffman-encoded string, every 0 and 1 is a byte.
What I want is for every 0 and 1 to be a bit.
How can I do that in Python?
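To make the goal concrete, here is a minimal sketch of packing eight '0'/'1' characters into each byte. It is written in Python 3 syntax, and `pack_bits` is a hypothetical helper name, not a library function. The last byte is padded with zero bits, so the padding count would have to be stored somewhere to undo it later.

```python
def pack_bits(bits):
    """Pack a string of '0'/'1' characters into bytes, 8 bits per byte.

    Returns the packed bytes and the number of zero bits appended
    to fill the final byte (hypothetical helper for illustration).
    """
    padding = (8 - len(bits) % 8) % 8
    padded = bits + "0" * padding
    data = bytearray()
    for i in range(0, len(padded), 8):
        # Each 8-character slice becomes one byte value (0..255).
        data.append(int(padded[i:i + 8], 2))
    return bytes(data), padding


packed, padding = pack_bits("01010101010")
print(len(packed), padding)
```

With this layout the 11-character string takes 2 bytes instead of 11.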
Edit 1:
Please forgive me for not describing my problem clearly enough.
Let me explain my current approach to writing the zeros and ones as binary.
Say we have a code string `s='010101010'`.
- I use `int` to convert it to an integer.
- Then I use `unichr` to convert it to a string so that I can write it to a file.
- I write the string to the file in binary mode.
Note also that I need to read the file back in order to decode the Huffman code.
So my approach is:
- read the bytes from the file
- restore them to ints
- convert each int to its binary string representation
- decode the string
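One detail of the third step is worth spelling out: converting an int back to a binary string drops any leading zeros unless the original length is known. A small sketch in Python 3 syntax (the variable names are only for illustration):

```python
s = "010101"
n = int(s, 2)  # 21

# Plain binary formatting loses the leading zero.
restored = format(n, "b")

# Zero-padding to the original length recovers the exact string,
# but only if that length was stored somewhere.
fixed = format(n, "0{}b".format(len(s)))

print(restored, fixed)
```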
The problem arises at the second step, and I am stuck.
Some Huffman strings can be short (like `10`), while others can be long (like `010101010101001`). As a result, their int values differ in byte length: a short string may take just one byte, while a long one can take two or even more.
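The difference in byte length can be checked directly. This is a sketch in Python 3 syntax; `byte_length` is a made-up helper that counts how many whole bytes the int value of a code string needs.

```python
def byte_length(bits):
    """Number of bytes needed to hold int(bits, 2) (illustrative helper)."""
    n = int(bits, 2)
    # bit_length() is 0 for n == 0, so reserve at least one byte.
    return max(1, (n.bit_length() + 7) // 8)


for bits in ["10", "010101", "10010101010", "010101010101001"]:
    print(bits, int(bits, 2), byte_length(bits))
```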
The following code illustrates my problem:
    ss=['010101','10010101010']
    # first one is short and takes only one byte in its int value
    # second one is long and takes two bytes
    print 'write it to file'
    with open('binary.bin','wb') as f:
        for s in ss:
            n=int(s,2)
            print n
            s=unichr(n)
            f.write(s)
    print 'read it to file'
    with open('binary.bin','rb') as f:
        for s in f.read():
            print ord(s)
I am reading one byte at a time in the second `with` block, but this is actually not correct, because the string `10010101010` takes up two bytes.
So, when I read those bytes from the file, how many bytes should I read at once?
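One possible way to make the read side unambiguous is to store the bit count in front of each code, so the reader knows how many bytes follow. The sketch below is only an illustration under assumptions: Python 3 syntax, a 2-byte big-endian length prefix (so each code must be non-empty and shorter than 65536 bits), hypothetical helper names `write_code` and `read_codes`, and an in-memory `io.BytesIO` buffer standing in for the file opened in binary mode.

```python
import io
import struct


def write_code(f, bits):
    """Write one code string as: 2-byte bit count, then the packed value."""
    nbits = len(bits)
    nbytes = (nbits + 7) // 8
    f.write(struct.pack(">H", nbits))
    f.write(int(bits, 2).to_bytes(nbytes, "big"))


def read_codes(f):
    """Read records until the stream is exhausted; return the bit strings."""
    codes = []
    while True:
        header = f.read(2)
        if len(header) < 2:
            break
        (nbits,) = struct.unpack(">H", header)
        nbytes = (nbits + 7) // 8  # the prefix tells us how many bytes to read
        n = int.from_bytes(f.read(nbytes), "big")
        # Zero-pad to nbits so leading zeros are restored.
        codes.append(format(n, "0{}b".format(nbits)))
    return codes


ss = ["010101", "10010101010"]
buf = io.BytesIO()
for s in ss:
    write_code(buf, s)

buf.seek(0)
decoded = read_codes(buf)
print(decoded)
```

The length prefix costs two extra bytes per code, so for many short codes it may be better to concatenate all bits into one stream and store a single length or padding count instead.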