I have the following string that I would like to Huffman-encode and store efficiently into a bit array:
>>> print sequence
GTCAGGACAAGAAAGACAANTCCAATTNACATTATG|
The frequencies of the symbols in sequence are:
>>> print freqTuples
[(0.40540540540540543, 'A'), (0.1891891891891892, 'T'), (0.16216216216216217, 'C'), (0.16216216216216217, 'G'), (0.05405405405405406, 'N'), (0.02702702702702703, '|')]
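For context, the frequency list above can be reproduced with a short helper (a minimal sketch; symbol_frequencies is my own naming, not part of any library):

from collections import Counter

def symbol_frequencies(seq):
    # Count each symbol, then normalize by the total length.
    counts = Counter(seq)
    total = float(len(seq))
    # Sort by descending frequency, matching the listing above.
    return sorted(((n / total, sym) for sym, n in counts.items()), reverse=True)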
I translate this into a Huffman code dictionary:
>>> print codeDict
{'A': '1', 'C': '010', 'G': '001', 'N': '0110', 'T': '000', '|': '0111'}
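The dictionary itself comes out of a standard Huffman construction over the frequency list. A minimal sketch with heapq (Huffman codes are not unique, so this may assign different but equally optimal bit patterns than the codeDict above):

import heapq

def huffman_code(freq_tuples):
    # Heap entries are (weight, tiebreaker, tree); the integer tiebreaker
    # keeps tuple comparison from ever reaching the tree nodes.
    heap = [(w, i, sym) for i, (w, sym) in enumerate(freq_tuples)]
    heapq.heapify(heap)
    count = len(heap)
    while len(heap) > 1:
        w1, _, left = heapq.heappop(heap)
        w2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (w1 + w2, count, (left, right)))
        count += 1
    codes = {}
    def walk(node, prefix):
        if isinstance(node, tuple):      # internal node: (left, right)
            walk(node[0], prefix + '0')
            walk(node[1], prefix + '1')
        else:                            # leaf: a symbol
            codes[node] = prefix or '0'  # '0' covers the one-symbol edge case
    walk(heap[0][2], '')
    return codes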
I then use the Python bitstring package to translate the string, character by character, into an instance of the BitArray class, which I call bitArray; it holds the bits for each character encoded with its respective Huffman code:
>>> print bitArray.bin
0b001000010100100110101100111100110101101100000100101100000001101010100000010000010111
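The translation step looks roughly like this (a sketch, assuming the sequence and codeDict values from above; encode is my own naming):

from bitstring import BitArray

def encode(seq, code_dict):
    # Join the per-symbol code strings into one binary string,
    # then build a single BitArray from it.
    return BitArray(bin=''.join(code_dict[sym] for sym in seq))

bitArray = encode(sequence, codeDict)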
Here is the bit array in bytes:
>>> print bitArray.tobytes()
!I\254\363[^D\260^Z\240Ap
I must use tobytes() instead of bytes, as the bit array I generate does not divide evenly into 8-bit segments.
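A small example of the difference: per the bitstring documentation, tobytes() zero-fills the final partial byte, while the bytes property refuses lengths that aren't a multiple of 8:

from bitstring import BitArray

b = BitArray(bin='101')   # 3 bits: not a whole number of bytes
print(b.tobytes())        # '\xa0': the 3 bits padded with five zero bits
# b.bytes would raise an error here, since 3 is not a multiple of 8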
When I calculate the storage efficiency of the BitArray representation (the ratio of the sizes of the bit array and the input string), I get worse performance than if I had left the input string unencoded:
>>> sys.getsizeof(bitArray.tobytes()) / float(len(sequence))
1.2972972973
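One thing I noticed while measuring: sys.getsizeof reports the size of the whole Python object, not just its payload, so it includes the interpreter's per-object overhead (exact numbers vary by Python version and platform):

import sys

payload = bitArray.tobytes()
print(len(payload))            # 11: the raw byte count
print(sys.getsizeof(payload))  # 48 in the measurement above: header plus payload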
Am I measuring storage efficiency correctly? (If I encode longer input strings, the ratio improves, but it seems to approach an asymptotic limit of around 0.28; I'd like to confirm that this is the right way to measure it.)
Edit
The following two approaches yield different answers:
>>> print len(bitArray.tobytes()) / float(len(mergedSequence))
0.297297297297
>>> print bitArray.len / (8.*len(mergedSequence))
0.283783783784
I'm not sure which to believe, but since writing the data to storage requires the byte representation, I'm inclined towards the first result.
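For reference, both numbers can be reproduced by hand from the code lengths in codeDict (15 A's at 1 bit; 7 T's, 6 C's, and 6 G's at 3 bits; 2 N's and 1 '|' at 4 bits; 37 symbols total), which suggests the gap between them is exactly the zero-padding that tobytes() adds to complete the final byte:

bits = 15*1 + 7*3 + 6*3 + 6*3 + 2*4 + 1*4  # 84 bits of Huffman output
print(bits / (8.0 * 37))                   # 0.283783783784: exact bit count
print(-(-bits // 8) / 37.0)                # 0.297297297297: 11 whole bytes after padding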