
I have the following string that I would like to Huffman-encode and store efficiently into a bit array:

>>> print sequence
GTCAGGACAAGAAAGACAANTCCAATTNACATTATG|

The frequencies of the symbols in sequence are:

>>> print freqTuples
[(0.40540540540540543, 'A'), (0.1891891891891892, 'T'), (0.16216216216216217, 'C'), (0.16216216216216217, 'G'), (0.05405405405405406, 'N'), (0.02702702702702703, '|')]

I translate this into a Huffman code dictionary:

>>> print codeDict
{'A': '1', 'C': '010', 'G': '001', 'N': '0110', 'T': '000', '|': '0111'}
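(For reference, a code table like this can be derived with a short heapq-based sketch. The symbol weights below are the counts in the 37-character sequence; the exact 0/1 labels may come out differently from my dictionary, but the code lengths match.)

```python
import heapq
from itertools import count

def huffman_codes(weights):
    """Build a Huffman code table from {symbol: weight}."""
    tiebreak = count()  # keeps heap entries comparable when weights tie
    heap = [(w, next(tiebreak), {s: ''}) for s, w in weights.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        w1, _, c1 = heapq.heappop(heap)
        w2, _, c2 = heapq.heappop(heap)
        # prefix one subtree's codes with '0', the other's with '1', and merge
        merged = {s: '0' + code for s, code in c1.items()}
        merged.update({s: '1' + code for s, code in c2.items()})
        heapq.heappush(heap, (w1 + w2, next(tiebreak), merged))
    return heap[0][2]

# symbol counts in the 37-character sequence
codes = huffman_codes({'A': 15, 'T': 7, 'C': 6, 'G': 6, 'N': 2, '|': 1})
print({s: len(c) for s, c in sorted(codes.items())})
# code lengths: A=1; C, G, T=3; N, |=4
```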

I then used the Python bitstring package to translate the string, character by character, into an instance of the BitArray class (which I call bitArray), containing the bits of each character's Huffman code:

>>> print bitArray.bin
0b001000010100100110101100111100110101101100000100101100000001101010100000010000010111

Here is the bit array in bytes:

>>> print bitArray.tobytes()
!I\254\363[^D\260^Z\240Ap

I must use tobytes() instead of bytes, as the bit array I generate does not divide evenly into 8-bit segments.
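(The packing that tobytes() performs can be reproduced with the standard library alone; this is just a sketch of the zero-padding behaviour, not the bitstring API.)

```python
codeDict = {'A': '1', 'C': '010', 'G': '001', 'N': '0110', 'T': '000', '|': '0111'}
sequence = 'GTCAGGACAAGAAAGACAANTCCAATTNACATTATG|'

bits = ''.join(codeDict[c] for c in sequence)  # 84 bits
padded = bits + '0' * (-len(bits) % 8)         # pad to a whole byte, as tobytes() does
packed = bytes(bytearray(int(padded[i:i + 8], 2)
                         for i in range(0, len(padded), 8)))
print(len(bits), len(packed))                  # 84 bits -> 11 bytes
```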

When I calculate the storage efficiency of the BitArray representation (the ratio of the sizes of the bit array and the input string), I get worse performance than if I had left the input string unencoded:

>>> sys.getsizeof(bitArray.tobytes()) / float(len(sequence))
1.2972972973

Am I measuring storage efficiency correctly? (If I encode longer input strings, this ratio improves, but it seems to approach an asymptotic limit of around 0.28. I'd like to confirm if this is the right way to measure things.)

Edit

The following two approaches yield different answers:

>>> print len(bitArray.tobytes()) / float(len(mergedSequence))
0.297297297297

>>> print bitArray.len / (8.*len(mergedSequence))
0.283783783784

I'm not sure which to believe. But in the process of writing data to storage, I think I would need the byte representation, which makes me inclined towards choosing the first result.
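(Both numbers can be reproduced directly from the code lengths and symbol counts, which is a useful sanity check; the counts below are those of the 37-character sequence.)

```python
counts = {'A': 15, 'T': 7, 'C': 6, 'G': 6, 'N': 2, '|': 1}  # symbol counts in the sequence
lengths = {'A': 1, 'T': 3, 'C': 3, 'G': 3, 'N': 4, '|': 4}  # Huffman code lengths

total_bits = sum(counts[s] * lengths[s] for s in counts)     # 84
total_chars = sum(counts.values())                           # 37
whole_bytes = (total_bits + 7) // 8                          # 11, after padding

print(total_bits / (8.0 * total_chars))     # 0.283783... (bit-level ratio)
print(whole_bytes / float(total_chars))     # 0.297297... (byte-level ratio)
```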

Alex Reynolds
  • It depends on whether the goal is to estimate the storage efficiency for arbitrary-length strings or the storage efficiency for your specific 37-character string. If it's the latter, .297 is the correct answer. If you're looking for the more general result, .283 is probably closer to what you'd get with either method for much longer strings. The unused 0-7 bits at the end of the bitstring become insignificant as the total length of the string increases. – Dave Nov 08 '11 at 00:41
  • About your last comment. `(8*11) / (8*37) = 0.297297297297` and `84 / (8*37) = 0.283783783784` – ypercubeᵀᴹ Nov 08 '11 at 00:41
  • Regarding your edit, the answer is that both are basically correct. A short string won't be a good proxy for the compression you'll get in a long string, because there isn't enough info to actually choose the most efficient codes for the true ratio of the symbols in the datastream. – agf Nov 08 '11 at 00:43
  • I realize that a short string won't give me a good answer — I want to make sure I fully understand how to calculate efficiency within the Python framework, so that I can rely on the answer I get when I scale up or test other methods. – Alex Reynolds Nov 08 '11 at 01:17
  • Sorry to resuscitate a long dead thread, but you wanted `sum((ord(c).bit_length() for c in sequence))` instead of `float(len(sequence))`, as it gets the length in bits, not just the length of the printable representation. – Marc Laugharn Jun 20 '13 at 03:32

3 Answers


I'm not really sure about the bitarray stuff, but shouldn't you just be able to do:

>>> len(bitArray.tobytes()) / float(len(sequence))

I'm not saying that will solve your problem, but it could be that the "getsizeof" thing (again, something I'm not really all that familiar with) is throwing you off.

From what you've written up there, it kind of looks like you're comparing apples to oranges a bit.

agf
  • I think you are right. From getsizeof documentation: *getsizeof() calls the object’s __sizeof__ method and adds an additional garbage collector overhead if the object is managed by the garbage collector* http://docs.python.org/dev/library/sys.html – enticedwanderer Nov 07 '11 at 23:59
>>> sys.getsizeof(bitArray.tobytes()) / float(len(sequence))
1.2972972973

This implies that the encoded version is 30% longer than the original sequence.

I don't think you want to use getsizeof here -- if you want to minimize the size of the Python object, you should be using getsizeof(sequence) as well, rather than len.

If instead, you want to do what Huffman coding is meant to do, and minimize the binary representation, then you want to use len on both (assuming the sequence is represented as one-byte-per-character).

So, your real ratio is 11 / 37.

I assume you're using Huffman coding as an exercise, as this doesn't seem like a logical way to efficiently store what is just a four-bit code with a termination character. At least it would be better to use arithmetic coding, which will allow you to use base-5 encoding instead of base-2, which is optimal for 5 possible characters.
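(A rough sketch of the base-5 idea: treat the 5-symbol alphabet as the digits of one big integer. The sequence length would have to be stored separately, and this is an illustration of the bits-per-symbol bound rather than real arithmetic coding; the symbol-to-digit mapping is an arbitrary choice.)

```python
import math

digits = {c: i for i, c in enumerate('ACGTN')}  # 5-symbol alphabet, no terminator
seq = 'GTCAGGACAAGAAAGACAANTCCAATTNACATTATG'    # 36 symbols, '|' dropped

n = 0
for c in seq:
    n = n * 5 + digits[c]                       # accumulate one base-5 digit per symbol
packed = n.to_bytes((n.bit_length() + 7) // 8, 'big')

# log2(5) ~ 2.32 bits per symbol, versus 3 bits for most symbols in the Huffman code
print(len(packed), math.ceil(len(seq) * math.log(5, 2) / 8))
```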

Really, I would assume that in a sequence long enough to be worth compressing, there is a known ratio of G:A:C:T, and/or a fixed-length 2-bit encoding will be just as efficient (as the ratios approach 1:1:1:1), since you don't really need to encode the termination character.
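(The fixed-length 2-bit scheme is easy to sketch; the mapping below is an arbitrary choice, and the sequence length would again be stored separately.)

```python
two_bit = {'G': 0, 'A': 1, 'C': 2, 'T': 3}  # any fixed 2-bit assignment works

def pack2(seq):
    """Pack a G/A/C/T string at exactly 2 bits per base."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        group = seq[i:i + 4]
        byte = 0
        for c in group:
            byte = (byte << 2) | two_bit[c]
        byte <<= 2 * (4 - len(group))       # left-align a short final group
        out.append(byte)
    return bytes(out)

print(len(pack2('GTCAGGACAAGA')))           # 12 bases -> 3 bytes, i.e. 0.25 bytes/base
```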

agf
  • I'm not sure that the ratios approach 1:1:1:1 for real data. Any link? – ypercubeᵀᴹ Nov 07 '11 at 23:55
  • I assume the ratio is known, with one trivial example being 1:1:1:1 which has a very simple optimal encoding, not that 1:1:1:1 is the ratio. – agf Nov 07 '11 at 23:58
  • I thought I noted that I was getting worse performance by encoding, than from the original string. Also, the frequencies of the symbols I am working with will not be 1:1:1:1, and I may have to deal with IUPAC symbols other than GACT (such as N), as well as a terminator character between sequences, since I can't assume the length of the sequence ahead of time. Any other ideas? – Alex Reynolds Nov 08 '11 at 00:03
  • According to Python documentation (http://docs.python.org/dev/library/sys.html#sys.getsizeof ) it looks like `sys.getsizeof` should return the size of the object in bytes. Why would it not work correctly here, when I give it the byte representation of the bit array? – Alex Reynolds Nov 08 '11 at 00:06
  • @AlexReynolds [Arithmetic coding](https://secure.wikimedia.org/wikipedia/en/wiki/Arithmetic_coding) will be more efficient if there are not 2^n symbols (and equally efficient if there are). Also, have you tried just [zipping](http://docs.python.org/library/zipfile.html)? – agf Nov 08 '11 at 00:07
  • Because the `sizeof` is the size of the entire object. A zero character string doesn't have size zero -- there is overhead. As I mentioned in my answer, if you're trying to make the Python object smaller, you want to use `sizeof` on both sides, and if you want to minimize the binary representation of the data, for storage to disk or transmission or something, then use `len` for both. – agf Nov 08 '11 at 00:10

You know that the answer is wrong, because the Huffman dictionary is less than 4 bits per character, so the real answer must be less than .5. If the dictionary and character frequencies don't change for longer strings, the compression ratio shouldn't decrease toward an asymptotic limit as the string gets longer.

From the documentation of sys:

"getsizeof() calls the object’s __sizeof__ method and adds
 an additional garbage collector overhead if the object is
 managed by the garbage collector."

You need a function that will return the length of the bitstring itself, not the bitstring plus overhead. The bitstring documentation says that the len or length property returns the length in bits. So try:

bitArray.len / (8. * len(sequence))
Dave
  • He's not running it on the bitarray, but the bytes returned by `.tobytes()` (correctly, because that's how it will have to be stored anyway), so this is wrong -- as I said in my answer, it should just be `len` for both. – agf Nov 08 '11 at 00:12
  • For short strings, the length in bits will more closely approximate the compression ratio obtained for long strings, where the padding overhead of 0-7 bits at the end becomes a negligible portion of the total. – Dave Nov 08 '11 at 00:37