2

As a follow up question related to my question regarding efficient way of storing huffman tree's I was wondering what would be the fastest and most efficient way of searching a binary tree (based on the Huffman coding output) and storing the path taken to a particular node.

This is what I currently have:

  • Add root node to queue
  • while queue is not empty, pop item off queue
    • check if it is what we are looking
      • yes: Follow a head pointer back to the root node, while on each node we visit checking whether it is the left or right and making a note of it.
      • break out of the search
    • enqueue left, and right node

Since this is a Huffman tree, all of the entries that I am looking for will exist. The above is a breadth first search, which is considered the best for Huffman trees since items that are in the source more often are higher up in the tree to get better compression, however I can't figure out a good way to keep track of how we got to a particular node without backtracking using the head pointer I put in the node.

In this case, I am also getting all of the right/left paths in reverse order, for example, if we follow the head to the root, and we find out that from the root it is right, left, left, we get left, left, right. or 001 in binary, when what I am looking for is to get 100 in an efficient way.

Storing the path from root to the node as a separate value inside the node was also suggested, however this would break down if we ever had a tree that was larger than however many bits the variable we created for that purpose could hold, and at that point storing the data would also take up huge amounts of memory.

Community
  • 1
  • 1
X-Istence
  • 16,324
  • 6
  • 57
  • 74
  • About storing the bitstring, can't you use a vector to store the bitstring? It stores the bits compactly, and can store any number of them. – newacct Apr 30 '09 at 18:20
  • "and at that point storing the data would also take up huge amounts of memory." Usually Huffman coding is used for encoding bytes (technically, symbols from some alphabet). So how can it get big? There are only 256 possible bytes. – newacct Apr 30 '09 at 18:25
  • @newacct: Sure, but a vector takes up 1 byte per bool, a std::bitset would be better. You make a good point about there only being 256 if using bytes, we have another assignment due that reads int's from disk, that leaves 2^32 possible options, at which point the memory requirements become really big, hence my question about the fastest way to retrieve a path from a binary tree that already exists. – X-Istence Apr 30 '09 at 20:29
  • About the vector, no actually it doesn't. a vector takes up 1 byte per 8 bools. And bitset won't work because its size has to be fixed at compile time, whereas your bitstrings have to be different lengths for every symbol at runtime. Since your Huffman tree fits in memory, I'm assuming that the number of actual symbols used in the encoding is small compared to 2^32. In that case you should be able to use some kind of trie, which takes memory proportional to the number of entries. – newacct Apr 30 '09 at 21:02

2 Answers2

2

Create a dictionary of value -> bit-string, that would give you the fastest lookup.

If the values are a known size, you can probably get by with just an array of bit-strings and look up the values by their index.

Lasse V. Karlsen
  • 380,855
  • 102
  • 628
  • 825
  • This still has the issue that the bit-string has to be stored, now we moved it from having the bit-string stored in a leaf node to a dictionary, with basically the same memory trade-off. – X-Istence Apr 30 '09 at 17:09
  • But you did ask for the fastest way, didn't you? – Lasse V. Karlsen Apr 30 '09 at 18:19
  • Sure, I absolutely did, but I asked specifically for a binary tree. Pre-computing the result was not allowed because I specifically excluded it with the last paragraph in my question. – X-Istence Apr 30 '09 at 20:28
0

If you're decoding Huffman-encoded data one bit at a time, your performance will be poor. As much as you'd like to avoid using lookup tables, that's the only way to go if you care about performance. The way Huffman codes are created, they are left-to-right unique and lend themselves perfectly to a fast table lookup.

BitBank
  • 8,500
  • 3
  • 28
  • 46