3

I'm fairly new to coding and I'm having difficulty creating a huffman algorithm to encode and decode text files. I understand most of the concepts pretty well but there's not much on exactly how you would create and traverse the tree.

Here's my code so far:

with open(input('enter a file: ')) as name:
    fh = name.read()
    print(fh)

#create the frequency dicitonary
freqdict = {}
for ch in fh:
    if ch in freqdict:
        freqdict[ch] += 1
    else:
        freqdict[ch] = 1
freqdict = sorted(freqdict.items(), key = lambda x: 
x[1], reverse = True)
print(freqdict)

class Node:
    def __init__(self, left = None, right = None, 
data):
        self.left = left
        self.right = right
        self.data = data

    def children(self):
        return (self.left, self.right)

    def nodes(self):
        return (self.left, self.right)

    def __str__(self):
        return str(self.left, self.right)
  • 3
    There is no answerable question in what you've posted so far... What are you trying to achieve, why isn't the code you've posted doing what you want? (what _is_ it doing?) – thebjorn Mar 07 '19 at 21:00
  • 1
    @thebjorn its at the end: what are the steps for creating and traversing the tree? there isnt much out there that clearly describes it without just giving me a cut and past answer –  Mar 07 '19 at 21:10

2 Answers2

1

Modified version of : https://rosettacode.org/wiki/Huffman_coding#Python

This is a Huffman encoder/decoder for any message in 'txt'

This encodes the txt message into a shortened binary variable for storage ( You can store the compressed_binary to disk. You can also decode the compressed_binary using decompressHuffmanCode, which recreates the original string from the compressed string of compressed_binary

from heapq import heappush, heappop, heapify
from collections import defaultdict
from functools import reduce

def encode(symb2freq):
    heap = [[wt, [sym, ""]] for sym, wt in symb2freq.items()]
    heapify(heap)
    while len(heap) > 1:
        lo = heappop(heap)
        hi = heappop(heap)
        for pair in lo[1:]:
            pair[1] = '0' + pair[1]
        for pair in hi[1:]:
            pair[1] = '1' + pair[1]
        heappush(heap, [lo[0] + hi[0]] + lo[1:] + hi[1:])
    return dict(sorted(heappop(heap)[1:], key=lambda p: (p, len(p[-1]))))

# recreates the original message from your huffman code table 
# uncomment print(a) to see how it works
def decompressHuffmanCode(a, bit):
    # print(a)
    return ('', a[1] + s[a[0]+bit[0]]) if (a[0]+bit[0] in s) else (a[0]+bit[0], a[1])

txt="CompresssionIsCoolWithHuffman"

# Create symbol to frequency table
symb2freq = defaultdict(int)
for ch in txt:
    symb2freq[ch] += 1
enstr=encode(symb2freq)

# Create Huffman code table from frequency table
s=dict((v,k) for k,v in dict(enstr).items())

# Create compressible binary. We add 1 to the front, and remove it when read from disk
compressed_binary = '1' + ''.join([enstr[item] for item in txt])

# Read compressible binary so we can uncompress it. We strip the first bit.
read_compressed_binary = compressed_binary[1:]

# Recreate the compressed message from read_compressed_binary
remainder,bytestr = reduce(decompressHuffmanCode, read_compressed_binary, ('', ''))
print(bytestr)

which results in:

CompresssionIsCoolWithHuffman

This is a quick implementation that should help. Things that can be dealt for programmatically is the buffer, but i just wanted to show you a quick implementation using your frequency codes

oppressionslayer
  • 6,942
  • 2
  • 7
  • 24
0

I think it will be sufficient to use the python dictionary structure to represent trees and nodes. You don't really need a separate class for it.

You want to initialize all your nodes:

def huffman_tree(freq_dict):
    vals = freq_dict.copy()
    nodes = {}
    for n in vals.keys():
        nodes[n] = []

Here we have initialized a dictionary nodes to represent nodes and leafs. Let's populate it with data; within the same function:

    while len(vals) > 1:
        s_vals = sorted(vals.items(), key=lambda x:x[1]) 
        a1 = s_vals[0][0]
        a2 = s_vals[1][0]
        vals[a1+a2] = vals.pop(a1) + vals.pop(a2)
        nodes[a1+a2] = [a1, a2]

You see that I first now sort the data from our frequency-dictionary, ascending. You did it earlier, so you shouldn't have to do that now (although you did it descending). Though, doing it later rather than sooner, and within such a while-loop, lets you be more freely with which frequency-dictionary you pass to your program. What we're doing here furthermore, is that we're taking two and two items from the freq_dict when sorted, adding them together and storing them in the freq_dict.

Now we need to go through our freq_dict, and construct some sort of symbols-dictionary, representing a rules-set for exchanging text with symbols. Still within the same function:

    symbols = {} # this will keep our encoding-rules
    root = a1+a2 # a1 and a2 is our last visited data,
                 # therefore the two largest values
    tree = label_nodes(nodes, root, symbols)

    return symbols, tree

The line with tree = ... might seem like a bit of magic here, but that's because we haven't created the function yet. But imagine there's a function that recursively go through each node from the root to the leafs, adding a string-prefix of either '0' or '1' representing the encoded symbol (and that's why we sorted ascending, so that we got the most frequent word at the top, receiving the smallest encoded symbol):

def label_nodes(nodes, label, symbols, prefix = ''):        
    child = nodes[label]
    tree = {}
    if len(child) == 2:
        tree['0'] = label_nodes(nodes, child[0], symbols, prefix+'0')
        tree['1'] = label_nodes(nodes, child[1], symbols, prefix+'1')     
        return tree
    else: # leaf
        symbols[label] = prefix
        return label

This function does just that. And now we're ready to use it:

def huffman_encode(string, symbols):
    return ''.join([symbols[str(e)] for e in string])

text =  '''This is a simple text, made to illustrate how
        a huff-man encoder works. A huff-man encoder works
        best when the text is of reasonable length and has
        repeating patterns in its language.'''

fd = freq_dict(text)    
symbols, tree = huffman_tree(fd)    
huffe = huffman_encode(text, symbols)
print(huffe)

Out: 001001011001110100011011101000110110100100111101011000110011010010000011011000110000110110010011011100011010111010000011011110110111010100101001011101100111011111001010101100001110011101111110010011101010101010101011010011100111101100101001011111100110001101010000100010001111101110111110100001110001111100110111110011111100011111111101110000001110011110110010100101111110011000110101000010001000111110111011111010000111000111110011011111001111110001110011101010101010101010010000000011101101111100110010001000011011110010000110110001100001101101110100011011101100101011110000010100011110111000101000100010010000011001000010001111011011110010110101000111010011100110100011100111010101010101010111100000100110000101010111101010001111010110011010101011101100011100100000110111010100001110101011001101100101010100011110111101110101111010001111111

Decoding is a simple matter of traversing the tree:

def huffman_decode(encoded, tree, string=False):  
    decoded = []
    i = 0
    while i < len(encoded):
        sym = encoded[i]
        label = tree[sym]
        # Continue untill leaf is reached
        while not isinstance(label, str):
            i += 1
            sym = encoded[i]
            label = label[sym]        
        decoded.append(label)        
        i += 1
    if string == True:
        return ''.join([e for e in decoded])
    return decoded

print(huffman_decode(huffe, tree, string=True))

Out: This is a simple text, made to illustrate how a huff-man encoder works. A huff-man encoder works best when the text is of reasonable length and has repeating patterns in its language.

This answer is in large parts stolen from my own GitHub: https://github.com/RoyMN/Python/tree/master/tools/text_handler

RoyM
  • 735
  • 3
  • 14