Substitute substrings using a dictionary in Python

Question

I have this string

message = '10100010011'

and this dictionary

codes = {97: '1', 98: '01', 107: '001', 114: '000'}

and I need to substitute the original message using the dictionary to something like this

[97, 98, 114, 97, 107, 97]

I tried my own way, which works, but when I use some REALLY large strings, it's just really slow. Is there any faster way to do this than this?

    codes = dict(zip(codes.values(), codes.keys()))
    decoded_mess = []
    pom = ""
    for i in message:
        pom += i
        if pom in codes:
            decoded_mess.append(codes[pom])
            pom = ""

I saw the answers here Easiest way to replace a string using a dictionary of replacements? and I tried that, but that did not work for me. Maybe because they are dealing with whole words, but I have 1 long string of 1s and 0s.

Isn't your dictionary the wrong way around? And shouldn't the keys (or values, currently) all have a fixed length? — jonrsharpe, Nov 03 '15 at 16:04
Given the current constraints, you can't do it much more efficiently. If you had fixed-length keys, it would be easier (as you could slice the string into the appropriate lengths to start, rather than building it back up until it matches). — jonrsharpe, Nov 03 '15 at 16:09
@jonrsharpe Those keys (1, 01, 001, 000) look like it might be Huffman coding. — 15ee8f99-57ff-4f92-890c-b56153, Nov 03 '15 at 16:09
Your solution is O(n), I don't think you can have a better complexity time for this. — Maroun, Nov 03 '15 at 16:12
@jonrsharpe It's [very cool](https://en.wikipedia.org/wiki/Huffman_coding). And looking at the dictionary and the array of ASCII values, it looks like that is the case. It's the string "abraka" encoded in 11 bits, not counting the key table of course. — 15ee8f99-57ff-4f92-890c-b56153, Nov 03 '15 at 16:20
Stackoverflow detectives got it right :) It is Huffman coding. I am making my own implementation and now I am trying to optimize it and I lose a lot of time on this decoding, so I am wondering if there is anything I can do about it. — Arcane, Nov 03 '15 at 16:23
See http://stackoverflow.com/questions/2235208/how-to-decode-huffman-code-quickly — NPE, Nov 03 '15 at 16:26
The biggest file I tried is 50MB pdf file, which creates dict of len = 256 and the string length = 422 702 808 — Arcane, Nov 03 '15 at 16:40
Creating a cffi / Ctypes or Cython wrapper to call an existing library (such as https://github.com/drichardson/huffman) might be the next optimisation step here if you really need the speed. — James, Nov 03 '15 at 17:12

score -1 · Answer 1 · answered Nov 03 '15 at 16:26

First of, the codes dictionary should be backward to make looking up easier. My strategy is to scan the message one character at a time. If a replacement found, return it. If not, add the next character and look up again. Keep doing this until a replacement found or the message is exhausted.

def seach_replace(buffer, codes):
    codes = {v: k for k, v in codes.items()}  # Reverse the key, value
    text_so_far = ''
    for c in buffer:
        text_so_far += c
        if text_so_far in codes:
            yield codes[text_so_far]
            text_so_far = ''
    if text_so_far:
        yield text_so_far

if __name__ == '__main__':
    message = '10100010011'
    codes = {97: '1', 98: '01', 107: '001', 114: '000'}
    print(list(seach_replace(message, codes)))

Output:

[97, 98, 114, 97, 107, 97]

How is this *any* different from OP's strategy? Apart from the semantics like using a generator? — Shashank, Nov 03 '15 at 16:27
@Shashank: A generator uses less memory by not creating an accumulated list. — Hai Vu, Nov 03 '15 at 16:36

Substitute substrings using a dictionary in Python

1 Answers1