8

I have this string

message = '10100010011'

and this dictionary

codes = {97: '1', 98: '01', 107: '001', 114: '000'}

and I need to substitute the original message using the dictionary to something like this

[97, 98, 114, 97, 107, 97]

I tried my own way, which works, but when I use some REALLY large strings, it's just really slow. Is there any faster way to do this than this?

    codes = dict(zip(codes.values(), codes.keys()))
    decoded_mess = []
    pom = ""
    for i in message:
        pom += i
        if pom in codes:
            decoded_mess.append(codes[pom])
            pom = ""

I saw the answers here Easiest way to replace a string using a dictionary of replacements? and I tried that, but that did not work for me. Maybe because they are dealing with whole words, but I have 1 long string of 1s and 0s.

Community
  • 1
  • 1
Arcane
  • 669
  • 1
  • 10
  • 25
  • 3
    Isn't your dictionary the wrong way around? And shouldn't the keys (or values, currently) all have a fixed length? – jonrsharpe Nov 03 '15 at 16:04
  • @jonrsharpe he's swapping key-values in the solution. – Maroun Nov 03 '15 at 16:07
  • @MarounMaroun oh... then why show that separately?! – jonrsharpe Nov 03 '15 at 16:08
  • Given the current constraints, you can't do it much more efficiently. If you had fixed-length keys, it would be easier (as you could slice the string into the appropriate lengths to start, rather than building it back up until it matches). – jonrsharpe Nov 03 '15 at 16:09
  • 1
    @jonrsharpe Those keys (1, 01, 001, 000) look like it might be Huffman coding. – 15ee8f99-57ff-4f92-890c-b56153 Nov 03 '15 at 16:09
  • @EdPlunkett ah, I'm not familiar with that – jonrsharpe Nov 03 '15 at 16:10
  • Your solution is O(n), I don't think you can have a better complexity time for this. – Maroun Nov 03 '15 at 16:12
  • @jonrsharpe It's [very cool](https://en.wikipedia.org/wiki/Huffman_coding). And looking at the dictionary and the array of ASCII values, it looks like that is the case. It's the string "abraka" encoded in 11 bits, not counting the key table of course. – 15ee8f99-57ff-4f92-890c-b56153 Nov 03 '15 at 16:20
  • Stackoverflow detectives got it right :) It is Huffman coding. I am making my own implementation and now I am trying to optimize it and I lose a lot of time on this decoding, so I am wondering if there is anything I can do about it. – Arcane Nov 03 '15 at 16:23
  • How big's your `codes` dict in real life? – NPE Nov 03 '15 at 16:24
  • See http://stackoverflow.com/questions/2235208/how-to-decode-huffman-code-quickly – NPE Nov 03 '15 at 16:26
  • Questions about performance are best suited for codereview. – Avinash Raj Nov 03 '15 at 16:32
  • The biggest file I tried is 50MB pdf file, which creates dict of len = 256 and the string length = 422 702 808 – Arcane Nov 03 '15 at 16:40
  • Creating a cffi / Ctypes or Cython wrapper to call an existing library (such as https://github.com/drichardson/huffman) might be the next optimisation step here if you really need the speed. – James Nov 03 '15 at 17:12

1 Answers1

-1

First of, the codes dictionary should be backward to make looking up easier. My strategy is to scan the message one character at a time. If a replacement found, return it. If not, add the next character and look up again. Keep doing this until a replacement found or the message is exhausted.

def seach_replace(buffer, codes):
    codes = {v: k for k, v in codes.items()}  # Reverse the key, value
    text_so_far = ''
    for c in buffer:
        text_so_far += c
        if text_so_far in codes:
            yield codes[text_so_far]
            text_so_far = ''
    if text_so_far:
        yield text_so_far

if __name__ == '__main__':
    message = '10100010011'
    codes = {97: '1', 98: '01', 107: '001', 114: '000'}
    print(list(seach_replace(message, codes)))

Output:

[97, 98, 114, 97, 107, 97]
Hai Vu
  • 37,849
  • 11
  • 66
  • 93