I have a list of about 50,000 words, and I want to apply a function to each item in the list. I then want to store each original word as a key and its translated word as the corresponding value in a dictionary. Right now I know I can do this:

translations = {word: translate(word) for word in word_list}

But I think this takes too long. Is there a faster way to accomplish this?

vkumar
  • How do you anticipate it getting faster? – miradulo May 15 '16 at 15:07
  • Not sure, just wondering. Right now it seems to take quite a while and I just thought there might be a more efficient way. – vkumar May 15 '16 at 15:08
  • It's very likely that the majority of your time is spent inside of `translate`. – chthonicdaemon May 15 '16 at 15:19
  • Have you tried the `map()` function? – Pouria May 15 '16 at 15:38
  • Can you measure how much time `translate(word)` takes for each word? If most of the time is spent there, you may need to improve the code inside it. – Gunjan May 15 '16 at 15:51
  • Thanks, I rewrote the translate function, and it improved the speed greatly. – vkumar May 15 '16 at 16:06
  • The lesson here is knowing what to optimize so you don't waste your time doing it to code that doesn't matter, which is fairly easy to do in Python. See [_How can you profile a Python script?_](http://stackoverflow.com/questions/582336/how-can-you-profile-a-python-script) – martineau May 15 '16 at 17:47
  • You may consider profiling your code, to see where it spends its time. Try https://github.com/rkern/line_profiler et al. – boardrider May 16 '16 at 09:55
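
To make the profiling suggestions above concrete, here is a minimal sketch using the standard-library `cProfile` and `pstats` modules. The `translate` below is a hypothetical stand-in (the real function is not shown in the question), and the generated `word_list` is only test data; substitute your own.

```python
import cProfile
import io
import pstats


def translate(word):
    # Hypothetical stand-in for the real (unknown) translate function
    return word.upper()


# Assumed test data: 50,000 distinct words
word_list = [f"word{i}" for i in range(50000)]

# Profile only the dictionary construction
profiler = cProfile.Profile()
profiler.enable()
translations = {word: translate(word) for word in word_list}
profiler.disable()

# Collect the report in a string, sorted by cumulative time
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(5)
report = stream.getvalue()
print(report)
```

If `translate` dominates the cumulative time column, as the comments suspect, then changing how the dictionary is built will not help much.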

2 Answers

Using `map()` can be slightly faster than a dict comprehension, because the per-item loop runs inside the interpreter's C code rather than in Python bytecode:

translations = dict(zip(word_list, map(translate, word_list)))

What happens here is:

  • `map` applies the function to each element of word_list and returns a lazy map object
  • `zip` pairs each original word with its translated counterpart, producing an iterator of (word, translation) tuples
  • `dict` converts those pairs into a dictionary
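
The three steps above can be unrolled on a tiny list to see each intermediate object. This is only an illustration, assuming Python 3; `translate` here is a hypothetical stand-in that upper-cases its input.

```python
def translate(word):
    # Hypothetical stand-in for the real translate function
    return word.upper()


word_list = ["apple", "pear", "plum"]

# Step 1: lazy iterator of translated words (nothing is computed yet)
mapped = map(translate, word_list)

# Step 2: lazy iterator of (original, translated) pairs
pairs = zip(word_list, mapped)

# Step 3: consuming the pairs builds the dictionary
translations = dict(pairs)
print(translations)  # {'apple': 'APPLE', 'pear': 'PEAR', 'plum': 'PLUM'}
```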

After setting up a test program, it appears that there is a slight performance improvement. This is the code:

from time import perf_counter

def translate(wo):
    # Cheap stand-in for the real translation function
    return wo.upper()

# Test data: a list of 50,000 distinct words
word_list = ['word' + str(i) for i in range(50000)]

start = perf_counter()
translations = dict(zip(word_list, map(translate, word_list)))
print('map + zip:', perf_counter() - start)

start = perf_counter()
translations = {word: translate(word) for word in word_list}
print('dict comprehension:', perf_counter() - start)

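After a few runs, the second printed time was consistently greater than the first, which suggests the map version is slightly faster for this cheap translate. If translate itself is slow, the difference between the two approaches will be negligible by comparison.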
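
Single wall-clock measurements are noisy, so a somewhat more reliable way to compare the two variants is the standard-library `timeit` module. The following is a sketch under the same assumptions as the test program (a trivial stand-in `translate` and generated test data); the helper names `with_map` and `with_comprehension` are made up for illustration, and the numbers will vary by machine and Python version.

```python
import timeit


def translate(word):
    # Hypothetical stand-in for the real translate function
    return word.upper()


# Assumed test data: 50,000 distinct words
word_list = [f"word{i}" for i in range(50000)]


def with_map():
    return dict(zip(word_list, map(translate, word_list)))


def with_comprehension():
    return {word: translate(word) for word in word_list}


# Take the best of 5 repeats, each timing 10 calls
map_time = min(timeit.repeat(with_map, number=10, repeat=5))
comp_time = min(timeit.repeat(with_comprehension, number=10, repeat=5))

print(f"map + zip:          {map_time:.4f} s per 10 calls")
print(f"dict comprehension: {comp_time:.4f} s per 10 calls")
```

Taking the minimum of several repeats reduces the influence of other processes on the measurement.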

illright

If you only need a few values and won't iterate over the dict, you can try translating lazily, on first lookup:

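class MyDefaultDict(dict):
    def __init__(self, word_iterable, translate):
        super().__init__()
        # Only words from the original list may be looked up
        self.word_set = frozenset(word_iterable)
        self.translate = translate

    def __missing__(self, key):
        # Called by dict.__getitem__ when key has not been translated yet
        if key in self.word_set:
            translated = self.translate(key)
            self[key] = translated  # cache so translate runs once per word
            return translated
        raise KeyError(key)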
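
A short usage sketch of the lazy approach, assuming Python 3. `LazyTranslations` is a hypothetical, condensed restatement of the class above so that the snippet runs on its own, and `translate` is a stand-in that records each call to show that caching works.

```python
calls = []


def translate(word):
    # Hypothetical stand-in that records every call
    calls.append(word)
    return word.upper()


class LazyTranslations(dict):
    def __init__(self, words, translate):
        super().__init__()
        self.words = frozenset(words)
        self.translate = translate

    def __missing__(self, key):
        # Invoked by lazy[key] only when key is not stored yet
        if key not in self.words:
            raise KeyError(key)
        value = self[key] = self.translate(key)
        return value


lazy = LazyTranslations(["apple", "pear", "plum"], translate)

first = lazy["pear"]   # translated on first access
second = lazy["pear"]  # served from the dict, translate is not called again

try:
    lazy["kiwi"]       # not in the original word list
    unknown_raises = False
except KeyError:
    unknown_raises = True

print(first, second, calls, unknown_raises)
```

Note that `__missing__` is only triggered by subscript access: `"apple" in lazy` and `lazy.get("apple")` do not translate anything, which is why this approach does not suit code that iterates over the dict.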
GingerPlusPlus