I'm making my own rhymer is Python using NLTK. The CMU-dict has over 130,000 entries in a format like this:
[['griffon', ['G', 'R', 'IH1', 'F', 'AH0', 'N']],
['griffy', ['G', 'R', 'IH1', 'F', 'IY0']],
['grigas', ['G', 'R', 'AY1', 'G', 'AH0', 'Z']]]
These are word, pron(unciation) pairs. I manipulate the prons (maybe switch a 'G' for a 'T') and check if that's a word. I do this by using this function:
all_word_prons = get_word_pron_pairs_in_cmu_dict()
def pron_is_a_word(pronunciation):
for word, pron in all_word_prons:
if pronunciation == pron:
return True
else:
return False
all_word_prons is a Python Pickle file that is 10mb and contains all 130k entries
If I perform this look up 1000 times, it will take around 23 seconds, which isn't completely bad considering the task but there has to be a better algorithm. I've seen people recommend sets on bisects on other topics but those apply to simple string lookup. This is more or less checking to see if a list is equal, not a string.
I did implment some tree-like structure that contains data like this (using the example from above):
{'G': {'R': {'IH1': {'F': {'IY0': {0: 'griffy'}, 'AH0': {'N': {0: 'griffon'}}}}, 'AY1': {'G': {'AH0': {'Z': {0: 'grigas'}}}}}}}
This for some reason takes even longer than iterating through it simply. Perhaps my implementation is wrong. If you're curious:
def build_fast_idict_tree():
from nltk.corpus import cmudict
entries = cmudict.entries()
idict = {}
for entry in entries:
word, pronunciation = entry
idict_level = idict
for syl in pronunciation:
if syl not in idict_level:
idict_level[syl] = {}
idict_level = idict_level[syl]
idict_level[0] = word
return idict
def get_fast_idict_tree():
filename = "fast_idict_tree.pickle"
if os.path.isfile(filename):
list = pickle.load(open(filename, "rb"))
else:
list = build_fast_idict_tree()
pickle.dump(list, open(filename, "wb"))
return list
def lookup_in_fast_idict_tree(syls):
idict = get_fast_idict_tree()
for syl in syls:
if syl not in idict:
return False
idict= idict[syl]
return idict[0] if 0 in idict else False
TL:DR
What is the fastest way to do this sort of lookup (matching a list) in Python 3 in the year 2015?