Efficiently matching Words against a sorted Vocabulary in Python

Question

On one hand, I have an alphabetically sorted noun vocabulary (#7000)

aardvark
abacus
abbey
abbreviation
abdomen
ability
abnormal

On the other hand I have a set of words (#1E6)

['Hello', 'airport', 'really', 'sorry', 'to', 'hear', 'this'...]

What is the most efficient way to find out if a word is present in the vocabulary, and the index?

I could simply use lists/arrays and compare strings, but this does not take advantage of the alphabetic sorting of the vocabulary

Quick searching in a sorted array/list -> binary search, with log(n) complexity — h4z3, May 27 '19 at 12:16
You could use sets, or dictionaries, or [binary search](https://docs.python.org/3/library/bisect.html#searching-sorted-lists). — mkrieger1, May 27 '19 at 12:17
You can load the vocabulary as dictionary. The value can be the index. Since the lookup time of a dictionary is O(1), I think it can be quite beneficial. I mean still, loading from the file will take some time, but it will take time anyway you do it. — Mansur, May 27 '19 at 12:18

AdamGold · Answer 1 · 2019-05-27T13:28:20.393

1

You can use bisect in order to take advantage of the sorted vocabulary:

In [1]: d = ["aardvark", "abacus", "abbey", "abbreviation"]
In [2]: w = ['Hello', 'airport', 'really', 'sorry', 'to', 'hear', 'this', "aardvark"]
In [3]: for wd in w:
    ...:     try:
    ...:         index = bisect.bisect_left(d, wd)
    ...:         found = d[index]
    ...:         if found == wd:
    ...:             print(f"{wd} found at index {index}")
    ...:     except IndexError:
    ...:         pass
    ...:
aardvark found at index 0

Another option would be to use a dictionary, and search for word in set or dictionary.get(word) for the index - You can read my answer here for details about dict implementation in CPython.

edited May 27 '19 at 13:28

answered May 27 '19 at 12:31

AdamGold

4,941
4
29
47

Good and efficient solution, I will acccept if nothing better comes by – eddie May 27 '19 at 12:56
Actually the dictionnary is simpler – eddie May 27 '19 at 13:15
You can use either. You should run some tests on your actual data to see which one is faster, or just pick the one you prefer. – AdamGold May 27 '19 at 14:00

score 1 · Accepted Answer · answered May 27 '19 at 12:46

1

As commented previously:

>>> vocab = ['a', 'b', 'c']
>>> vocab_lookup = {k:v for v,k in enumerate(vocab)}

And now all you need to use is dict.get or simply dict[]

>>> 'a' in vocab_lookup
True
>>> 'd' in vocab_lookup
False
>>> vocab_lookup.get('a')
0
>>> vocab_lookup.get('d')
>>> # None

answered May 27 '19 at 12:46

Işık Kaplan

2,815
2
13
28

No need to create a dictionary. You might as well use a `set`. – AdamGold May 27 '19 at 13:02
1

Since when sets have indexes? – Işık Kaplan May 27 '19 at 13:03
Correct, forgot about the index! My bad. – AdamGold May 27 '19 at 13:04

hajef · Answer 3 · 2019-05-27T13:11:25.887

If the dictionary has unique entries (as I would expect), you can use a dict. x in dict returns true if x is a key in the given dict and (given no hash collisions) takes static time so that is the best we could ever get. It's worth mentioning that the worst case is O(n) but it is usually close to the best case. See this question for details.

To get the dict with the indices as values, use this line:

newdict = dict((k, v) for k, v in enumerate(sortedlist))

[Edit:] Note that this does not rely on a sorted list or any list at all. It will work for any iterable including open files with one word per line or the string.split() ...

If you want to keep your current data structure, you can use subtyping or docoration to keep a dict under the hood that is updated along and is used for this kind of lookup.

Efficiently matching Words against a sorted Vocabulary in Python

3 Answers3