2

On one hand, I have an alphabetically sorted noun vocabulary (#7000)

aardvark
abacus
abbey
abbreviation
abdomen
ability
abnormal

On the other hand I have a set of words (#1E6)

['Hello', 'airport', 'really', 'sorry', 'to', 'hear', 'this'...]

What is the most efficient way to find out if a word is present in the vocabulary, and the index?

I could simply use lists/arrays and compare strings, but this does not take advantage of the alphabetic sorting of the vocabulary

eddie
  • 415
  • 4
  • 13
  • Quick searching in a sorted array/list -> binary search, with log(n) complexity – h4z3 May 27 '19 at 12:16
  • 2
    You could use sets, or dictionaries, or [binary search](https://docs.python.org/3/library/bisect.html#searching-sorted-lists). – mkrieger1 May 27 '19 at 12:17
  • 1
    You can load the vocabulary as dictionary. The value can be the index. Since the lookup time of a dictionary is O(1), I think it can be quite beneficial. I mean still, loading from the file will take some time, but it will take time anyway you do it. – Mansur May 27 '19 at 12:18
  • @MensurQulami Works for me as well – eddie May 27 '19 at 12:57

3 Answers3

1

You can use bisect in order to take advantage of the sorted vocabulary:

In [1]: d = ["aardvark", "abacus", "abbey", "abbreviation"]
In [2]: w = ['Hello', 'airport', 'really', 'sorry', 'to', 'hear', 'this', "aardvark"]
In [3]: for wd in w:
    ...:     try:
    ...:         index = bisect.bisect_left(d, wd)
    ...:         found = d[index]
    ...:         if found == wd:
    ...:             print(f"{wd} found at index {index}")
    ...:     except IndexError:
    ...:         pass
    ...:
aardvark found at index 0

Another option would be to use a dictionary, and search for word in set or dictionary.get(word) for the index - You can read my answer here for details about dict implementation in CPython.

AdamGold
  • 4,941
  • 4
  • 29
  • 47
1

As commented previously:

>>> vocab = ['a', 'b', 'c']
>>> vocab_lookup = {k:v for v,k in enumerate(vocab)}

And now all you need to use is dict.get or simply dict[]

>>> 'a' in vocab_lookup
True
>>> 'd' in vocab_lookup
False
>>> vocab_lookup.get('a')
0
>>> vocab_lookup.get('d')
>>> # None
Işık Kaplan
  • 2,815
  • 2
  • 13
  • 28
0

If the dictionary has unique entries (as I would expect), you can use a dict. x in dict returns true if x is a key in the given dict and (given no hash collisions) takes static time so that is the best we could ever get. It's worth mentioning that the worst case is O(n) but it is usually close to the best case. See this question for details.

To get the dict with the indices as values, use this line:

newdict = dict((k, v) for k, v in enumerate(sortedlist))

[Edit:] Note that this does not rely on a sorted list or any list at all. It will work for any iterable including open files with one word per line or the string.split() ...

If you want to keep your current data structure, you can use subtyping or docoration to keep a dict under the hood that is updated along and is used for this kind of lookup.

hajef
  • 334
  • 2
  • 13