
I'm doing NLP in Python 3 and trying to optimize the speed of some code. The code converts a list of words to a list (or array) of numbers using a given dictionary.

For example,

mydict = {'hello': 0, 'world': 1, 'this': 2, 'is': 3, 'an': 4, 'example': 5}
word_list = ['hello', 'world']

def f(mydict, word_list):
    return [mydict[w] for w in word_list]

# f(mydict, word_list) == [0, 1]

I want to speed up the function f, especially when word_list is about 100 words long. Is it possible? Using external libraries such as nltk, spacy, numpy, etc. is OK.

Currently, it takes about 6 µs on my laptop:

>>> %timeit f(mydict, word_list*50)
6.74 µs ± 2.77 µs per loop (mean ± std. dev. of 7 runs, 100000 loops each)
ywat
  • Have you played around with the [sklearn CountVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html)? I'm pretty sure it does what you want and I would assume it's well optimized, but I have not tested (a rough sketch follows this comment thread). – it's-yer-boy-chet Aug 02 '18 at 00:06
  • If this works and you're looking for tips on optimizing, I suggest instead posting to [CodeReview](https://codereview.stackexchange.com/) – BruceWayne Aug 02 '18 at 00:32
  • @Bruce, speedup questions, especially for numpy, are perfectly fine here on SO. We answer those all the time. – hpaulj Aug 02 '18 at 01:25
  • `µs` times like that look normal. – hpaulj Aug 02 '18 at 01:28
  • Tbh I don't think you can go much further than a *list comprehension* plus `dict`'s O(1) lookup; an `itemgetter` variant of the same idea is sketched after these comments. – rafaelc Aug 02 '18 at 03:17
  • this answer, [Hash tables versus binary trees](https://cs.stackexchange.com/a/278), is interesting – xdze2 Aug 02 '18 at 19:23
  • I've heard that using a `set` of strings instead of a `list` of strings can be faster--i'm curious if this is the case for you. (https://stackoverflow.com/questions/8929284/what-makes-sets-faster-than-lists-in-python) – matt_07734 Aug 06 '18 at 18:35
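
Following up the CountVectorizer suggestion in the first comment, here is a minimal, untested sketch of that idea. The names corpus and vec are illustrative, and note that vocabulary_ is itself an ordinary Python dict, so the per-word lookup is the same O(1) operation as in f:

from sklearn.feature_extraction.text import CountVectorizer

# Fit a vocabulary on the raw text; CountVectorizer assigns
# indices alphabetically, so they will differ from mydict's values.
corpus = ['hello world this is an example']
vec = CountVectorizer().fit(corpus)
vocab = vec.vocabulary_  # a plain dict: token -> column index

print([vocab[w] for w in ['hello', 'world']])

Because the lookup step is unchanged, this mainly helps if you also want the bag-of-words counts that transform produces.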
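
On the "you can't beat a dict lookup" point, one micro-variant worth timing is operator.itemgetter from the standard library, which performs the same lookups without an explicit Python-level loop. This is a sketch, not a measured speedup:

from operator import itemgetter

mydict = {'hello': 0, 'world': 1, 'this': 2, 'is': 3, 'an': 4, 'example': 5}
word_list = ['hello', 'world'] * 50

# itemgetter with several keys returns a tuple of the looked-up values.
# Building the getter has a cost of its own, so this pays off mainly
# when the same word_list is reused across many calls.
getter = itemgetter(*word_list)
result = list(getter(mydict))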

1 Answer


There are multiple libraries that can convert a string or a list of tokens to a vector representation.

For example, with gensim:

>>> from gensim.corpora import Dictionary
>>> documents = [['hello', 'world'], ['NLP', 'is', 'awesome']]
>>> dictionary = Dictionary(documents)

# This is not necessary, but if you need to debug
# the words and their attached indices, you can do:

>>> {idx: dictionary[idx] for idx in dictionary}
{0: 'hello', 1: 'world', 2: 'NLP', 3: 'awesome', 4: 'is'}

# To get the indices of the words per document, e.g.
>>> dictionary.doc2idx('hello world'.split())
[0, 1]
>>> dictionary.doc2idx('hello world is awesome'.split())
[0, 1, 4, 3]
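
One caveat worth knowing (from gensim's documented doc2idx behavior, not shown in the answer above): tokens missing from the dictionary are mapped to -1 by default, and the unknown_word_index keyword controls that value:

>>> dictionary.doc2idx('hello unseen'.split())
[0, -1]
>>> dictionary.doc2idx('hello unseen'.split(), unknown_word_index=None)
[0, None]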
alvas
  • This code, using gensim, does not speed things up compared with the original code. On my laptop, `dictionary.doc2idx` takes 37 µs vs. 9 µs for the plain Python list comprehension (the function `f` above). – ywat Aug 03 '18 at 16:35