
I'm doing NLP in Python 3 and trying to optimize the speed of some code. The code converts a list of words to a list (or array) of numbers using a given dictionary.

For example,

mydict = {'hello': 0, 'world': 1, 'this': 2, 'is': 3, 'an': 4, 'example': 5}
word_list = ['hello', 'world']

def f(mydict, word_list):
    return [mydict[w] for w in word_list]

# f(mydict, word_list) == [0, 1]

I want to speed up the function f, especially when word_list is about 100 words long. Is it possible? Using external libraries such as nltk, spacy, numpy, etc. is OK.

Currently, it takes about 6 µs on my laptop:

>>> %timeit f(mydict, word_list*50)
6.74 µs ± 2.77 µs per loop (mean ± std. dev. of 7 runs, 100000 loops each)
ywat
  • Have you played around with the [sklearn CountVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html)? I'm pretty sure it does what you want and I would assume it's well optimized, but I have not tested (a rough sketch follows this comment thread). – it's-yer-boy-chet Aug 02 '18 at 00:06
  • If this works and you're looking for tips on optimizing, I suggest instead posting to [CodeReview](https://codereview.stackexchange.com/) – BruceWayne Aug 02 '18 at 00:32
  • @Bruce, speedup questions, especially for numpy, are perfectly fine here on SO. We answer those all the time. – hpaulj Aug 02 '18 at 01:25
  • `µs` times like that look normal. – hpaulj Aug 02 '18 at 01:28
  • Tbh I don't think you can go much further than a *list comprehension* plus `dict`'s O(1) lookup; an `itemgetter` variant of the same idea is sketched after these comments. – rafaelc Aug 02 '18 at 03:17
  • this answer, [Hash tables versus binary trees](https://cs.stackexchange.com/a/278), is interesting – xdze2 Aug 02 '18 at 19:23
  • I've heard that using a `set` of strings instead of a `list` of strings can be faster--i'm curious if this is the case for you. (https://stackoverflow.com/questions/8929284/what-makes-sets-faster-than-lists-in-python) – matt_07734 Aug 06 '18 at 18:35
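
Following up the CountVectorizer suggestion in the first comment, here is a minimal, untested sketch of that idea. The names corpus and vec are illustrative, and note that vocabulary_ is itself an ordinary Python dict, so the per-word lookup is the same O(1) operation as in f:

from sklearn.feature_extraction.text import CountVectorizer

# Fit a vocabulary on the raw text; CountVectorizer assigns
# indices alphabetically, so they will differ from mydict's values.
corpus = ['hello world this is an example']
vec = CountVectorizer().fit(corpus)
vocab = vec.vocabulary_  # a plain dict: token -> column index

print([vocab[w] for w in ['hello', 'world']])

Because the lookup step is unchanged, this mainly helps if you also want the bag-of-words counts that transform produces.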
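
On the "you can't beat a dict lookup" point, one micro-variant worth timing is operator.itemgetter from the standard library, which performs the same lookups without an explicit Python-level loop. This is a sketch, not a measured speedup:

from operator import itemgetter

mydict = {'hello': 0, 'world': 1, 'this': 2, 'is': 3, 'an': 4, 'example': 5}
word_list = ['hello', 'world'] * 50

# itemgetter with several keys returns a tuple of the looked-up values.
# Building the getter has a cost of its own, so this pays off mainly
# when the same word_list is reused across many calls.
getter = itemgetter(*word_list)
result = list(getter(mydict))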

1 Answer


There are multiple libraries that can convert a string or a list of tokens to a vector representation.

For example, with gensim:

>>> from gensim.corpora import Dictionary
>>> documents = [['hello', 'world'], ['NLP', 'is', 'awesome']]
>>> dictionary = Dictionary(documents)

# This is not necessary, but if you need to debug
# the words and their attached indices, you can do:

>>> {idx: dictionary[idx] for idx in dictionary}
{0: 'hello', 1: 'world', 2: 'NLP', 3: 'awesome', 4: 'is'}

# To get the indices of the words per document, e.g.
>>> dictionary.doc2idx('hello world'.split())
[0, 1]
>>> dictionary.doc2idx('hello world is awesome'.split())
[0, 1, 4, 3]
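
One caveat worth knowing (from gensim's documented doc2idx behavior, not shown in the answer above): tokens missing from the dictionary are mapped to -1 by default, and the unknown_word_index keyword controls that value:

>>> dictionary.doc2idx('hello unseen'.split())
[0, -1]
>>> dictionary.doc2idx('hello unseen'.split(), unknown_word_index=None)
[0, None]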
alvas
  • This code, using gensim, does not speed things up compared with the original code. On my laptop, `dictionary.doc2idx` takes 37 µs vs. 9 µs for the plain Python list comprehension (the function `f` above). – ywat Aug 03 '18 at 16:35