I am using the Sentiment140 Twitter dataset for sentiment analysis.

Code:

Getting the words from the tweets:

tweet_tokens = []
[tweet_tokens.append(dev.get_tweet_tokens(idx)) for idx, item in enumerate(dev)]

Getting the unknown words from the tokens:

words_without_embs = []
[[words_without_embs.append(w) for w in tweet if w not in word2vec] for tweet in tweet_tokens]
len(words_without_embs)

The last part of the code calculates each vector as the mean of the left and right words (the context):

vectors = {} # alg
for word in words_without_embs:
  mean_vectors = []
  for tweet in tweet_tokens:
    if word in tweet:
      idx = tweet.index(word)
      try:
        mean_vector = np.mean([word2vec.get_vector(tweet[idx-1]), word2vec.get_vector(tweet[idx+1])], axis=0)
        mean_vectors.append(mean_vector)
      except:
        pass

    if tweet == tweet_tokens[-1]: # last iteration
      mean_vector_all_tweets = np.mean(mean_vectors, axis=0)
      vectors[word] = mean_vector_all_tweets

There are 1,058,532 of these words, and the last part of the code runs very slowly, at about 250 words per minute.

How can I improve the speed of this algorithm?

Dmitry Sokolov
  • Have you tried to profile your code? – Daweo Oct 29 '21 at 14:21
  • Unrelated to performance, but it's considered non-Pythonic to use a list comprehension for its side effects, i.e. `[tweet_tokens.append(dev.get_tweet_tokens(idx)) for...`; see [Is it Pythonic to use list comprehensions for just side effects?](https://stackoverflow.com/questions/5753597/is-it-pythonic-to-use-list-comprehensions-for-just-side-effects) – DarrylG Oct 29 '21 at 14:32

3 Answers

One of the main reasons your code is slow is that it checks the existence of each of the nearly 1 million words against every tweet in tweet_tokens. Hence, the time complexity of your implementation is on the order of 1e6 * |tweet_tokens|.

1) First Improvement (Reducing Search and Comparisons)

However, you can do much better by tokenizing each tweet first and then finding the index of the word. If you build a dictionary over the existing words, you can find the index of a word token with at most ~25 comparisons in the word dictionary (a binary search over a sorted vocabulary of 1e6 words needs about log2(1e6) ≈ 20). Hence, in that case, the time complexity will be at most about 25 * |tweet_tokens|. Therefore, you can make your code roughly 1e6/25 = 40,000 times faster!
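
A minimal sketch of this first improvement, reusing tweet_tokens, word2vec and words_without_embs from the question (the occurrences index and the variable names are illustrative, not part of the original code):

from collections import defaultdict
import numpy as np

# One pass over the tweets: record, for every unknown word, the (tweet, position)
# pairs where it occurs. After this, no per-word scan over tweet_tokens is needed.
unknown = set(words_without_embs)        # O(1) membership tests
occurrences = defaultdict(list)          # word -> [(tweet_idx, token_idx), ...]

for t_idx, tweet in enumerate(tweet_tokens):
    for w_idx, w in enumerate(tweet):
        if w in unknown:
            occurrences[w].append((t_idx, w_idx))

vectors = {}
for word, positions in occurrences.items():
    mean_vectors = []
    for t_idx, w_idx in positions:
        tweet = tweet_tokens[t_idx]
        try:
            # mean of the left- and right-neighbour embeddings, as in the question
            mean_vectors.append(np.mean(
                [word2vec.get_vector(tweet[w_idx - 1]),
                 word2vec.get_vector(tweet[w_idx + 1])], axis=0))
        except (KeyError, IndexError):
            pass                         # neighbour not in the vocabulary or out of range
    if mean_vectors:
        vectors[word] = np.mean(mean_vectors, axis=0)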

2) Second Improvement (Reducing Word2Vec Computations)

Moreover, you keep computing the vector of the same word again across different tweets. Hence, the vector of each word is computed f times, where f is the frequency of that word in the tweets. A reasonable solution is to compute the vectors of all words in words_without_embs once (this can be an offline process), store them indexed by each word's position in the word dictionary (or in any structure that lets you find them quickly from the word itself), and then simply read them from that prepared structure during the averaging computation. In that case, on top of the 40,000x improvement above, you gain a further factor roughly equal to the sum of the words' frequencies in the tweets.
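
A rough sketch of that caching step, assuming word2vec is a gensim KeyedVectors (as the get_vector calls in the question suggest); its key_to_index dict and vectors matrix already give exactly the word-to-row dictionary and embedding table described above, so the averaging loop only performs dictionary lookups and array reads:

import numpy as np

# Build the lookup structures once (this can be an offline step and then reused).
word_index = dict(word2vec.key_to_index)     # word -> row in the embedding matrix
emb_matrix = np.asarray(word2vec.vectors)    # all known embeddings as one dense array

def neighbour_mean(tweet, pos):
    # Mean of the cached embeddings of the tokens to the left and right of `pos`,
    # or None if neither neighbour is in the vocabulary.
    rows = [word_index[tweet[j]]
            for j in (pos - 1, pos + 1)
            if 0 <= j < len(tweet) and tweet[j] in word_index]
    return emb_matrix[rows].mean(axis=0) if rows else None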

OmG

This looks like it would parallelize well with some minor edits.

  • Can you move that last if tweet == tweet_tokens[-1]: block up a level and remove the "if" statement for it? That by itself would only give a minor speed increase, but with it in place, you could parallelize the code more effectively.
  • Consider making the inner for loop its own function and calling it via a multithreaded setup. (See ThreadPoolExecutor and thread_function() at this site for a description and examples.) You could then have a separate thread for each processor on a single machine – or for each VM in a distributed environment. This should allow more effective scaling; a rough sketch follows after this list.
  • Building on @DarrylG's comment above, restructure those list comprehensions to avoid calling .append() on an existing list. The .append() version is slower than an equivalent list comprehension. For example, change [tweet_tokens.append(dev.get_tweet_tokens(idx)) for idx, item in enumerate(dev)] to tweet_tokens = [dev.get_tweet_tokens(idx) for idx, item in enumerate(dev)].
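
A rough sketch that combines the three points above, reusing dev, word2vec and words_without_embs from the question (max_workers is illustrative; for purely CPU-bound work a ProcessPoolExecutor may scale better than threads):

from concurrent.futures import ThreadPoolExecutor
import numpy as np

# List comprehension instead of appending inside a comprehension (per @DarrylG):
tweet_tokens = [dev.get_tweet_tokens(idx) for idx, item in enumerate(dev)]

def context_vector(word):
    # The inner loop from the question as its own function: average the
    # left/right-neighbour embeddings over every tweet containing `word`.
    mean_vectors = []
    for tweet in tweet_tokens:
        if word in tweet:
            idx = tweet.index(word)
            try:
                mean_vectors.append(np.mean(
                    [word2vec.get_vector(tweet[idx - 1]),
                     word2vec.get_vector(tweet[idx + 1])], axis=0))
            except (KeyError, IndexError):
                pass
    return word, (np.mean(mean_vectors, axis=0) if mean_vectors else None)

# One worker per core is a reasonable starting point; tune max_workers.
with ThreadPoolExecutor(max_workers=8) as pool:
    vectors = dict(pool.map(context_vector, words_without_embs))
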
Sarah Messer

More-common (& probably better) strategies for dealing with unknown words include:

  • training/using a model, like FastText, that can offer guess-vectors for out-of-vocabulary (OOV) words (a short sketch follows after this list)
  • acquiring more training data, so vectors for more unknown words can be learned from real usages
  • ignoring unknown words entirely
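
For the first option, a small sketch of what this could look like with gensim's FastText (the hyperparameters are illustrative, and tweet_tokens is the tokenized corpus from the question):

from gensim.models import FastText

# Train FastText on the tokenized tweets (gensim 4.x API).
# Because FastText builds word vectors from character n-grams, it can synthesize
# a guess-vector even for a typo or other word never seen in training.
ft = FastText(sentences=tweet_tokens, vector_size=100, window=5, min_count=5, epochs=5)

oov_vec = ft.wv["grreat"]               # OOV word: vector synthesized from its n-grams
print("grreat" in ft.wv.key_to_index)   # False: not part of the learned vocabulary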

It seems you've decided to instead synthesize new vectors for OOV words, by averaging all of their immediate neighbors. I don't think this would work especially well. In many kinds of downstream uses of the word-vectors, it just tends to overweight the word's in-context neighbors – which can also be very simply/cheaply achieved by just ignoring the unknown word entirely.

But given what you want to do, the best approach would be to collect the neighboring words during the same pass that identifies the words_without_embs.

For example, make words_without_embs a dict (or perhaps a DefaultDict), where each key is a word that will need a vector, and each value is a list of all the neighboring words you've found so far.

Then, one loop over the tweet_tokens would both fill the words_without_embs with keys that are the words needing vectors, while stuffing those values with all the neighboring-words seen so far.

Then, one last loop over the words_without_embs keys would simply grab the existing lists of neighbor-words for the averaging. (No more multiple passes over tweet_tokens.)
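
A minimal sketch of that two-pass approach, reusing tweet_tokens and word2vec from the question (the variable names are illustrative):

from collections import defaultdict
import numpy as np

# Pass 1, over the tweets: for every token without an embedding, remember the
# neighbouring words seen around it.
neighbours = defaultdict(list)           # word -> list of neighbouring words
for tweet in tweet_tokens:
    for i, w in enumerate(tweet):
        if w not in word2vec:
            if i > 0:
                neighbours[w].append(tweet[i - 1])
            if i < len(tweet) - 1:
                neighbours[w].append(tweet[i + 1])

# Pass 2, over the collected neighbours only: average whatever neighbour vectors
# exist. No further passes over tweet_tokens are needed.
vectors = {}
for word, context in neighbours.items():
    context_vecs = [word2vec.get_vector(c) for c in context if c in word2vec]
    if context_vecs:
        vectors[word] = np.mean(context_vecs, axis=0)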

But again: all this work might not outperform the baseline practice of simply dropping unknown words.

gojomo
  • Thanks, what about using Levenshtein distance to find a similar word? Most unknown words are typos. – Dmitry Sokolov Oct 30 '21 at 18:56
  • Might be worth it – certainly such 'edit distance' measures are used in various kinds of spell- or OCR-correcting algorithms. You could try a pass over the data that takes ultra-rare or otherwise unknown words and converts them to the edit-distance-closest word that's known or above some frequency threshold. But that's the kind of ad-hoc measure that might be more complexity than it's worth – so you should check quantitatively whether it's hurting or helping (a rough sketch follows below). – gojomo Oct 30 '21 at 19:01
  • Note that w/ enough data, common typos appear often enough that a model can learn reasonable contextual word-vectors for them, which are close to the 'intended' word. And in a model like FastText, because typos share character n-grams (fragments) w/ their 'real' spellings, they get better-than-guesswork synthesized vectors. You could even take a non-FastText word-vector set & try such edit-distance-based blending of similar-ish known vectors – but that could be a lot of per-word calculations for negligible benefit. (The same *coding time* or *training time* allocated elsewhere might be better.) – gojomo Oct 30 '21 at 19:06
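
A rough, illustrative sketch of the edit-distance idea from these comments, using difflib from the standard library as a stand-in for a true Levenshtein distance (known_vocab, the cutoff, and the brute-force matching are all assumptions, and, as noted above, this pass is itself expensive over a large vocabulary):

import difflib

# Assumed: the model's vocabulary as a list (e.g. gensim's word2vec.key_to_index).
known_vocab = list(word2vec.key_to_index)

def closest_known(word, cutoff=0.8):
    # Best near-match by difflib's similarity ratio, or None if nothing is close enough.
    matches = difflib.get_close_matches(word, known_vocab, n=1, cutoff=cutoff)
    return matches[0] if matches else None

# Map each unknown (often misspelled) word to a known word, when one is close enough.
corrections = {w: closest_known(w) for w in words_without_embs}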