I am using the Sentiment140 Twitter dataset for sentiment analysis.
Code:
getting words from tweets:
tweet_tokens = [dev.get_tweet_tokens(idx) for idx, item in enumerate(dev)]  # one token list per tweet
getting unknown (out-of-vocabulary) words from the tokens:
words_without_embs = [w for tweet in tweet_tokens for w in tweet if w not in word2vec]
len(words_without_embs)
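Note that words_without_embs keeps one entry per occurrence, so the same word can appear many times. The distinct OOV vocabulary is what the per-word loop below really needs (unique_oov is just a name I made up):
unique_oov = set(words_without_embs)  # distinct OOV words only
len(unique_oov)  # typically much smaller than len(words_without_embs)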
last part of the code, which calculates each missing word's vector as the mean of its left and right neighbours (its context):
import numpy as np

vectors = {}  # word -> imputed embedding
for word in words_without_embs:
    mean_vectors = []
    for tweet in tweet_tokens:
        if word in tweet:
            idx = tweet.index(word)  # first occurrence only
            try:
                # mean of the embeddings of the left and right neighbours
                # (note: when idx == 0, tweet[idx - 1] wraps around to the last token)
                mean_vector = np.mean([word2vec.get_vector(tweet[idx - 1]),
                                       word2vec.get_vector(tweet[idx + 1])], axis=0)
                mean_vectors.append(mean_vector)
            except (KeyError, IndexError):  # neighbour OOV, or word is the last token
                pass
    if mean_vectors:  # average the per-occurrence context means over all tweets
        vectors[word] = np.mean(mean_vectors, axis=0)
There are 1,058,532 words, and this last part of the code runs very slowly, at roughly 250 words per minute. How can I improve the speed of this algorithm?
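One direction I'm considering is inverting the loops so the corpus is scanned once instead of once per word. Here is a minimal sketch of that idea, assuming word2vec is a gensim-style KeyedVectors (supports `in` and .get_vector()) and averaging whichever neighbours are available instead of relying on exceptions:
from collections import defaultdict

import numpy as np

# Single pass over the corpus: O(total tokens) instead of O(#OOV words * #tweets).
oov = set(words_without_embs)  # deduplicate; each distinct word is handled once
sums = defaultdict(float)      # word -> running sum of context vectors
counts = defaultdict(int)      # word -> number of contexts seen

for tweet in tweet_tokens:
    for idx, w in enumerate(tweet):
        if w not in oov:
            continue
        # collect whichever neighbours exist and are in the vocabulary
        neighbours = [tweet[i] for i in (idx - 1, idx + 1)
                      if 0 <= i < len(tweet) and tweet[i] in word2vec]
        if neighbours:
            ctx = np.mean([word2vec.get_vector(n) for n in neighbours], axis=0)
            sums[w] = sums[w] + ctx  # float 0.0 + ndarray broadcasts to ndarray
            counts[w] += 1

vectors = {w: sums[w] / counts[w] for w in sums}
As a side effect, this also avoids the tweet[idx - 1] wraparound at position 0 and handles repeated occurrences of a word within one tweet, which tweet.index() misses. Is this the right way to go, or is there a faster approach?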