I have code that does the following:
- Generate word vectors using the Brown corpus from NLTK
- Maintain 2 lists, one with a few positive sentiment words (e.g. good, happy, nice) and the other with negative sentiment words (e.g. bad, sad, unhappy)
- Define a statement whose sentiment we wish to obtain.
- Perform preprocessing on this statement (tokenize, lowercase, remove special characters, remove stopwords, lemmatize the words); a sketch of these helpers is shown after this list.
- Generate word vectors for all these words and store them in a list
- I have a test sentence of 7 words and I wish to determine its sentiment. First I define two lists:
- good_words = ['good', 'excellent', 'happy']
- bad_words = ['bad', 'terrible', 'sad']
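For completeness, the preprocessing helpers used in the code below look roughly like this (a minimal sketch using NLTK's stopword list and WordNetLemmatizer; my exact implementations may differ in detail):

from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

def convert_lowercase(tokens):
    # lowercase every token
    return [t.lower() for t in tokens]

def remove_specialchar(tokens):
    # keep only purely alphabetic tokens (drops punctuation and numbers)
    return [t for t in tokens if t.isalpha()]

def removestopwords(tokens):
    # drop common English stopwords
    stop = set(stopwords.words('english'))
    return [t for t in tokens if t not in stop]

def lemmatize(tokens):
    # reduce each token to its base form (default noun lemmatization)
    lemmatizer = WordNetLemmatizer()
    return [lemmatizer.lemmatize(t) for t in tokens]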
Now I run a loop taking i words at a time, where i ranges from 1 to the sentence length. For a particular i, I have a few windows of words that span the test sentence. For each window, I take the average of the word vectors in the window and compute the Euclidean distance between this windowed vector and each word in the 2 lists. For example, with i = 3 and the test sentence "food looks fresh healthy", I will have 2 windows: "food looks fresh" and "looks fresh healthy". I take the mean of the vectors of the words in each window and compute the Euclidean distance to the good_words and bad_words. So corresponding to each word in both lists I will have 2 values (one per window). I then take the mean of these 2 values for each word in the lists, and whichever word has the smallest distance lies closest to the test sentence.
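To make this step concrete, here is a small standalone sketch of the windowing and distance computation (the toy_vectors dict and random vectors are just placeholders for the Word2Vec vectors; euclidian_distance and window_means are written out for illustration):

import numpy as np

def euclidian_distance(v1, v2):
    # Euclidean (L2) distance between two vectors
    return np.linalg.norm(np.asarray(v1) - np.asarray(v2))

def window_means(word_vecs, window_size):
    # mean vector of every contiguous window of window_size words
    return [np.mean(word_vecs[j:j + window_size], axis=0)
            for j in range(len(word_vecs) - window_size + 1)]

# toy example: "food looks fresh healthy" with window size 3
# gives the two windows [food looks fresh] and [looks fresh healthy]
sentence = ['food', 'looks', 'fresh', 'healthy']
rng = np.random.default_rng(0)
toy_vectors = {w: rng.normal(size=50) for w in sentence}   # stand-in for b[word]
vecs = [toy_vectors[w] for w in sentence]
windows = window_means(vecs, 3)

# distance of each window mean to one sentiment word vector,
# averaged over the windows -> one score for that sentiment word
good_vec = rng.normal(size=50)                              # stand-in for b['good']
score = np.mean([euclidian_distance(w, good_vec) for w in windows])
print(len(windows), score)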
I wish to show that a window size (i) of 3 or 4 gives the highest accuracy in determining the sentiment of the test sentence, but I am facing difficulty achieving it. Any leads on how I can produce my results would be highly appreciated.
Thanks in advance.
import multiprocessing

import numpy as np
import matplotlib.pyplot as plt
from gensim.models import Word2Vec
from nltk.corpus import brown
from nltk.tokenize import word_tokenize

# train word vectors on the Brown corpus
# (gensim 3.x API: in gensim 4, size/iter become vector_size/epochs and b[word] becomes b.wv[word])
b = Word2Vec(brown.sents(), window=5, min_count=5, negative=15, size=50, iter=10,
             workers=multiprocessing.cpu_count())

pos_words = ['good', 'happy', 'nice', 'excellent', 'satisfied']
neg_words = ['bad', 'sad', 'unhappy', 'disgusted', 'afraid', 'fearful', 'angry']
pos_vec = [b[word] for word in pos_words]
neg_vec = [b[word] for word in neg_words]

test = "Sound quality on both end is excellent."
tokenized_word = word_tokenize(test)
lower_tokens = convert_lowercase(tokenized_word)
alpha_tokens = remove_specialchar(lower_tokens)
rem_tokens = removestopwords(alpha_tokens)
lemma_tokens = lemmatize(rem_tokens)
# note: b[word] raises KeyError for words that did not survive min_count=5 in Brown
word_vec = [b[word] for word in lemma_tokens]

# i + 1 is the window size, from 1 word up to the whole sentence
for i in range(0, len(lemma_tokens)):
    # mean vector of every contiguous window of i + 1 words
    windowed_vec = []
    for j in range(0, len(lemma_tokens) - i):
        windowed_vec.append(np.mean([word_vec[j + k] for k in range(0, i + 1)], axis=0))

    # distance of every window mean to every positive / negative sentiment word
    gen_pos_arr = []
    gen_neg_arr = []
    for p in range(0, len(pos_vec)):
        gen_pos_arr.append([euclidian_distance(vec, pos_vec[p]) for vec in windowed_vec])
    for q in range(0, len(neg_vec)):
        gen_neg_arr.append([euclidian_distance(vec, neg_vec[q]) for vec in windowed_vec])

    # average over the windows -> one mean distance per sentiment word
    gen_pos_arr_mean = [np.mean(x) for x in gen_pos_arr]
    gen_neg_arr_mean = [np.mean(x) for x in gen_neg_arr]

    # the sentiment word closest to the sentence at this window size
    min_value = np.min([np.min(gen_pos_arr_mean), np.min(gen_neg_arr_mean)])
    print('min value:', min_value)
    if min_value in gen_pos_arr_mean:
        print('pos', gen_pos_arr_mean)
        plt.scatter(i, min_value, color='blue')
        plt.text(i, min_value, pos_words[gen_pos_arr_mean.index(min_value)])
    else:
        print('neg', gen_neg_arr_mean)
        plt.scatter(i, min_value, color='red')
        plt.text(i, min_value, neg_words[gen_neg_arr_mean.index(min_value)])
print(test)
plt.title('')
plt.xlabel('window size')
plt.ylabel('avg of distances of windows from sentiment words')
plt.show()