I have code that does the following:
- Generate word vectors using the Brown corpus from NLTK
- Maintain 2 lists, one with a few positive sentiment words (e.g. good, happy, nice) and the other with negative sentiment words (e.g. bad, sad, unhappy)
- Define a statement whose sentiment we wish to obtain.
- Perform preprocessing on this statement (tokenize, lowercase, remove special characters, remove stopwords, lemmatize the words); a sketch of these helpers is shown after this list.
- Generate word vectors for all these words and store them in a list
- I have a test sentence of 7 words and I wish to determine its sentiment. First I define two lists:
- good_words = ['good', 'excellent', 'happy']
- bad_words = ['bad', 'terrible', 'sad']
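For completeness, the preprocessing helpers used in the code below look roughly like this (a minimal sketch using NLTK's stopword list and WordNetLemmatizer; my exact implementations may differ in detail):

from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

def convert_lowercase(tokens):
    # lowercase every token
    return [t.lower() for t in tokens]

def remove_specialchar(tokens):
    # keep only purely alphabetic tokens (drops punctuation and numbers)
    return [t for t in tokens if t.isalpha()]

def removestopwords(tokens):
    # drop common English stopwords
    stop = set(stopwords.words('english'))
    return [t for t in tokens if t not in stop]

def lemmatize(tokens):
    # reduce each token to its base form (default noun lemmatization)
    lemmatizer = WordNetLemmatizer()
    return [lemmatizer.lemmatize(t) for t in tokens]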
Now I run a loop taking i words at a time, where i ranges from 1 to the sentence length. For a particular i, I have a few windows of words that span the test sentence. For each window, I take the average of the word vectors in the window and compute the Euclidean distance between this windowed vector and each word in the 2 lists. For example, with i = 3 and the test sentence "food looks fresh healthy", I will have 2 windows: "food looks fresh" and "looks fresh healthy". I take the mean of the vectors of the words in each window and compute the Euclidean distance to the good_words and bad_words. So corresponding to each word in both lists I will have 2 values (one per window). I then take the mean of these 2 values for each word in the lists, and whichever word has the smallest distance lies closest to the test sentence.
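To make this step concrete, here is a small standalone sketch of the windowing and distance computation (the toy_vectors dict and random vectors are just placeholders for the Word2Vec vectors; euclidian_distance and window_means are written out for illustration):

import numpy as np

def euclidian_distance(v1, v2):
    # Euclidean (L2) distance between two vectors
    return np.linalg.norm(np.asarray(v1) - np.asarray(v2))

def window_means(word_vecs, window_size):
    # mean vector of every contiguous window of window_size words
    return [np.mean(word_vecs[j:j + window_size], axis=0)
            for j in range(len(word_vecs) - window_size + 1)]

# toy example: "food looks fresh healthy" with window size 3
# gives the two windows [food looks fresh] and [looks fresh healthy]
sentence = ['food', 'looks', 'fresh', 'healthy']
rng = np.random.default_rng(0)
toy_vectors = {w: rng.normal(size=50) for w in sentence}   # stand-in for b[word]
vecs = [toy_vectors[w] for w in sentence]
windows = window_means(vecs, 3)

# distance of each window mean to one sentiment word vector,
# averaged over the windows -> one score for that sentiment word
good_vec = rng.normal(size=50)                              # stand-in for b['good']
score = np.mean([euclidian_distance(w, good_vec) for w in windows])
print(len(windows), score)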
I wish to show that a window size (i) of 3 or 4 gives the highest accuracy in determining the sentiment of the test sentence, but I am facing difficulty achieving it. Any leads on how I can produce my results would be highly appreciated.
Thanks in advance.
import multiprocessing

import numpy as np
import matplotlib.pyplot as plt
from gensim.models import Word2Vec
from nltk.corpus import brown
from nltk.tokenize import word_tokenize

# train word vectors on the Brown corpus
# (gensim 3.x API: in gensim 4, size/iter become vector_size/epochs and b[word] becomes b.wv[word])
b = Word2Vec(brown.sents(), window=5, min_count=5, negative=15, size=50, iter=10,
             workers=multiprocessing.cpu_count())

pos_words = ['good', 'happy', 'nice', 'excellent', 'satisfied']
neg_words = ['bad', 'sad', 'unhappy', 'disgusted', 'afraid', 'fearful', 'angry']
pos_vec = [b[word] for word in pos_words]
neg_vec = [b[word] for word in neg_words]

test = "Sound quality on both end is excellent."
tokenized_word = word_tokenize(test)
lower_tokens = convert_lowercase(tokenized_word)
alpha_tokens = remove_specialchar(lower_tokens)
rem_tokens = removestopwords(alpha_tokens)
lemma_tokens = lemmatize(rem_tokens)
# note: b[word] raises KeyError for words that did not survive min_count=5 in Brown
word_vec = [b[word] for word in lemma_tokens]

# i + 1 is the window size, from 1 word up to the whole sentence
for i in range(0, len(lemma_tokens)):
    # mean vector of every contiguous window of i + 1 words
    windowed_vec = []
    for j in range(0, len(lemma_tokens) - i):
        windowed_vec.append(np.mean([word_vec[j + k] for k in range(0, i + 1)], axis=0))

    # distance of every window mean to every positive / negative sentiment word
    gen_pos_arr = []
    gen_neg_arr = []
    for p in range(0, len(pos_vec)):
        gen_pos_arr.append([euclidian_distance(vec, pos_vec[p]) for vec in windowed_vec])
    for q in range(0, len(neg_vec)):
        gen_neg_arr.append([euclidian_distance(vec, neg_vec[q]) for vec in windowed_vec])

    # average over the windows -> one mean distance per sentiment word
    gen_pos_arr_mean = [np.mean(x) for x in gen_pos_arr]
    gen_neg_arr_mean = [np.mean(x) for x in gen_neg_arr]

    # the sentiment word closest to the sentence at this window size
    min_value = np.min([np.min(gen_pos_arr_mean), np.min(gen_neg_arr_mean)])
    print('min value:', min_value)
    if min_value in gen_pos_arr_mean:
        print('pos', gen_pos_arr_mean)
        plt.scatter(i, min_value, color='blue')
        plt.text(i, min_value, pos_words[gen_pos_arr_mean.index(min_value)])
    else:
        print('neg', gen_neg_arr_mean)
        plt.scatter(i, min_value, color='red')
        plt.text(i, min_value, neg_words[gen_neg_arr_mean.index(min_value)])
print(test)
plt.title('')
plt.xlabel('window size')
plt.ylabel('avg of distances of windows from sentiment words')
plt.show()