I have code that does the following:

  • Generate word vectors using the Brown corpus from NLTK.
  • Maintain two lists, one containing a few positive sentiment words (e.g. good, happy, nice) and the other negative sentiment words (e.g. bad, sad, unhappy).
  • Define a statement whose sentiment we wish to obtain.
  • Preprocess this statement (tokenize, lowercase, remove special characters, remove stopwords, lemmatize); a sketch of these helpers is given after this list.
  • Generate word vectors for all these words and store them in a list.
  • I have a test sentence of 7 words and I wish to determine its sentiment. First I define two lists:
  1. good_words=[good, excellent, happy]
  2. bad_words=[bad, terrible, sad]
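
The preprocessing and distance helpers called in the code below (convert_lowercase, remove_specialchar, removestopwords, lemmatize, euclidian_distance) are simple; a minimal sketch of what they could look like, assuming NLTK's English stopword list and WordNetLemmatizer (with the relevant NLTK data already downloaded), is:

import numpy as np
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

def convert_lowercase(tokens):
    # lowercase every token
    return [t.lower() for t in tokens]

def remove_specialchar(tokens):
    # keep purely alphabetic tokens, dropping punctuation/special characters
    return [t for t in tokens if t.isalpha()]

def removestopwords(tokens):
    # drop common English stopwords
    stops = set(stopwords.words('english'))
    return [t for t in tokens if t not in stops]

def lemmatize(tokens):
    # reduce each token to its lemma
    wnl = WordNetLemmatizer()
    return [wnl.lemmatize(t) for t in tokens]

def euclidian_distance(v1, v2):
    # Euclidean (L2) distance between two vectors
    return np.linalg.norm(np.asarray(v1) - np.asarray(v2))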

Now I run a loop taking i words at a time, where i ranges from 1 to the sentence length. For a particular i, there are several windows of i words that span the test sentence. For each window, I take the average of the word vectors in the window and compute the Euclidean distance between this windowed vector and each word in the two lists. For example, with i = 3 and the test sentence "food looks fresh healthy", there are two windows: "food looks fresh" and "looks fresh healthy". I take the mean of the word vectors in each window and compute the Euclidean distance to each word in good_words and bad_words, so corresponding to each word in both lists I get two values (one per window). I then take the mean of these two values for each word, and whichever word has the least distance lies closest to the test sentence.
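
To make the windowing concrete, here is a small illustrative sketch of that computation for the four-token example with i = 3 (it reuses the trained model b and the sentiment word lists defined in the code below, plus the euclidian_distance helper sketched above):

import numpy as np

tokens = ['food', 'looks', 'fresh', 'healthy']
win = 3  # window size (i)

# every contiguous window of length `win`
windows = [tokens[j:j + win] for j in range(len(tokens) - win + 1)]
# -> [['food', 'looks', 'fresh'], ['looks', 'fresh', 'healthy']]

# mean word vector per window
window_vecs = [np.mean([b[w] for w in window], axis=0) for window in windows]

# for each sentiment word, average its distance to all window vectors;
# the word with the smallest averaged distance is closest to the sentence
for word in pos_words + neg_words:
    dists = [euclidian_distance(vec, b[word]) for vec in window_vecs]
    print(word, np.mean(dists))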

I wish to show that window size (i) = 3 or 4 gives the highest accuracy in determining the sentiment of the test sentence, but I am having difficulty achieving this. Any leads on how I can produce these results would be highly appreciated.

Thanks in advance.

import multiprocessing
import numpy as np
import matplotlib.pyplot as plt
from gensim.models import Word2Vec
from nltk.corpus import brown
from nltk.tokenize import word_tokenize

# train Word2Vec on the Brown corpus (gensim 3.x parameter names)
b = Word2Vec(brown.sents(), window=5, min_count=5, negative=15, size=50, iter=10, workers=multiprocessing.cpu_count())

# reference sentiment word lists and their vectors
pos_words=['good','happy','nice','excellent','satisfied']
neg_words=['bad','sad','unhappy','disgusted','afraid','fearful','angry']
pos_vec=[b[word] for word in pos_words]
neg_vec=[b[word] for word in neg_words]


test="Sound quality on both end is excellent."
tokenized_word= word_tokenize(test)
lower_tokens= convert_lowercase(tokenized_word)
alpha_tokens= remove_specialchar(lower_tokens)
rem_tokens= removestopwords(alpha_tokens)
lemma_tokens= lemmatize(rem_tokens)
word_vec=[b[word] for word in lemma_tokens]

# try every window length from 1 word up to the full sentence length
# (the loop index i gives a window of i+1 words)
for i in range(0,len(lemma_tokens)):
    win_size = i + 1
    # mean vector of every contiguous window of win_size words
    windowed_vec=[]
    for j in range(0,len(lemma_tokens)-i):
        windowed_vec.append(np.mean([word_vec[j+k] for k in range(0,i+1)],axis=0))
    # distances of every window vector to every positive/negative word vector
    gen_pos_arr=[]
    gen_neg_arr=[]
    for p in range(0,len(pos_vec)):
        gen_pos_arr.append([euclidian_distance(vec,pos_vec[p]) for vec in windowed_vec])
    for q in range(0,len(neg_vec)):
        gen_neg_arr.append([euclidian_distance(vec,neg_vec[q]) for vec in windowed_vec])
    # average the per-window distances for each sentiment word
    gen_pos_arr_mean=[np.mean(x) for x in gen_pos_arr]
    gen_neg_arr_mean=[np.mean(x) for x in gen_neg_arr]
    # the sentiment word with the smallest averaged distance overall
    min_value=np.min([np.min(gen_pos_arr_mean),np.min(gen_neg_arr_mean)])
    print('min value:',min_value)
    if min_value in gen_pos_arr_mean:
        print('pos',gen_pos_arr_mean)
        plt.scatter(win_size,min_value,color='blue')
        plt.text(win_size,min_value,pos_words[gen_pos_arr_mean.index(min_value)])
    else:
        print('neg',gen_neg_arr_mean)
        plt.scatter(win_size,min_value,color='red')
        plt.text(win_size,min_value,neg_words[gen_neg_arr_mean.index(min_value)])
print(test)
plt.title('closest sentiment word vs. window size')
plt.xlabel('window size')
plt.ylabel('avg of distances of windows from sentiment words')
plt.show()
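
For reference, the same per-window-size computation can be condensed into a small helper that returns, for each actual window length, the sentiment word with the smallest averaged distance (a sketch only, reusing b, word_vec, pos_words, neg_words and euclidian_distance from above):

def closest_word_per_window_size(word_vecs, sentiment_words, model):
    # for every window length w (1 .. sentence length), find the sentiment
    # word whose averaged distance to all windows of that length is smallest
    result = {}
    n = len(word_vecs)
    for w in range(1, n + 1):
        window_means = [np.mean(word_vecs[j:j + w], axis=0)
                        for j in range(n - w + 1)]
        avg_dist = {word: np.mean([euclidian_distance(vec, model[word])
                                   for vec in window_means])
                    for word in sentiment_words}
        result[w] = min(avg_dist, key=avg_dist.get)
    return result

print(closest_word_per_window_size(word_vec, pos_words + neg_words, b))
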
  • Welcome to Stackoverflow, could you describe in more detail, what is the question? – alvas Jun 29 '20 at 03:13
  • Also, avoid looping through the data/text multiple times, see https://stackoverflow.com/questions/47769818/why-is-my-nltk-function-slow-when-processing-the-dataframe – alvas Jun 29 '20 at 03:14
  • @alvas I have edited the question. Please have a look at it. I would highly appreciate any help as I can't seem to move forward with the problem. – Shreyas Jul 01 '20 at 19:01
