
I have a .txt file with multiple lines, each containing a sentence; let's say that file is called sentences.txt. I also have a dictionary of pre-defined sentiment scores for about 2,500 words; call it sentiment_scores. My goal is to return a dictionary that predicts a sentiment value for each word that is not in sentiment_scores. I am doing this by taking the average score of every sentence the word appears in.

with open('sentences.txt', 'r') as f:
    sentences = [line.strip() for line in f] #'with' closes the file automatically

new_term_sent = {} #Maps each unknown word to its predicted sentiment

for line in sentences:
    for word in line.split(): #This will iterate through words in the sentence
        if word not in sentiment_scores:
            new_term_sent[word] = 0 #Assign word a sentiment value of 0 initially

for key in new_term_sent:

    score = 0
    num_sentences = 0
    for sentence in sentences:
        if key in sentence.split():
            num_sentences+=1
            val = get_sentiment(sentence) #This function returns the sentiment of a sentence
            score+=val
    if num_sentences != 0:
        average = round((score)/(num_sentences),1)
        new_term_sent[key] = average


return new_term_sent

Please note: this method works, but it is too slow; it takes about 80 seconds to run on my laptop.

My question is therefore: how can I do this more efficiently? I have tried just using .readlines() on sentences.txt, but that did not work (I can't figure out why, but I know it has to do with iterating through the text file multiple times; maybe a file pointer is being exhausted somehow). Thank you in advance!

bruno desthuilliers
JaYeFFKaY
  • Why don't you use machine learning for that? You can find an approach [here](https://www.kaggle.com/ngyptr/python-nltk-sentiment-analysis) – Jeril Feb 01 '19 at 12:00
  • @Jeril that is a wonderful approach, but this is for an academic project and my professor stated not to use any machine learning methods. Thank you. – JaYeFFKaY Feb 01 '19 at 12:02
  • @Jeril How about using async/await for concurrency? – Rambarun Komaljeet Feb 01 '19 at 12:06
  • This is a Python loop-level program; unfortunately, the time you'd save on this code is insignificant. – Lucas Hort Feb 01 '19 at 12:09
  • @RambarunKomaljeet For concurrency I use concurrent.futures. It's quite easy. Can you check [here](https://stackoverflow.com/a/45213153/2825570), if it helps – Jeril Feb 01 '19 at 12:23
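As a rough sketch of the concurrent.futures suggestion from the comments: each sentence can be scored in a separate worker process, which helps when the scoring itself is CPU-bound. The get_sentiment below is a toy stand-in for the question's real scorer, not its actual implementation.

```python
from concurrent.futures import ProcessPoolExecutor

def get_sentiment(sentence):
    # Toy stand-in for the question's real get_sentiment().
    return float(len(sentence.split()))

def score_all(sentences):
    # Score sentences in parallel; each call runs in a worker process,
    # which sidesteps the GIL for CPU-bound scoring work.
    with ProcessPoolExecutor() as executor:
        return list(executor.map(get_sentiment, sentences))

if __name__ == "__main__":
    print(score_all(["good day", "a very bad day"]))
```

Whether this pays off depends on how expensive one get_sentiment call is; for very cheap scorers, the process start-up and pickling overhead can dominate.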

1 Answer


Aside from concurrency, which may be rather complex, you can optimize your loops. If all the words in a sentence are unique and a sentence has M words on average, the current code calls get_sentiment on the same sentence up to M times.

Instead of putting all the individual words into new_term_sent and initializing their values to zero, let each word map to an empty list. Then you can compute the sentiment of each sentence exactly once and append that value to every word that appears in it.

from collections import defaultdict

word_to_scores = defaultdict(list)
for sentence in sentences:
    sentence_sentiment = get_sentiment(sentence) #Score each sentence only once
    for word in set(sentence.split()): #set(): count a sentence once per word
        if word not in sentiment_scores: #Only predict for unknown words
            word_to_scores[word].append(sentence_sentiment)

new_term_sent = {}
for word, sentence_sentiments in word_to_scores.items():
    new_term_sent[word] = round(sum(sentence_sentiments) / len(sentence_sentiments), 1)

P.S. Both the original code and this version assume that every line is a single sentence. I am not sure that assumption holds for you.

P.P.S. I don't think the check in the block below is ever needed. The loop only iterates over keys in the dictionary, and every key was put there because it appeared in some sentence, so num_sentences is always >= 1.

if num_sentences != 0:
    average = round((score)/(num_sentences),1)
    new_term_sent[key] = average
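Putting the pieces together, here is a minimal runnable sketch of this answer's approach; the tiny sentiment_scores dict and the get_sentiment implementation are toy stand-ins for the question's real ones.

```python
from collections import defaultdict

# Toy stand-ins; the real sentiment_scores and get_sentiment() come from the question.
sentiment_scores = {"good": 1.0, "bad": -1.0}

def get_sentiment(sentence):
    # Hypothetical scorer: average of the known word scores in the sentence.
    words = sentence.split()
    return sum(sentiment_scores.get(w, 0.0) for w in words) / len(words)

sentences = ["good movie", "bad movie", "good good day"]

word_to_scores = defaultdict(list)
for sentence in sentences:
    s = get_sentiment(sentence)          # each sentence scored exactly once
    for word in set(sentence.split()):   # set(): one entry per word per sentence
        if word not in sentiment_scores: # only predict for unknown words
            word_to_scores[word].append(s)

new_term_sent = {w: round(sum(v) / len(v), 1) for w, v in word_to_scores.items()}
print(new_term_sent)  # {'movie': 0.0, 'day': 0.7}
```

Each sentence is now scored once instead of once per unknown word it contains, which is where the original version spent most of its 80 seconds.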
Yarnspinner