I have a very large data frame full of song lyrics. I've tokenized the lyrics column so that each row holds a list of tokens, e.g. ["You", "say", "goodbye", "and", "I", "say", "hello"], and so on. I wrote a function that calculates a sentiment score using a list of positive words and a list of negative words. I need to apply this function to the lyrics column to calculate positive sentiment, negative sentiment, and net sentiment, and store them as new columns.
I tried splitting my data frame into a list of chunks of 1000 rows and looping over them to apply the function, but it still takes a fairly long time. Is there a more efficient way to do this, or is this as good as it gets and I just have to wait it out?
def sentiment_scorer(row):
    pos = neg = 0
    for item in row['lyrics']:
        # count positive words
        if item in positiv:
            pos += 1
        # count negative words
        elif item in negativ:
            neg += 1
        # ignore words that are neither negative nor positive
        else:
            pass
    # set sentiment to 0 if pos is 0
    if pos < 1:
        pos_sent = 0
    else:
        pos_sent = pos / len(row['lyrics'])
    # set sentiment to 0 if neg is 0
    if neg < 1:
        neg_sent = 0
    else:
        neg_sent = neg / len(row['lyrics'])
    # return positive and negative sentiment to make new columns
    return pos_sent, neg_sent
# chunk data frames
n = 1000
list_df = [lyrics_cleaned_df[i:i+n] for i in range(0, lyrics_cleaned_df.shape[0], n)]
for lr in range(len(list_df)):
    # credit for method: toto_tico on Stack Overflow https://stackoverflow.com/a/46197147
    list_df[lr]['positive_sentiment'], list_df[lr]['negative_sentiment'] = zip(*list_df[lr].apply(sentiment_scorer, axis=1))
    list_df[lr]['net_sentiment'] = list_df[lr]['positive_sentiment'] - list_df[lr]['negative_sentiment']
ETA: sample data frame
import pandas as pd

data = [['ego-remix', 2009, 'beyonce-knowles', 'Pop', ['oh', 'baby', 'how']],
        ['then-tell-me', 2009, 'beyonce-knowles', 'Pop', ['playin', 'everything', 'so']],
        ['honesty', 2009, 'beyonce-knowles', 'Pop', ['if', 'you', 'search']]]
df = pd.DataFrame(data, columns=['song', 'year', 'artist', 'genre', 'lyrics'])
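To make the example reproducible, here is a minimal sketch of how the scorer runs on the sample frame. The positiv and negativ sets below are tiny placeholders made up for illustration (an assumption, not the real word lists), and the function is a condensed version of the one above with the same logic.

```python
import pandas as pd

# Placeholder word sets, made up for illustration only (not the real lists)
positiv = {"baby"}
negativ = {"search"}

def sentiment_scorer(row):
    tokens = row["lyrics"]
    pos = neg = 0
    for item in tokens:
        if item in positiv:      # count positive words
            pos += 1
        elif item in negativ:    # count negative words
            neg += 1
    # a zero count gives a score of 0, which also avoids dividing by zero
    pos_sent = pos / len(tokens) if pos else 0
    neg_sent = neg / len(tokens) if neg else 0
    return pos_sent, neg_sent

data = [["ego-remix", 2009, "beyonce-knowles", "Pop", ["oh", "baby", "how"]],
        ["then-tell-me", 2009, "beyonce-knowles", "Pop", ["playin", "everything", "so"]],
        ["honesty", 2009, "beyonce-knowles", "Pop", ["if", "you", "search"]]]
df = pd.DataFrame(data, columns=["song", "year", "artist", "genre", "lyrics"])

# Same row-wise apply as in the chunked loop, here on the whole sample frame
df["positive_sentiment"], df["negative_sentiment"] = zip(*df.apply(sentiment_scorer, axis=1))
df["net_sentiment"] = df["positive_sentiment"] - df["negative_sentiment"]
print(df[["song", "positive_sentiment", "negative_sentiment", "net_sentiment"]])
```

On the real data the only difference is that this runs per chunk of 1000 rows instead of on the whole frame.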