
I'm new to Python and trying to create a simple pandas dataframe from the for loop below. The loop (1) iterates through each chapter of the book (`chapters`) and tokenizes it by sentence, (2) gets the polarity scores for each sentence and adds them to a running total in a dictionary (`sentiments`), and (3) divides each total by the number of sentences to get the chapter average. The output is one dictionary of 4 scores per chapter.

I need to create a dataframe with 28 rows (1 per chapter) and 4 columns (1 per score in each dictionary). What's the simplest way to accomplish this?

from nltk import tokenize
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer 

chapters = [ainulindale,valaquenta,ch1,ch2,ch3,ch4,ch5,ch6,ch7,ch8,ch9,ch10,ch11,ch12,ch13,ch14,ch15,ch16,ch17,
            ch18,ch19,ch20,ch21,ch22,ch23,ch24,akallabeth,rings]

analyzer = SentimentIntensityAnalyzer()

for chapter in chapters:
    sentence_list = tokenize.sent_tokenize(chapter)
    sentiments = {'compound': 0.0, 'neg': 0.0, 'neu': 0.0, 'pos': 0.0}

    for sentence in sentence_list:
        vs = analyzer.polarity_scores(sentence)
        sentiments['compound'] += vs['compound']
        sentiments['neg'] += vs['neg']
        sentiments['neu'] += vs['neu']
        sentiments['pos'] += vs['pos']

    sentiments['compound'] = sentiments['compound'] / len(sentence_list)
    sentiments['neg'] = sentiments['neg'] / len(sentence_list)
    sentiments['neu'] = sentiments['neu'] / len(sentence_list)
    sentiments['pos'] = sentiments['pos'] / len(sentence_list)

    print(sentiments)

The output for the print statement looks like this:

{'compound': 0.221757281553398, 'neg': 0.041514563106796104, 'neu': 0.8682621359223304, 'pos': 0.09024271844660196}
{'compound': 0.09577214285714292, 'neg': 0.06266428571428569, 'neu': 0.842964285714286, 'pos': 0.09440000000000001}
{'compound': 0.05855809523809526, 'neg': 0.06347619047619049, 'neu': 0.8621809523809518, 'pos': 0.07440000000000001}
{'compound': 0.1280093023255814, 'neg': 0.037604651162790693, 'neu': 0.8903488372093022, 'pos': 0.0720813953488372}
{'compound': -0.008434615384615398, 'neg': 0.07703076923076925, 'neu': 0.8496076923076921, 'pos': 0.07333846153846156}
{'compound': 0.20025294117647055, 'neg': 0.027411764705882358, 'neu': 0.910294117647059, 'pos': 0.06223529411764705}
{'compound': 0.24236, 'neg': 0.020013333333333327, 'neu': 0.9022666666666667, 'pos': 0.07770666666666666}
{'compound': 0.25085555555555544, 'neg': 0.056074074074074075, 'neu': 0.8129444444444446, 'pos': 0.1309814814814815}
{'compound': 0.02056170212765958, 'neg': 0.0704255319148936, 'neu': 0.8526382978723408, 'pos': 0.07694680851063829}
{'compound': -0.13621911764705882, 'neg': 0.09723529411764704, 'neu': 0.8521323529411767, 'pos': 0.05060294117647059}
{'compound': -0.07011322957198443, 'neg': 0.09842801556420237, 'neu': 0.8354124513618679, 'pos': 0.06617898832684826}
{'compound': 0.13921688311688318, 'neg': 0.04997402597402598, 'neu': 0.8669610389610388, 'pos': 0.083012987012987}
{'compound': 0.019619718309859153, 'neg': 0.08153521126760564, 'neu': 0.848169014084507, 'pos': 0.0702394366197183}
{'compound': 0.20739687499999998, 'neg': 0.04675, 'neu': 0.86025, 'pos': 0.09300000000000003}
{'compound': 0.05655333333333335, 'neg': 0.07552000000000003, 'neu': 0.8370933333333335, 'pos': 0.08737333333333329}
{'compound': 0.1834313253012048, 'neg': 0.03204819277108433, 'neu': 0.8945903614457832, 'pos': 0.07337349397590363}
{'compound': -0.058446464646464656, 'neg': 0.0901919191919192, 'neu': 0.8533737373737375, 'pos': 0.056434343434343434}
{'compound': 0.049436129032258073, 'neg': 0.06221935483870969, 'neu': 0.863077419354839, 'pos': 0.07469032258064519}
{'compound': 0.10077664233576646, 'neg': 0.053270072992700715, 'neu': 0.8727883211678833, 'pos': 0.07395620437956206}
{'compound': -0.09540880503144653, 'neg': 0.09535849056603773, 'neu': 0.8386918238993711, 'pos': 0.0659622641509434}
{'compound': -0.058940259740259765, 'neg': 0.08786363636363642, 'neu': 0.844915584415584, 'pos': 0.06720995670995672}
{'compound': -0.09371438356164379, 'neg': 0.09126712328767121, 'neu': 0.8470547945205481, 'pos': 0.06167808219178085}
{'compound': -0.10401964636542241, 'neg': 0.09612770137524558, 'neu': 0.8361139489194496, 'pos': 0.06777799607072695}
{'compound': -0.046306122448979595, 'neg': 0.07844217687074834, 'neu': 0.8614761904761906, 'pos': 0.06008163265306123}
{'compound': 0.05695540540540539, 'neg': 0.06936486486486487, 'neu': 0.8577702702702703, 'pos': 0.07287837837837836}
{'compound': -0.015284375000000006, 'neg': 0.07314843749999998, 'neu': 0.8589296875000001, 'pos': 0.06794531250000001}
{'compound': 0.05184410112359551, 'neg': 0.0851095505617977, 'neu': 0.82794382022472, 'pos': 0.08693258426966298}
{'compound': 0.023425435540069702, 'neg': 0.06889895470383278, 'neu': 0.8573484320557486, 'pos': 0.07374564459930318}
Dr.Data
  • Simply create a list of dictionaries and then convert it into a pandas data frame. This link might help: https://stackoverflow.com/questions/20638006/convert-list-of-dictionaries-to-a-pandas-dataframe – vatsal gosar Sep 28 '19 at 13:40

1 Answer

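  • Collect each chapter's dictionary in a list. This takes two extra lines: an empty list before the loop and an `append` at the end of each iteration.
  • Create the dataframe from `sentiments_list` with `pd.DataFrame`.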
import pandas as pd

# tokenize, analyzer, and chapters are defined as in the question

sentiments_list = []  # add this line

for chapter in chapters:
    sentence_list = tokenize.sent_tokenize(chapter)
    sentiments = {'compound': 0.0, 'neg': 0.0, 'neu': 0.0, 'pos': 0.0}

    for sentence in sentence_list:
        vs = analyzer.polarity_scores(sentence)
        sentiments['compound'] += vs['compound']
        sentiments['neg'] += vs['neg']
        sentiments['neu'] += vs['neu']
        sentiments['pos'] += vs['pos']

    sentiments['compound'] = sentiments['compound'] / len(sentence_list)
    sentiments['neg'] = sentiments['neg'] / len(sentence_list)
    sentiments['neu'] = sentiments['neu'] / len(sentence_list)
    sentiments['pos'] = sentiments['pos'] / len(sentence_list)

    sentiments_list.append(sentiments)  # add this line

df = pd.DataFrame(sentiments_list)  # add this line

Output of df (first six rows, index column omitted):

 compound       neg       neu       pos
 0.221757  0.041515  0.868262  0.090243
 0.095772  0.062664  0.842964  0.094400
 0.058558  0.063476  0.862181  0.074400
 0.128009  0.037605  0.890349  0.072081
-0.008435  0.077031  0.849608  0.073338
 0.200253  0.027412  0.910294  0.062235
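
For readers who want to try the conversion without running the sentiment analysis, here is a minimal self-contained sketch. It uses the first three dictionaries printed in the question. The `chapter_names` list and the `index=` argument are additions for illustration, not part of the answer above; the labels assume the rows come out in the same order as the `chapters` list.

```python
import pandas as pd

# First three dictionaries printed by the loop in the question.
sentiments_list = [
    {'compound': 0.221757281553398, 'neg': 0.041514563106796104,
     'neu': 0.8682621359223304, 'pos': 0.09024271844660196},
    {'compound': 0.09577214285714292, 'neg': 0.06266428571428569,
     'neu': 0.842964285714286, 'pos': 0.09440000000000001},
    {'compound': 0.05855809523809526, 'neg': 0.06347619047619049,
     'neu': 0.8621809523809518, 'pos': 0.07440000000000001},
]

# Assumption: row labels follow the order of the `chapters` list.
chapter_names = ['ainulindale', 'valaquenta', 'ch1']

# Each dict becomes one row; the dict keys become the column names.
df = pd.DataFrame(sentiments_list, index=chapter_names)

print(df.shape)  # (3, 4)
print(df.round(3))
```

Passing `index=` is optional. Without it the rows are numbered from 0, as in the answer's `df`.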
Trenton McKinney
    This should be the accepted answer. Just wanted to point out that `pd.DataFrame.from_dict` should also work. And if, for whatever reason, you have a string representation instead of a list, you can use `pd.DataFrame.from_records`. – Andrew Sep 27 '19 at 19:24
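
A rough sketch of the two alternatives this comment mentions, using the first two lines of printed output from the question. Two things here are assumptions rather than something the comment states: the strings are parsed with `ast.literal_eval` before calling `from_records` (as far as I know, `from_records` does not parse strings on its own), and `from_dict` is given a dict keyed by chapter name with `orient='index'`, which is its documented use.

```python
import ast

import pandas as pd

# Two lines of printed output from the question, kept as raw strings.
printed_lines = [
    "{'compound': 0.221757281553398, 'neg': 0.041514563106796104, "
    "'neu': 0.8682621359223304, 'pos': 0.09024271844660196}",
    "{'compound': 0.09577214285714292, 'neg': 0.06266428571428569, "
    "'neu': 0.842964285714286, 'pos': 0.09440000000000001}",
]

# Turn each string back into a dict (assumed parsing step), then build the frame.
records = [ast.literal_eval(line) for line in printed_lines]
df_records = pd.DataFrame.from_records(records)

# Assumption: labels follow the order of the `chapters` list in the question.
chapter_names = ['ainulindale', 'valaquenta']
by_chapter = dict(zip(chapter_names, records))

# orient='index' makes each outer key a row label and each inner key a column.
df_by_chapter = pd.DataFrame.from_dict(by_chapter, orient='index')

print(df_records.shape)     # (2, 4)
print(df_by_chapter.shape)  # (2, 4)
```

Both frames hold the same values. The second one is indexed by chapter name instead of by position.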