0

I have an excel file that includes 10K rows, each row has some information for a tweet. For example these columns: Tweet, Date of Tweet, User Name, Retweet Count, ..., User Location, Sentiment(the value in this column is Positive or negative or neutral), State(the value of this column is 50 states of the USA), Abbreviation (the value of this column is the abbreviation of state such as CA, NJ, NY,..), CountofNegative (this column is empty and I wish to write the number of Negative tweets for each state in this column, so this column will have 50 numbers).

in following you can see the screenshot of this dataset: enter image description here

Problem: count the number of negative tweets for each state or its abbreviation and write in CountofNegative column. following is my code:

import pandas as pd

file=pd.read_excel("C:/Users/amtol/Desktop/Project/filter.xlsx")
UserLocation= file["User Location"]
Sentiment= file["Sentiment"]
CountofNegative= file["CountofNegative"]
State=file["State"]
Abbreviation= file["Abbreviation"]

for i, (loc,sent) in enumerate(zip(UserLocation, Sentiment)):
    count=0
    for j, (state, abbr) in enumerate(zip(State, Abbreviation)):
        if (loc == state or loc == abbr and sent == "Negative"):
            count=count+1
        file.loc[j+1,"CountofNegative"]=count

print(CountofNegative)

file.to_excel("C:/Users/amtol/Desktop/Project/filter.xlsx")

there is no error but when the output file is created, the first 24 values for column "CountofNegative" are zeros and the rest of them are ones(which they are not correct answers). Also, I wanted to test the program by print(CountofNegative), which still nothing happened. (no output). how can I fix my code?

mvtn
  • 35
  • 8
  • 1
    Please [provide a reproducible copy of the DataFrame with `df.to_clipboard(sep=',')`](https://stackoverflow.com/questions/52413246/how-to-provide-a-copy-of-your-dataframe-with-to-clipboard). [Stack Overflow Discourages Screenshots](https://meta.stackoverflow.com/questions/303812/discourage-screenshots-of-code-and-or-errors). It is likely the question will be downvoted. You are discouraging assistance because no one wants to retype your data or code, and screenshots are often illegible. – Trenton McKinney May 11 '20 at 20:06
  • Or you don't want to provide the data or if it is too big, demonstrate a sample data that would look like the real one. – Dejene T. May 11 '20 at 20:19

1 Answers1

1

Okay, so if there are no generalities in the abbreviation and the state names then first convert the full names to abbreviation using the dict as in the code. Make some changes in the dict if some names/abbreviations are not proper.

Since we are concerned only with 'Negative' Counts. Convert Negative to 1 and other responses by 0 as below:

#Created sample dataset
 data={'State':['New York','New York','New York','New Jersey','New Jersey','New Jersey','California','California','California','NY','NJ','CA'],
'Sentiment' :['Negative','Positive','Negative','Neutral','Negative','Positive','Positive','Positive','Positive','Negative','Positive','Negative'], }
 df = pd.DataFrame(data, columns = ['State', 'Sentiment'])
 print (df)

#Dictionary of US states and abbreviations 
 di = {
'Alabama': 'AL',
'Alaska': 'AK',
'American Samoa': 'AS',
'Arizona': 'AZ',
'Arkansas': 'AR',
'California': 'CA',
'Colorado': 'CO',
'Connecticut': 'CT',
'Delaware': 'DE',
'District of Columbia': 'DC',
'Florida': 'FL',
'Georgia': 'GA',
'Guam': 'GU',
'Hawaii': 'HI',
'Idaho': 'ID',
'Illinois': 'IL',
'Indiana': 'IN',
'Iowa': 'IA',
'Kansas': 'KS',
'Kentucky': 'KY',
'Louisiana': 'LA',
'Maine': 'ME',
'Maryland': 'MD',
'Massachusetts': 'MA',
'Michigan': 'MI',
'Minnesota': 'MN',
'Mississippi': 'MS',
'Missouri': 'MO',
'Montana': 'MT',
'Nebraska': 'NE',
'Nevada': 'NV',
'New Hampshire': 'NH',
'New Jersey': 'NJ',
'New Mexico': 'NM',
'New York': 'NY',
'North Carolina': 'NC',
'North Dakota': 'ND',
'Northern Mariana Islands':'MP',
'Ohio': 'OH',
'Oklahoma': 'OK',
'Oregon': 'OR',
'Pennsylvania': 'PA',
'Puerto Rico': 'PR',
'Rhode Island': 'RI',
'South Carolina': 'SC',
'South Dakota': 'SD',
'Tennessee': 'TN',
'Texas': 'TX',
'Utah': 'UT',
'Vermont': 'VT',
'Virgin Islands': 'VI',
'Virginia': 'VA',
'Washington': 'WA',
'West Virginia': 'WV',
'Wisconsin': 'WI',
'Wyoming': 'WY'
}

#Match the names in the dictionary to columns using
df=df.replace({"State": di}) 

#Create a function to give weight only to negative comments
def convert_to_int(word):
word_dict = {'Negative':1, 'Positive':0, 'Neutral':0, 0: 0}
return word_dict[word]

#Convert the Sentiment col as per the above function
df['Sentiment'] = df['Sentiment'].apply(lambda x : convert_to_int(x))

#Now the final part of doing the count of negative
df['negative_sum'] = df['Sentiment'].groupby(df['State']).transform('sum')


#My final output

 State  Sentiment   negative_sum
0   NY  1   3
1   NY  0   3
2   NY  1   3
3   NJ  0   1
4   NJ  1   1
5   NJ  0   1
6   CA  0   1
7   CA  0   1
8   CA  0   1
9   NY  1   3
10  NJ  0   1
11  CA  1   1

Now, you also have the option to again convert the Sentiment Column to strings as now we have the column we needed of negative sum. I hope this suffice the purpose.

Sahil Jain
  • 48
  • 7
  • I have edited my question to be clear. would you please read it again? maybe this time you can understand my problem better. thank you – mvtn May 12 '20 at 22:41
  • Hey, edited my answer as per the question now. Please see if this suffice the purpose. – Sahil Jain May 13 '20 at 08:41
  • Thank you. so, you changed the label "negative" to 1 and the rest to 0. then you counted the number of ones for each state. Actually I want to find negative labels for the state and **its abbreviation**. for example, if there are 3 negatives for CA and 2 negatives for California, then our program assigns 5 in the cell J6. Also, how we can make this new column for "# of negatives for each state" in our current excel file? What is your idea about my code? is it totally wrong or can you fix it? – mvtn May 13 '20 at 18:07
  • Hey, made some changes to the code. Please see if this suffice the purpose. _If some state name abbreviation and full name does not match please edit the dictionary itself_. A new column of negative_sum has already been added in the last line of the code. Save this df as excel to have the desired column in the excel file too along with initial columns. – Sahil Jain May 13 '20 at 19:42
  • I added one line to the end for saving df as excel```df.to_excel("C:/Users/amtol/Desktop/Project/filter5.xlsx")```. I see the column "negative_sum" is created and all members of that, are 0 or 1. But in this dataset, I have 186 tweets(rows), and 50 states, which means some states should have the number greater than 1 for "negative_sum". is the last line which I added to save the file in excel, is wrong or your codes? – mvtn May 14 '20 at 04:13
  • hey,strange..because when i tried with dummy data i got the correct output. Please recheck again. I have also attached my data here above to verify the results. :) – Sahil Jain May 14 '20 at 07:14
  • Thank you so much. yes, you correct. you solved my question. thanks for your time. – mvtn May 15 '20 at 06:25