0

Sorry for the long-winded question. I will make sure to clarify. What I have is a dataframe that looks a little like this: In [1]: df Out[1]:

    likes num_comments words
0    324   1032         song
1    123   4399         bird       
2    500   4993         pie
3    453   2003         bird
4    34    132          lucky
...                     ...
943  129   259          bird
944  594   5993         lucky
945  234   642          bird
946  95    813          song
947  461   6122         pie

I want a dataframe that reduces this large dataframe into a smaller one, by combing all words that are the same. I also want to sum together their likes and number of comments and then I want to add a column that adds how many times they appeared in the original dataframe. That is, I want something like this:

In [2]: df2 Out[2]:

    likes num_comments words frequency
0    419   2035         song  2
1    939   7303         bird  4  
2    961   11115        pie   2
4    628    6155        lucky 2

etc...

But I am not even sure if it is possible. Note that I want to combine a ton of unique words in a very large dataframe. I tried to just combine the words with:

df1 = df.grouby('word')

But when I printed, I got this output:

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x000001146E8883D0>
  • So, from the suggested [duplicate](https://stackoverflow.com/questions/38174155/group-dataframe-and-get-sum-and-count), you need `df.groupby('words', sort=False).agg(likes=('likes','sum'),num_comments=('num_comments','sum'),frequency=('words','count')).reset_index(drop=False)` and then reorder the cols if you want the exact order of cols like in your `out`. – ouroboros1 Aug 21 '22 at 20:52

2 Answers2

1

You can do:

frequency = df["words"].value_counts().tolist()
out = (
    df.groupby("words", as_index=False)
    .agg({col: "sum" for col in df.columns.drop("words")})
    .assign(frequency=frequency)
)
SomeDude
  • 13,876
  • 5
  • 21
  • 44
0

You can make a pivot table

df_pivot = pd.pivot_table(df,
                    index='words',
                    aggfunc=['sum', 'count']).reset_index()

and then drop extra column