
I am trying to run a groupby function in pandas. I appreciate that there are already a lot of posts answering this but I can't get any of the answers to work. I have data as follows:

This is the input

index created_at full_text
0 date time object lots of text in here
1 date time object lots of text in here

I need to group the data by the created_at column so that texts posted on the same date are grouped together, and then get the size of each group. I need to keep all the other columns.

I have tried various implementations as variations on the code below. I have also tried converting the datetime object to just the date, but that doesn't group the column properly. Simply trying:

df_mh.groupby(['created_at', 'full_text']).size()

doesn't group by date properly.

This is my code at the moment:

df_mh = df_final.loc[df_final['full_text'].str.contains('mental health')]

mh_grouped = df_mh.groupby(df_mh.created_at.dt.date, as_index = False)['full_text'].size()

This is the output with the current code:

index created_at size
0 2020-07-12 2
1 2020-10-12 1

So this works but misses off the full_text column.

How can I write it so that all the columns are preserved?

Edit sample data:

[{'created_at': 'Sun Mar 22 04:11:34 +0000 2020', 'full_text': 'this is the full text of the first post'},
 {'created_at': 'Sun Mar 22 04:11:34 +0000 2020', 'full_text': 'this is the full text of the second post'},
 {'created_at': 'Sun Mar 24 04:11:34 +0000 2020', 'full_text': 'this is the full text of the third post'}]

So my required process is: convert the timestamp to just the day, month and year, then aggregate the posts so that those written on the same day are grouped and counted. I hope that makes sense.

frogger

3 Answers


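Using transform on a groupby object returns a result with the same number of rows as the original table, which you can then assign as a new column. Select a single column before calling transform so that the result is a Series.

df["size"] = df.groupby("created_at")["full_text"].transform("count")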
Michael Cao
  • Thank you for your response, I have tried that and it still misses off the full_text column and gives a bunch of NaN in the size column. What am I doing wrong? – frogger Jun 19 '23 at 19:29
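For what it's worth, here is a minimal sketch of the transform approach keyed on the calendar date rather than the full timestamp, so that posts from the same day share one count and every original row and column is kept. The sample frame and its values are made up for illustration; only the column names created_at and full_text come from the question, and created_at is assumed to already be a datetime column.

```python
import pandas as pd

# Hypothetical sample data; only the column names come from the question.
df = pd.DataFrame({
    "created_at": pd.to_datetime([
        "2020-03-22 04:11:34+00:00",
        "2020-03-22 09:30:00+00:00",
        "2020-03-24 04:11:34+00:00",
    ]),
    "full_text": ["first post", "second post", "third post"],
})

# Group on the calendar date, not the full timestamp.
day = df["created_at"].dt.date

# transform returns one value per original row, so the result lines up
# with df's index and can be assigned directly as a new column.
# "count" counts the non-null full_text values in each group.
df["size"] = df.groupby(day)["full_text"].transform("count")

print(df)
```

Each row keeps its full_text, and the size column repeats the per-day count on every row of that day.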

You can join the original dataframe to the grouped dataframe on the created_at field. Something like this:

df \
    .groupby("created_at", as_index=False) \
    ["full_text"] \
    .size() \
    .merge(df, on="created_at")
Maria K
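As a rough sketch of the same join idea applied per day: the question wants posts grouped by date, so the merge key here is a date-only helper column instead of the raw timestamp. The helper column name created_date and the sample values are assumptions for illustration, not part of the original post.

```python
import pandas as pd

# Hypothetical sample data; only the column names come from the question.
df = pd.DataFrame({
    "created_at": pd.to_datetime([
        "2020-03-22 04:11:34+00:00",
        "2020-03-22 09:30:00+00:00",
        "2020-03-24 04:11:34+00:00",
    ]),
    "full_text": ["first post", "second post", "third post"],
})

# Assumed helper column holding just the calendar date.
df["created_date"] = df["created_at"].dt.date

# One row per date with its group size (columns: created_date, size).
counts = df.groupby("created_date", as_index=False).size()

# A left merge keeps every original row and attaches that day's count.
merged = df.merge(counts, on="created_date", how="left")

print(merged)
```

Merging from the original frame with how="left" keeps the original row order, which can be convenient if the rows are compared against the input afterwards.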

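Try this, which counts each group and collects its texts into a list (assigning df["full_text"] directly to the grouped result would align on the index and attach the wrong rows):

df = df.groupby("created_at", as_index=False).agg(size=("full_text", "size"), full_text=("full_text", list))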
Himanshu Panwar