I am trying to run a groupby function in pandas. I appreciate that there are already a lot of posts answering this but I can't get any of the answers to work. I have data as follows:
This is the input
index | created_at | full_text |
---|---|---|
0 | date time object | lots of text in here |
1 | date time object | lots of text in here |
I need to group the data by the created_at column so that texts posted on the same date are grouped together, and then get the size of each group. I need to keep all the other columns.
I have tried various implementations as variations on the code below. I have also tried converting the datetime object to just the date, but that doesn't group the column properly. Simply trying:
df_mh.groupby(['created_at', 'full_text']).size()
doesn't group the dates properly.
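A likely reason, assuming created_at holds full timestamps: two posts from the same day still differ in their time of day, so every distinct timestamp ends up as its own group. A small sketch with made-up values (the series `ts` is purely illustrative):

```python
import pandas as pd

# Made-up timestamps: two on the same day, one on another day.
ts = pd.Series(pd.to_datetime([
    "2020-07-12 04:11:34",
    "2020-07-12 09:30:00",
    "2020-10-12 18:00:00",
]))

# Grouping on the raw timestamps: one group per distinct timestamp.
by_timestamp = ts.groupby(ts).size()

# Grouping on the timestamps with the time set to midnight: one group per day.
by_day = ts.groupby(ts.dt.normalize()).size()

print(by_timestamp)
print(by_day)
```

Here `by_timestamp` has three groups of one row each, while `by_day` has one group of two rows and one group of one row.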
This is my code at the moment:
df_mh = df_final.loc[df_final['full_text'].str.contains('mental health')]
mh_grouped = df_mh.groupby(df_mh.created_at.dt.date, as_index = False)['full_text'].size()
This is the output of the current code. Everything works, except that the full_text column is missing:
index | created_at | size |
---|---|---|
0 | 2020-07-12 | 2 |
1 | 2020-10-12 | 1 |
So the grouping works, but the full_text column is dropped.
How can I write it so that all the columns are preserved?
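One possible way to keep every original row and column while still attaching the group size is `groupby(...).transform("size")`, which returns one value per input row instead of one per group. The sketch below is only an illustration under assumptions: the frame is a made-up stand-in for df_mh, and the column names `date` and `posts_that_day` are hypothetical.

```python
import pandas as pd

# Made-up stand-in for df_mh; the values are for illustration only.
df_mh = pd.DataFrame({
    "created_at": pd.to_datetime([
        "2020-07-12 04:11:34",
        "2020-07-12 09:30:00",
        "2020-10-12 18:00:00",
    ]),
    "full_text": [
        "first post about mental health",
        "second post about mental health",
        "third post about mental health",
    ],
})

# Group key: the calendar date only, without the time of day.
day = df_mh["created_at"].dt.date

# transform("size") broadcasts each group's size back onto its rows,
# so no rows or columns are lost.
out = df_mh.assign(
    date=day,
    posts_that_day=df_mh.groupby(day)["full_text"].transform("size"),
)

print(out)
```

Every row keeps its created_at and full_text, and rows from the same day share the same `posts_that_day` value.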
Edit sample data:
[{'created_at': 'Sun Mar 22 04:11:34 +0000 2020', 'full_text': 'this is the full text of the first post'},
 {'created_at': 'Sun Mar 22 04:11:34 +0000 2020', 'full_text': 'this is the full text of the second post'},
 {'created_at': 'Sun Mar 24 04:11:34 +0000 2020', 'full_text': 'this is the full text of the third post'}]
So my required process is: convert the timestamp to just the day, month and year, then aggregate the posts so that those written on the same day are grouped together and counted. I hope that makes sense.
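The process described above could be sketched end to end as follows. This is a rough illustration, not a definitive solution: the records mirror the sample data (with the third weekday written as Tue so that it matches the calendar date), the format string is an assumption about how the timestamps are written, and the names `date` and `per_day` are hypothetical. It collapses each day into one row holding the count and a list of that day's texts.

```python
import pandas as pd

# Records modeled on the sample data above.
records = [
    {"created_at": "Sun Mar 22 04:11:34 +0000 2020",
     "full_text": "this is the full text of the first post"},
    {"created_at": "Sun Mar 22 04:11:34 +0000 2020",
     "full_text": "this is the full text of the second post"},
    {"created_at": "Tue Mar 24 04:11:34 +0000 2020",
     "full_text": "this is the full text of the third post"},
]
df = pd.DataFrame(records)

# Step 1: parse the timestamp strings (assumed format) into datetimes.
df["created_at"] = pd.to_datetime(
    df["created_at"], format="%a %b %d %H:%M:%S %z %Y", utc=True
)

# Step 2: keep only the calendar date.
df["date"] = df["created_at"].dt.date

# Step 3: one row per day, with the number of posts and the texts themselves.
per_day = df.groupby("date", as_index=False).agg(
    size=("full_text", "size"),
    full_text=("full_text", list),
)

print(per_day)
```

If one row per post is preferred over one row per day, the same `date` column can be used as the group key with `transform("size")` instead of `agg`.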