
I have a very large dataset about Twitter. I want to compute the mean number of tweets per hour published by each user. I was able to group the tweets per hour per user, but now how can I compute the mean per hour?

I'm not able to share all the code since the dataset has been heavily preprocessed. In the dataset I have the columns user_id and created_at, which is the timestamp of the published tweet, so I sorted by created_at and then grouped down to the hour:

grouped_df = tweets_df.sort_values(["created_at"]).groupby([
    tweets_df['user_id'],
    tweets_df['created_at'].dt.year, 
    tweets_df['created_at'].dt.month,
    tweets_df['created_at'].dt.day,
    tweets_df['created_at'].dt.hour])

I can count the tweets per hour per user using

tweet_per_hour = grouped_df["created_at"].count()

print(tweet_per_hour)

What I obtain using this code is:

user_id     created_at  created_at  created_at  created_at
678033      2012        3           11          2             1
                                                14            1
                                                17            1
                                                18            1
                        4           13          4             1
                                                             ..
3164941860  2020        4           30          7             6
                                                9             2
                        5           1           1             2
                                                9             6
                                    2           6             1
Name: created_at, Length: 3829888, dtype: int64

where the last column is the count of tweets per hour. For example,

678033      2012        3           11          2             1

indicates that user 678033 made just 1 tweet on 2012-03-11 between 2 o'clock and 3 o'clock.

I need to sum all the tweets per hour made by the user and compute a mean for that user. So I want as output, for example:

user_id     average_tweets_per_hour
678033      4
665353      10

How can I do it?

EDIT: This is the reproducible example. I have df_t and df_u; df_u_new is what I want to get:

import pandas as pd
import numpy as np

df_t = pd.DataFrame({
    'id_t': [0, 1, 2, 3],
    'id_u': [1, 1, 1, 2],
    'timestamp': ["2019-06-27 11:12:32", "2019-06-27 11:14:32",
                  "2020-07-28 11:24:32", "2020-02-27 13:30:21"],
})

print(df_t)

df_u = pd.DataFrame({'id_u': [1, 2]})

print()
print(df_u)

df_u_new = pd.DataFrame({'id_u': [1, 2], 'avg_t_per_h': [2, 1]})

print()  
print(df_u_new)
JayJona
  • use `.mean()` instead? – Joran Beasley Nov 04 '22 at 18:12
    It looks like you could group `tweet_per_hour` by `user_id`, then take the mean. I'd like to post an answer, but I'd want to see a [reproducible example of input and output](/q/20109391/4518341) first. (For more info, see [mre].) – wjandrea Nov 04 '22 at 18:18
  • @wjandread Thanks for the answer I added an example – JayJona Nov 05 '22 at 09:18
  • In the example code you give, user 1 has 2 tweets at 11 in one day, and 1 tweet at 11 of another day. Do you want the result to be the mean for each day (in which case it would be [2,1]) or do you want the overall mean for every day (in which case should it be 1.5, as it would be mean([2,1]) )? – Yolao_21 Nov 08 '22 at 13:57

4 Answers


I'm not sure what your column names are at this point, but it would be something like this:

grouped_df.reset_index().groupby("user_id").agg(avgTweetsPerHour = ('created_at','mean'))

As was commented above, I can't test this without enough information to reproduce the data, but .agg() pairs beautifully with .groupby().
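As a sanity check, here is a minimal sketch of that pattern on a hand-built stand-in for the asker's counted Series (the MultiIndex level names are assumptions; the question's groupby would actually repeat "created_at" as a level name, which would need renaming before reset_index):

```python
import pandas as pd

# Hypothetical reconstruction of tweet_per_hour: counts indexed by
# (user_id, year, month, day, hour)
idx = pd.MultiIndex.from_tuples(
    [(678033, 2012, 3, 11, 2), (678033, 2012, 3, 11, 14), (678033, 2012, 4, 13, 4)],
    names=["user_id", "year", "month", "day", "hour"],
)
tweet_per_hour = pd.Series([1, 1, 1], index=idx, name="count")

# Flatten the index into columns, then average the counts per user
out = (tweet_per_hour.reset_index()
       .groupby("user_id")
       .agg(avgTweetsPerHour=("count", "mean")))
print(out)
```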

Update: To get the average per hour, you can do this in a two-step process: (a) get the count for every hour, (b) average for each user.

However, we need to make sure what we calculate reflects the end goal. If you group by the year, month, day, and hour of the tweets, then any hours that had zero tweets (by anyone) would not be included (though this is perhaps unlikely). Furthermore, if a user started tweeting in 2022, their average should probably not include the hours in 2021, right?

Let's say you want a user's average tweets per hour from when they started to the end of your data.

So let's use a different calculation: total tweets / hours they were active.

To prevent writing the same code repeatedly, I'll make a function to calculate the hours between two timestamps:

import math

def hours_rounded_up(datemax, datemin):
    # total number of hour-long intervals covered by [datemin, datemax]
    totaldays = (datemax - datemin).days
    remainingseconds = (datemax - datemin).seconds
    hours = totaldays * 24 + remainingseconds / 3600
    return math.floor(hours) + 1

Note about the above: When you subtract dates, you get a timedelta object. You can extract the whole number of days and multiply that by 24 hours, then convert the extra seconds to hours and add them. We round that down (floor function) and add 1 because we want the total number of hour intervals that are covered.
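A quick sanity check of that logic (the function is repeated here so the snippet is self-contained):

```python
import math
from datetime import datetime

def hours_rounded_up(datemax, datemin):
    # total number of hour-long intervals covered by [datemin, datemax]
    totaldays = (datemax - datemin).days
    remainingseconds = (datemax - datemin).seconds
    hours = totaldays * 24 + remainingseconds / 3600
    return math.floor(hours) + 1

# 1 h 52 min apart: floor(1.87) + 1 = 2 hour intervals
assert hours_rounded_up(datetime(2019, 6, 27, 13, 5, 0),
                        datetime(2019, 6, 27, 11, 12, 32)) == 2

# exactly one day apart: 24 elapsed hours count as 25 intervals
assert hours_rounded_up(datetime(2019, 6, 28, 11, 0, 0),
                        datetime(2019, 6, 27, 11, 0, 0)) == 25
```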

From there, we define the DataFrame, group by user id, and then divide the total tweet count by the hours using a lambda function in the aggregation:

import pandas as pd
from datetime import datetime

created_at_strings = ["2019-06-27 11:12:32", "2019-06-27 11:14:32", "2020-07-28 11:24:32", "2020-02-27 13:30:21"]
created_at = [datetime.strptime(x, "%Y-%m-%d %H:%M:%S") for x in created_at_strings]
df_t = pd.DataFrame({'id_t': [0, 1, 2, 3], 'id_u': [1, 1, 1, 2], 'created_at': created_at})

maxtimestamp = df_t["created_at"].max()
g = df_t.groupby(["id_u"]).agg(tweetsTotal = ("id_t","count"),
                               timespan_hours = ("created_at", lambda x: hours_rounded_up(maxtimestamp, x.min())),
                               tweetsPerHour = ("created_at", lambda x: x.count()/hours_rounded_up(maxtimestamp, x.min()))
                              )
df_final = g.reset_index()

The output is:

    id_u    tweetsTotal timespan_hours  tweetsPerHour
0   1       3           9529            0.000315
1   2       1           3646            0.000274

You don't actually need the tweetsTotal and timespan_hours columns, but they may be useful for validity testing.

That doesn't look exactly like your df_u_new output because of my assumptions about the timeframe to average over. You could replace my variable maxtimestamp with x.max() if you want to average over only the user's own timeframe.
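That per-user variant would look roughly like this (same hypothetical data as above, with the helper inlined so the snippet runs on its own):

```python
import math
import pandas as pd

def hours_rounded_up(datemax, datemin):
    # total number of hour-long intervals covered by [datemin, datemax]
    delta = datemax - datemin
    return math.floor(delta.days * 24 + delta.seconds / 3600) + 1

df_t = pd.DataFrame({
    "id_t": [0, 1, 2, 3],
    "id_u": [1, 1, 1, 2],
    "created_at": pd.to_datetime(["2019-06-27 11:12:32", "2019-06-27 11:14:32",
                                  "2020-07-28 11:24:32", "2020-02-27 13:30:21"]),
})

# Each user's span now runs from their own first to their own last tweet
g = df_t.groupby("id_u").agg(
    tweetsPerHour=("created_at",
                   lambda x: x.count() / hours_rounded_up(x.max(), x.min()))
)
print(g)
```

Note that a user with a single tweet (min == max) gets a span of one hour interval, so their rate is simply 1.0.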

Hope that helps! :)

Vincent Rupp

You can try using pd.resample. Based on the sample code you shared:

df_t.set_index(pd.to_datetime(df_t['timestamp']),drop=True).groupby('id_u').resample('1h')[['id_t']].count().replace(0, np.nan).dropna()

Out:
                          id_t
id_u timestamp                
1    2019-06-27 11:00:00   2.0
     2020-07-28 11:00:00   1.0
2    2020-02-27 13:00:00   1.0

The output shows the count of tweets per user, per day, and per hour. Afterwards, you can easily compute the overall mean tweets per hour per user with another groupby:


out = df_t.set_index(pd.to_datetime(df_t['timestamp']), drop=True).groupby('id_u').resample('1h')[['id_t']].count().replace(0, np.nan).dropna()

print(out.groupby('id_u').mean().rename(columns={'id_t':'tweets_per_h'}))

Out:
      tweets_per_h
id_u              
1              1.5
2              1.0
Yolao_21

You may be able to do something as simple as

df.groupby('A').mean()

But as the other responses have noted, it's difficult to know exactly what to suggest without a reproducible example.
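On toy data shaped like the already-counted per-hour table, that one-liner would look like this (column names here are hypothetical):

```python
import pandas as pd

# Hypothetical data: per-hour tweet counts already computed, keyed by user column "A"
df = pd.DataFrame({"A": [1, 1, 2], "tweets": [2, 1, 1]})

# Average the per-hour counts for each user
means = df.groupby("A").mean()
print(means)
```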


If in doubt, you should try just iterating over the groupby object, because then you get to work with a normal DataFrame, which in my experience is much easier to work with. So something like:

for index, group in tweets_df.groupby('user_id'):
    ...

gives you a clean DataFrame for each user in group (it can also help to step through with a debugger to see exactly where the values you need are). Hope this might help :)
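Putting that loop to work on the question's sample data, one plausible sketch is to truncate each timestamp to its hour inside the loop and average the resulting counts (dt.floor("h") is my choice here, not something from the question):

```python
import pandas as pd

tweets_df = pd.DataFrame({
    "user_id": [1, 1, 1, 2],
    "created_at": pd.to_datetime(["2019-06-27 11:12:32", "2019-06-27 11:14:32",
                                  "2020-07-28 11:24:32", "2020-02-27 13:30:21"]),
})

averages = {}
for user_id, group in tweets_df.groupby("user_id"):
    # count tweets per calendar hour for this user, then average those counts
    per_hour = group.groupby(group["created_at"].dt.floor("h")).size()
    averages[user_id] = per_hour.mean()

print(averages)
```

User 1 has two tweets in one hour and one tweet in another, so their average is 1.5; user 2 tweeted once in a single hour, giving 1.0.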

maxxel_