
I have a very large dataset about Twitter. I want to compute the mean number of tweets per hour published by each user. I was able to group the tweets per hour per user, but now how can I compute the mean per hour?

I'm not able to share all the code since the dataset has been heavily preprocessed. In the dataset I have the columns user_id and created_at, which is the timestamp of the published tweet, so I sorted by created_at and then grouped down to the hour:

grouped_df = tweets_df.sort_values(["created_at"]).groupby([
    tweets_df['user_id'],
    tweets_df['created_at'].dt.year, 
    tweets_df['created_at'].dt.month,
    tweets_df['created_at'].dt.day,
    tweets_df['created_at'].dt.hour])

I can count the tweets per hour per user using

tweet_per_hour = grouped_df["created_at"].count()

print(tweet_per_hour)

What I obtain using this code is:

user_id     created_at  created_at  created_at  created_at
678033      2012        3           11          2             1
                                                14            1
                                                17            1
                                                18            1
                        4           13          4             1
                                                             ..
3164941860  2020        4           30          7             6
                                                9             2
                        5           1           1             2
                                                9             6
                                    2           6             1
Name: created_at, Length: 3829888, dtype: int64

where the last column is the count of tweets per hour. For example,

678033      2012        3           11          2             1

indicates that user 678033 made just 1 tweet on 2012-03-11 between 2 o'clock and 3 o'clock.

I need to sum all the tweets per hour made by the user and compute a mean for that user. So I want as output, for example:

user_id     average_tweets_per_hour
678033      4
665353      10

How can I do it?

EDIT: This is the reproducible example. I have df_t and df_u; df_u_new is what I want to get:

import pandas as pd
import numpy as np

df_t = pd.DataFrame({
    'id_t': [0, 1, 2, 3],
    'id_u': [1, 1, 1, 2],
    'timestamp': ["2019-06-27 11:12:32", "2019-06-27 11:14:32",
                  "2020-07-28 11:24:32", "2020-02-27 13:30:21"],
})

print(df_t)

df_u = pd.DataFrame({'id_u': [1, 2]})

print()
print(df_u)

df_u_new = pd.DataFrame({'id_u': [1, 2], 'avg_t_per_h': [2, 1]})

print()  
print(df_u_new)
JayJona
  • use `.mean()` instead? – Joran Beasley Nov 04 '22 at 18:12
    It looks like you could group `tweet_per_hour` by `user_id`, then take the mean. I'd like to post an answer, but I'd want to see a [reproducible example of input and output](/q/20109391/4518341) first. (For more info, see [mre].) – wjandrea Nov 04 '22 at 18:18
  • @wjandread Thanks for the answer I added an example – JayJona Nov 05 '22 at 09:18
  • In the example code you give, user 1 has 2 tweets at 11 in one day, and 1 tweet at 11 of another day. Do you want the result to be the mean for each day (in which case it would be [2,1]) or do you want the overall mean for every day (in which case should it be 1.5, as it would be mean([2,1]) )? – Yolao_21 Nov 08 '22 at 13:57

4 Answers


I'm not sure what your column names are at this point, but it would be something like this:

grouped_df.reset_index().groupby("user_id").agg(avgTweetsPerHour = ('created_at','mean'))

As was commented above, I can't test this without enough information to reproduce the data, but .agg() pairs beautifully with .groupby().
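As a sanity check, here is a minimal sketch of that pattern on a hand-built stand-in for the asker's counted Series (the MultiIndex level names are assumptions; the question's groupby would actually repeat "created_at" as a level name, which would need renaming before reset_index):

```python
import pandas as pd

# Hypothetical reconstruction of tweet_per_hour: counts indexed by
# (user_id, year, month, day, hour)
idx = pd.MultiIndex.from_tuples(
    [(678033, 2012, 3, 11, 2), (678033, 2012, 3, 11, 14), (678033, 2012, 4, 13, 4)],
    names=["user_id", "year", "month", "day", "hour"],
)
tweet_per_hour = pd.Series([1, 1, 1], index=idx, name="count")

# Flatten the index into columns, then average the counts per user
out = (tweet_per_hour.reset_index()
       .groupby("user_id")
       .agg(avgTweetsPerHour=("count", "mean")))
print(out)
```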

Update: To get the average per hour, you can do this in a two-step process: (a) get the count for every hour, (b) average for each user.

However, we need to make sure what we calculate reflects the end goal. If you group by the year, month, day, and hour of the tweets, then any hours that had zero tweets (by anyone) would not be included (though this is perhaps unlikely). Furthermore, if a user started tweeting in 2022, their average should probably not include the hours in 2021, right?

Let's say you want a user's average tweets per hour from when they started to the end of your data.

So let's use a different calculation: total tweets / hours they were active.

To prevent writing the same code repeatedly, I'll make a function to calculate the hours between two timestamps:

import math

def hours_rounded_up(datemax, datemin):
    # total number of hour-long intervals covered by [datemin, datemax]
    totaldays = (datemax - datemin).days
    remainingseconds = (datemax - datemin).seconds
    hours = totaldays * 24 + remainingseconds / 3600
    return math.floor(hours) + 1

Note about the above: When you subtract dates, you get a timedelta object. You can extract the whole number of days and multiply that by 24 hours, then convert the extra seconds to hours and add them. We round that down (floor function) and add 1 because we want the total number of hour intervals that are covered.
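A quick sanity check of that logic (the function is repeated here so the snippet is self-contained):

```python
import math
from datetime import datetime

def hours_rounded_up(datemax, datemin):
    # total number of hour-long intervals covered by [datemin, datemax]
    totaldays = (datemax - datemin).days
    remainingseconds = (datemax - datemin).seconds
    hours = totaldays * 24 + remainingseconds / 3600
    return math.floor(hours) + 1

# 1 h 52 min apart: floor(1.87) + 1 = 2 hour intervals
assert hours_rounded_up(datetime(2019, 6, 27, 13, 5, 0),
                        datetime(2019, 6, 27, 11, 12, 32)) == 2

# exactly one day apart: 24 elapsed hours count as 25 intervals
assert hours_rounded_up(datetime(2019, 6, 28, 11, 0, 0),
                        datetime(2019, 6, 27, 11, 0, 0)) == 25
```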

From there, we define the DataFrame, group by user id, and then divide the total tweet count by the hours using a lambda function in the aggregation:

import pandas as pd
from datetime import datetime

created_at_strings = ["2019-06-27 11:12:32", "2019-06-27 11:14:32", "2020-07-28 11:24:32", "2020-02-27 13:30:21"]
created_at = [datetime.strptime(x, "%Y-%m-%d %H:%M:%S") for x in created_at_strings]
df_t = pd.DataFrame({'id_t': [0, 1, 2, 3], 'id_u': [1, 1, 1, 2], 'created_at': created_at})

maxtimestamp = df_t["created_at"].max()
g = df_t.groupby(["id_u"]).agg(tweetsTotal = ("id_t","count"),
                               timespan_hours = ("created_at", lambda x: hours_rounded_up(maxtimestamp, x.min())),
                               tweetsPerHour = ("created_at", lambda x: x.count()/hours_rounded_up(maxtimestamp, x.min()))
                              )
df_final = g.reset_index()

The output is:

    id_u    tweetsTotal timespan_hours  tweetsPerHour
0   1       3           9529            0.000315
1   2       1           3646            0.000274

You don't actually need the tweetsTotal and timespan_hours columns, but they may be useful for validity testing.

That doesn't look exactly like your df_u_new output because of my assumptions about the timeframe to average over. You could replace my variable maxtimestamp with x.max() if you want to average over only the user's own timeframe.
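That per-user variant would look roughly like this (same hypothetical data as above, with the helper inlined so the snippet runs on its own):

```python
import math
import pandas as pd

def hours_rounded_up(datemax, datemin):
    # total number of hour-long intervals covered by [datemin, datemax]
    delta = datemax - datemin
    return math.floor(delta.days * 24 + delta.seconds / 3600) + 1

df_t = pd.DataFrame({
    "id_t": [0, 1, 2, 3],
    "id_u": [1, 1, 1, 2],
    "created_at": pd.to_datetime(["2019-06-27 11:12:32", "2019-06-27 11:14:32",
                                  "2020-07-28 11:24:32", "2020-02-27 13:30:21"]),
})

# Each user's span now runs from their own first to their own last tweet
g = df_t.groupby("id_u").agg(
    tweetsPerHour=("created_at",
                   lambda x: x.count() / hours_rounded_up(x.max(), x.min()))
)
print(g)
```

Note that a user with a single tweet (min == max) gets a span of one hour interval, so their rate is simply 1.0.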

Hope that helps! :)

Vincent Rupp

You can try using pd.resample. Based on the sample code you shared:

df_t.set_index(pd.to_datetime(df_t['timestamp']),drop=True).groupby('id_u').resample('1h')[['id_t']].count().replace(0, np.nan).dropna()

Out:
                          id_t
id_u timestamp                
1    2019-06-27 11:00:00   2.0
     2020-07-28 11:00:00   1.0
2    2020-02-27 13:00:00   1.0

The output shows the count of tweets per user, per day, and per hour. Afterwards, you can easily compute the overall mean tweets per hour per user with another groupby:


out = df_t.set_index(pd.to_datetime(df_t['timestamp']), drop=True).groupby('id_u').resample('1h')[['id_t']].count().replace(0, np.nan).dropna()

print(out.groupby('id_u').mean().rename(columns={'id_t':'tweets_per_h'}))

Out:
      tweets_per_h
id_u              
1              1.5
2              1.0
Yolao_21

You may be able to do something as simple as

df.groupby('A').mean()

But as the other responses have noted, it's difficult to know exactly what to suggest without a reproducible example.
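On toy data shaped like the already-counted per-hour table, that one-liner would look like this (column names here are hypothetical):

```python
import pandas as pd

# Hypothetical data: per-hour tweet counts already computed, keyed by user column "A"
df = pd.DataFrame({"A": [1, 1, 2], "tweets": [2, 1, 1]})

# Average the per-hour counts for each user
means = df.groupby("A").mean()
print(means)
```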


If in doubt, you should try just iterating over the groupby object, because then you get to work with a normal DataFrame, which in my experience is much easier to work with. So something like:

for index, group in tweets_df.groupby('user_id'):
    ...

gives you a clean DataFrame for each user in group (it can also help to step through with a debugger to see exactly where the values you need are). Hope this might help :)
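Putting that loop to work on the question's sample data, one plausible sketch is to truncate each timestamp to its hour inside the loop and average the resulting counts (dt.floor("h") is my choice here, not something from the question):

```python
import pandas as pd

tweets_df = pd.DataFrame({
    "user_id": [1, 1, 1, 2],
    "created_at": pd.to_datetime(["2019-06-27 11:12:32", "2019-06-27 11:14:32",
                                  "2020-07-28 11:24:32", "2020-02-27 13:30:21"]),
})

averages = {}
for user_id, group in tweets_df.groupby("user_id"):
    # count tweets per calendar hour for this user, then average those counts
    per_hour = group.groupby(group["created_at"].dt.floor("h")).size()
    averages[user_id] = per_hour.mean()

print(averages)
```

User 1 has two tweets in one hour and one tweet in another, so their average is 1.5; user 2 tweeted once in a single hour, giving 1.0.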

maxxel_