I have a dataframe of tweets and I'm looking to group the dataframe by date and generate a column that contains a cumulative list of all the unique users who have posted up to that date. None of the existing functions (e.g., cumsum) would appear to work for this. Here's a sample of the original tweet dataframe, where the index (created_at) is in datetime format:
In [3]: df
Out[3]:
screen_name
created_at
04-01-16 Bob
04-01-16 Bob
04-01-16 Sally
04-01-16 Sally
04-02-16 Bob
04-02-16 Miguel
04-02-16 Tim
I can collapse the dataset by date and get a column with the unique users per day:
In [4]: df[['screen_name']].groupby(df.index.date).aggregate(lambda x: set(list(x)))
Out[4]: from_user_screen_name
2016-04-02 {Bob, Sally}
2016-04-03 {Bob, Miguel, Tim}
So far so good. But what I'd like is to have a "cumulative set" like this:
Out[4]: Cumulative_list_up_to_this_date Cumulative_number_of_unique_users
2016-04-02 {Bob, Sally} 2
2016-04-03 {Bob, Sally, Miguel, Tim} 4
Ultimately, what I am really interested in is the cumulative number in the last column so I can plot it. I've considered looping over dates and other things but can't seem to find a good way. Thanks in advance for any help.