1

I have a dataset that looks like this:

    country            date_added
0   United States       01/2013
1   United Kingdom      03/2014
2   Egypt               03/2014
3   United States       03/2014
4   United States       03/2014
5   United Kingdom      06/2015
6   United States       06/2015

And I would like a running cumulative total of each country by date, ie:

    date_added         country         cumulative_count
0   01/2013             United States          1
1   03/2014             United Kingdom         1
2   03/2014             Egypt                  1
3   03/2014             United States          2
4   06/2015             United Kingdom         2
5   06/2015             United States          4

I tried grouping by two levels but .count() doesn't work (the count doesn't show up at all) whereas .size() does:

cumulative_by_date = new_df.groupby(['date_added','country']).size()

I don't know how to apply this question's solution with .size() to get a cumulative sum.

exlo
  • 315
  • 1
  • 8
  • 20
  • Unfortunately it does not since .size() is what is giving me a total count by time period and not .count(). Updated question and edited for clarity. – exlo Feb 17 '21 at 03:31
  • 1
    `new_df.groupby(['date_added', 'country']).size().groupby(level=1).cumsum().reset_index(name='cumulative_count')`' – BigBen Feb 17 '21 at 13:49
  • Thank you, this does work. What does the second groupby (ie groupby(level=1)) do? I thought both levels would be grouped in the first .groupby() – exlo Feb 17 '21 at 20:43
  • 1
    I thought the explanation / walk-through in the second linked question was helpful. – BigBen Feb 17 '21 at 20:45
  • the 3/24 united states count looks incorrect. It should be 2 not 3 – Golden Lion Feb 18 '21 at 17:29
  • That is true, I've edited the example so it is accurate – exlo Feb 21 '21 at 17:43

1 Answers1

2

Following the approach of the second linked question, here's a double groupby with cumsum and reset_index:

df.groupby(['date_added', 'country']).size()
  .groupby(['country']).cumsum().reset_index(name='cumulative_count')

Output:

  date_added         country  cumulative_count
0    01/2013   United States                 1
1    03/2014           Egypt                 1
2    03/2014  United Kingdom                 1
3    03/2014   United States                 3
4    06/2015  United Kingdom                 2
5    06/2015   United States                 4

In steps:

# size by date and country
print(df.groupby(['date_added', 'country']).size())

# output
date_added  country       
01/2013     United States     1
03/2014     Egypt             1
            United Kingdom    1
            United States     2
06/2015     United Kingdom    1
            United States     1
# cumulative sum by country
print(df.groupby(['date_added', 'country']).size()
        .groupby(['country']).cumsum())

# output
date_added  country       
01/2013     United States     1
03/2014     Egypt             1
            United Kingdom    1
            United States     3
06/2015     United Kingdom    2
            United States     4
# reset index
print(df.groupby(['date_added', 'country']).size()
        .groupby(['country']).cumsum().reset_index(name='cumulative_count'))

# output
  date_added         country  cumulative_count
0    01/2013   United States                 1
1    03/2014           Egypt                 1
2    03/2014  United Kingdom                 1
3    03/2014   United States                 3
4    06/2015  United Kingdom                 2
5    06/2015   United States                 4
BigBen
  • 46,229
  • 7
  • 24
  • 40