Using np.cumsum within a group in pandas dataframe

Question

I recently used groupby() and np.cumsum() to cumulatively count all the games the winner of a current tennis match won within a given tournament. I used this code to get the column winner_tourney_games_cumulative in the first table:

df.groupby(['tourney_name', 'year', 'winner_name'])['winner_games_played'].cumsum()

dataset of tennis matches for one tournament:

winner_name	loser_name	tourney_name	round	year	winner_tourney_games_cumulative	winner_games_played	loser_games_played
Roger Federer	Sumit Nagal	US Open	R128	2019	22	22	13
Roger Federer	Damir Dzumhur	US Open	R64	2019	43	21	15
Roger Federer	Daniel Evans	US Open	R32	2019	61	18	5
Roger Federer	David Goffin	US Open	R16	2019	79	18	4
Grigor Dimitrov	Roger Federer	US Open	QF	2019	87	24	22

This is relatively straightforward for the winners, but not so much for the losers. Given that the loser in the current match made it past the first round of the tournament, this means they have won at least one match before, and hence was the winner of a previous match. This complicates things since the column for winners and losers are separate in the data. So, I believe that the short code with np.cumsum()I wrote (shown above) won't work for the losers.

Can anyone help me on how to create something similar for the losers as I did for the winners (the column winner_tourney_games_cumulative)? What I would like is the cumulative sum of the number of games the loser has won in previous matches of this tournament.

I am envisioning to look something like this for the losers (you can just ignore the x's below):

winner_name	loser_name	loser_tourney_games_cumulative
Roger Federer	Sumit Nagal	X
Roger Federer	Damir Dzumhur	X
Roger Federer	Daniel Evans	X
Roger Federer	David Goffin	X
Grigor Dimitrov	Roger Federer	101

For Roger Federer in the match he lost, it should be 79 (the cumulative games he won in first table from past matches) + 22 (how many games he won in this match) = 101.

Thanks so much in advance!

[Please do not post images of data/code when posting a question](https://meta.stackoverflow.com/questions/285551/why-not-upload-images-of-code-errors-when-asking-a-question), please edit your question to include a [reproducible example](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples). — anky, Mar 28 '21 at 09:59

score 1 · Answer 1 · answered Mar 29 '21 at 20:25

This requires a few transformations. We start by melting the df to list winners and losers line by line, for each game:

df2 = (df.melt(id_vars ='round', value_vars = ['winner_name','loser_name'], var_name = 'win_lose', value_name = 'name')
   .join(
 df.melt(id_vars ='round', value_vars = ['winner_games_played','loser_games_played'], value_name = 'games_won')['games_won'] 
   )
)
df2

we get

    round    win_lose     name               games_won
--  -------  -----------  ---------------  -----------
 0  R128     winner_name  Roger Federer             22
 1  R64      winner_name  Roger Federer             21
 2  R32      winner_name  Roger Federer             18
 3  R16      winner_name  Roger Federer             18
 4  QF       winner_name  Grigor Dimitrov           24
 5  R128     loser_name   Sumit Nagal               13
 6  R64      loser_name   Damir Dzumhur             15
 7  R32      loser_name   Daniel Evans               5
 8  R16      loser_name   David Goffin               4
 9  QF       loser_name   Roger Federer             22

Now we can calc cumsum by player name:

df2['cum_games'] = df2.groupby('name').cumsum()
df2

so we get

    round    win_lose     name               games_won    cum_games
--  -------  -----------  ---------------  -----------  -----------
 0  R128     winner_name  Roger Federer             22           22
 1  R64      winner_name  Roger Federer             21           43
 2  R32      winner_name  Roger Federer             18           61
 3  R16      winner_name  Roger Federer             18           79
 4  QF       winner_name  Grigor Dimitrov           24           24
 5  R128     loser_name   Sumit Nagal               13           13
 6  R64      loser_name   Damir Dzumhur             15           15
 7  R32      loser_name   Daniel Evans               5            5
 8  R16      loser_name   David Goffin               4            4
 9  QF       loser_name   Roger Federer             22          101

Perhaps this is sufficient for your purposes. But if not, we get it back into (more or less) original shape via groupby and unstack

df2.groupby(['round','win_lose'], sort = False).first().unstack(level = 1)

to get

            name                          games_won                 cum_games
win_lose    winner_name     loser_name    winner_name   loser_name  winner_name loser_name
round                       
R128        Roger Federer   Sumit Nagal   22            13          22          13
R64         Roger Federer   Damir Dzumhur 21            15          43          15
R32         Roger Federer   Daniel Evans  18             5          61          5
R16         Roger Federer   David Goffin  18             4          79          4
QF          Grigor Dimitrov Roger Federer 24            22          24          101

Apart from somewhat different column names, the data is what you want, ie the ('cum_games', 'loser_name') column is the cumulative number of games won by the loser of each round

Using np.cumsum within a group in pandas dataframe

1 Answers1