0

I recently used groupby() and np.cumsum() to cumulatively count all the games the winner of a current tennis match won within a given tournament. I used this code to get the column winner_tourney_games_cumulative in the first table:

df.groupby(['tourney_name', 'year', 'winner_name'])['winner_games_played'].cumsum()

dataset of tennis matches for one tournament:

winner_name loser_name tourney_name round year winner_tourney_games_cumulative winner_games_played loser_games_played
Roger Federer Sumit Nagal US Open R128 2019 22 22 13
Roger Federer Damir Dzumhur US Open R64 2019 43 21 15
Roger Federer Daniel Evans US Open R32 2019 61 18 5
Roger Federer David Goffin US Open R16 2019 79 18 4
Grigor Dimitrov Roger Federer US Open QF 2019 87 24 22

This is relatively straightforward for the winners, but not so much for the losers. Given that the loser in the current match made it past the first round of the tournament, this means they have won at least one match before, and hence was the winner of a previous match. This complicates things since the column for winners and losers are separate in the data. So, I believe that the short code with np.cumsum()I wrote (shown above) won't work for the losers.

Can anyone help me on how to create something similar for the losers as I did for the winners (the column winner_tourney_games_cumulative)? What I would like is the cumulative sum of the number of games the loser has won in previous matches of this tournament.

I am envisioning to look something like this for the losers (you can just ignore the x's below):

winner_name loser_name loser_tourney_games_cumulative
Roger Federer Sumit Nagal X
Roger Federer Damir Dzumhur X
Roger Federer Daniel Evans X
Roger Federer David Goffin X
Grigor Dimitrov Roger Federer 101

For Roger Federer in the match he lost, it should be 79 (the cumulative games he won in first table from past matches) + 22 (how many games he won in this match) = 101.

Thanks so much in advance!

Rowan P.
  • 1
  • 1
  • 1
    [Please do not post images of data/code when posting a question](https://meta.stackoverflow.com/questions/285551/why-not-upload-images-of-code-errors-when-asking-a-question), please edit your question to include a [reproducible example](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples). – anky Mar 28 '21 at 09:59

1 Answers1

1

This requires a few transformations. We start by melting the df to list winners and losers line by line, for each game:

df2 = (df.melt(id_vars ='round', value_vars = ['winner_name','loser_name'], var_name = 'win_lose', value_name = 'name')
   .join(
 df.melt(id_vars ='round', value_vars = ['winner_games_played','loser_games_played'], value_name = 'games_won')['games_won'] 
   )
)
df2

we get

    round    win_lose     name               games_won
--  -------  -----------  ---------------  -----------
 0  R128     winner_name  Roger Federer             22
 1  R64      winner_name  Roger Federer             21
 2  R32      winner_name  Roger Federer             18
 3  R16      winner_name  Roger Federer             18
 4  QF       winner_name  Grigor Dimitrov           24
 5  R128     loser_name   Sumit Nagal               13
 6  R64      loser_name   Damir Dzumhur             15
 7  R32      loser_name   Daniel Evans               5
 8  R16      loser_name   David Goffin               4
 9  QF       loser_name   Roger Federer             22

Now we can calc cumsum by player name:

df2['cum_games'] = df2.groupby('name').cumsum()
df2

so we get

    round    win_lose     name               games_won    cum_games
--  -------  -----------  ---------------  -----------  -----------
 0  R128     winner_name  Roger Federer             22           22
 1  R64      winner_name  Roger Federer             21           43
 2  R32      winner_name  Roger Federer             18           61
 3  R16      winner_name  Roger Federer             18           79
 4  QF       winner_name  Grigor Dimitrov           24           24
 5  R128     loser_name   Sumit Nagal               13           13
 6  R64      loser_name   Damir Dzumhur             15           15
 7  R32      loser_name   Daniel Evans               5            5
 8  R16      loser_name   David Goffin               4            4
 9  QF       loser_name   Roger Federer             22          101

Perhaps this is sufficient for your purposes. But if not, we get it back into (more or less) original shape via groupby and unstack

df2.groupby(['round','win_lose'], sort = False).first().unstack(level = 1)

to get

            name                          games_won                 cum_games
win_lose    winner_name     loser_name    winner_name   loser_name  winner_name loser_name
round                       
R128        Roger Federer   Sumit Nagal   22            13          22          13
R64         Roger Federer   Damir Dzumhur 21            15          43          15
R32         Roger Federer   Daniel Evans  18             5          61          5
R16         Roger Federer   David Goffin  18             4          79          4
QF          Grigor Dimitrov Roger Federer 24            22          24          101

Apart from somewhat different column names, the data is what you want, ie the ('cum_games', 'loser_name') column is the cumulative number of games won by the loser of each round

piterbarg
  • 8,089
  • 2
  • 6
  • 22