Pandas count over groups

Question

I have a pandas dataframe that looks as follows:

ID  round   player1     player2 
1   1       A           B           
1   2       A           C
1   3       B           D
2   1       B           C           
2   2       C           D
2   3       C           E
3   1       B           C           
3   2       C           D
3   3       C           A

The dataframe contains sport match results, where the ID column denotes one tournament, the round column denotes the round for each tournament, and player1 and player2 columns contain the names of players that played against eachother in the respective round.

I now want to cumulatively count the tournament participations for, say, player A. In pseudocode this means: If the player with name A comes up in either the player1 or player2 column per tournament ID, increment the counter by 1.

The result should look like this (note: in my example player A did participate in tournaments with the IDs 1 and 3):

ID  round   player1     player2     playerAparticipated
1   1       A           B           1
1   2       A           C           1
1   3       B           D           1
2   1       B           C           0
2   2       C           D           0
2   3       C           E           0
3   1       B           C           2
3   2       C           D           2
3   3       C           A           2

My current status is, that I added a "helper" column containing the values 1 or 0 denoting, if the respective player participated in the tournament:

ID  round   player1     player2     helper
1   1       A           B           1
1   2       A           C           1
1   3       B           D           1
2   1       B           C           0
2   2       C           D           0
2   3       C           E           0
3   1       B           C           1
3   2       C           D           1
3   3       C           A           1

I think that I just need one final step, e.g., a smart use of cumsum() that counts the helper column in the desired way. However, I could not come up with the solution yet.

Do you need the values inside the dataframe? `len(df[(df.player1 == 'A') | (df.player2 == 'A')].groupby('ID').count())` would give you the number of tournaments Player 'A' participated. — rafasc, Jul 15 '17 at 17:02
yes, I would prefer to have the values in the original dataframe. — beta, Jul 15 '17 at 17:13

jezrael · Accepted Answer · 2017-07-15T19:33:49.850

I think you need:

drop_duplicates by column ID first and then set_index
filter out 0 values by boolean indexing, cumsum and last reindex for add 0 for missing index values
new column create by map

df1 = df.drop_duplicates('ID').set_index('ID')
s = df1.loc[df1['helper'] != 0, 'helper'].cumsum().reindex(index=df1.index, fill_value=0)
df['playerAparticipated'] = df['ID'].map(s)
print (df)
   ID  round player1 player2  helper  playerAparticipated
0   1      1       A       B       1                    1
1   1      2       A       C       1                    1
2   1      3       B       D       1                    1
3   2      1       B       C       0                    0
4   2      2       C       D       0                    0
5   2      3       C       E       0                    0
6   3      1       B       C       1                    2
7   3      2       C       D       1                    2
8   3      3       C       A       1                    2

Instead map is possible use join with rename:

df = df.join(s.rename('playerAparticipated'), on='ID')
print (df)
   ID  round player1 player2  helper  playerAparticipated
0   1      1       A       B       1                    1
1   1      2       A       C       1                    1
2   1      3       B       D       1                    1
3   2      1       B       C       0                    0
4   2      2       C       D       0                    0
5   2      3       C       E       0                    0
6   3      1       B       C       1                    2
7   3      2       C       D       1                    2
8   3      3       C       A       1                    2

one question related to your answer: why exactly is `set_index` needed? what is the purpose? — beta, Jul 15 '17 at 19:37
In my solution it let column untouched by creating `index` - because need process `helper` column — jezrael, Jul 15 '17 at 19:39

score 1 · Answer 2 · answered Jul 15 '17 at 17:04

A similar approach to @jezrael that I cooked up a little slower :).

First, move ID into your index:

df = df.reset_index().set_index(['index','ID'])
#          round player1 player2  helper
# index ID
# 0     1       1       A       B       1
# 1     1       2       A       C       1
# 2     1       3       B       D       1
# 3     2       1       B       C       0
# 4     2       2       C       D       0
# 5     2       3       C       E       0
# 6     3       1       B       C       1
# 7     3       2       C       D       1
# 8     3       3       C       A       1

Next, filter out rows where helper is 0 and get a cumulative sum of tournaments by ID, and assign the result to a variable:

tournament_count = df[df['helper'] > 0].groupby(['ID','helper']).first().reset_index(level=1)['helper'].cumsum().rename("playerAparticipated")
# ID
# 1    1
# 3    2

Finally, join the tournament_count DataFrame with df:

df.join(tournament_counts, how="left").fillna(0)
#          round player1 player2  helper  tournament_counts
# index ID
# 0     1       1       A       B       1                1.0
# 1     1       2       A       C       1                1.0
# 2     1       3       B       D       1                1.0
# 3     2       1       B       C       0                0.0
# 4     2       2       C       D       0                0.0
# 5     2       3       C       E       0                0.0
# 6     3       1       B       C       1                2.0
# 7     3       2       C       D       1                2.0
# 8     3       3       C       A       1                2.0

Pandas count over groups

2 Answers2

Linked