Pandas - cumsum() on groupby() over multiple axis

Question

I’m new to Pandas. I have a data set that is horse racing results. Example here:

RaceID  RaceDate    RaceMeet   Position Horse       Jockey          Trainer        RaceLength race win  HorseWinPercentage 
446252  01/01/2008  Southwell (AW)  1   clear reef  tom mclaughlin  jane chapple-hyam   3101    1   1   0
447019  14/01/2008  Southwell (AW)  5   clear reef  tom mclaughlin  jane chapple-hyam   2654    1   0   100
449057  21/02/2008  Southwell (AW)  2   clear reef  tom mclaughlin  jane chapple-hyam   3101    1   0   50
463805  26/08/2008  Chelmsford (AW) 6   clear reef  tom mclaughlin  jane chapple-hyam   3080    1   0   33.33333333
469220  27/11/2008  Chelmsford (AW) 3   clear reef  tom mclaughlin  jane chapple-hyam   3080    1   0   25
470195  11/12/2008  Chelmsford (AW) 5   clear reef  tom mclaughlin  jane chapple-hyam   3080    1   0   20
471052  26/12/2008  Wolhampton (AW) 1   clear reef  andrea atzeni   jane chapple-hyam   2690    1   1   16.66666667
471769  07/01/2009  Wolhampton (AW) 6   clear reef  ian mongan      jane chapple-hyam   2690    1   0   28.57142857
472137  13/01/2009  Chelmsford (AW) 2   clear reef  jamie spencer   jane chapple-hyam   3080    1   0   25
472213  20/01/2009  Southwell (AW)  5   clear reef  jamie spencer   jane chapple-hyam   2654    1   0   22.22222222
476595  25/03/2009  Kempton (AW)    4   clear reef  pat cosgrave    jane chapple-hyam   2639    1   0   20
477674  08/04/2009  Kempton (AW)    5   clear reef  pat cosgrave    jane chapple-hyam   2639    1   0   18.18181818
479098  21/04/2009  Kempton (AW)    3   clear reef  andrea atzeni   jane chapple-hyam   2639    1   0   16.66666667
492913  14/11/2009  Wolhampton (AW) 1   clear reef  andrea atzeni   jane chapple-hyam   3639    1   1   15.38461538
493720  25/11/2009  Kempton (AW)    3   clear reef  andrea atzeni   jane chapple-hyam   3518    1   0   21.42857143
495863  29/12/2009  Southwell (AW)  1   clear reef  shane kelly     jane chapple-hyam   3101    1   1   20

I want to be able to groupby() multiple axis to count up wins and create combination win percentages or results at specific track and lengths.

When I just need to groupby a single axis – it works great:

df['horse_win_count'] = df.groupby(['Horse'])['win'].cumsum()
df['horse_race_count'] = df.groupby(['Horse'])['race'].cumsum()
df['HorseWinPercentage2'] = df['horse_win_count'] / df['horse_race_count'] * 100
df['HorseWinPercentage'] = df.groupby('Horse')['HorseWinPercentage2'].shift(+1)

However when I need to groupby more than one axis I get some really weird results.

For example I was to create a win percentage for when a specific Jockey rides a specific Trainers’ horse – groupby([‘Jockey’,’Trainer’]). Then I need to know the percentage as it changes for each individual row (race).

df['jt_win_count'] = df.groupby(['Jockey','Trainer'])['win'].cumsum()
df['jt_race_count'] = df.groupby(['Jockey','Trainer'])['race'].cumsum()
df['JTWinPercentage2'] = df['jt_win_count'] / df['jt_race_count'] * 100
df['JTWinPercentage'] = df.groupby(['Jockey','Trainer'])['JTWinPercentage2'].shift(+1)
df['JTWinPercentage'].fillna(0, inplace=True)

Or I want to count up the number of times a horse has won over that course and that distance. So I need to groupby([‘Horse’, ‘RaceMeet’,’RaceLength’]):

df['CD'] = df.groupby([‘RaceMeet’,’RaceLength’,’Horse’])[‘win’].cumsum()
df['CD'] = df.groupby(["RaceMeet","RaceLength","Horse"]).shift(+1)

I get results in the 10s of 1000s.

How can I groupby several axis, make a computation and shift the results back by one entry while grouped by several entries?

And even better can you explain why my code above doesn’t work? Like I say I’m new to Pandas and keen to learn.

Cheers.

for the `CD`, it should be `df['CD'] = df.groupby(...)['CD'].shift()`. — Quang Hoang, Sep 25 '19 at 11:54

score 1 · Answer 1 · answered Sep 25 '19 at 11:58

1

Question was already asked: Pandas DataFrame Groupby two columns and get counts and here python pandas groupby() result

I do not really know what your goal is though. I guess you should first add another column with the new parameter you want to group by. for example: df['jockeyTrainer']=df.loc['Jockey']+df.loc['Trainer'] Then you can use this to groupby. Or you follow the information in the links.

answered Sep 25 '19 at 11:58

kostja

355
1
9

Thanks for linking to those other answers, but they both are asking for a total sum, rather than a cumulative sum. So neither work here. I'll try your other suggestion, but it feels like there should be more direct solution than a work around. Cheers. – Niazipan Sep 25 '19 at 13:46
If you have a look at the linked questions you will see how groupby two columns works. This furthermore leads to the answer of your question why your code does not work. You can use your code above, but if you want to use cumsum on the result you first need to decide on the column you want to cumsum. In the example you gave, the choice would be the Jockey i guess... – kostja Sep 26 '19 at 08:52

Pandas - cumsum() on groupby() over multiple axis

1 Answers1