3

I am a primarily JS developer who is trying to pick up pandas and perform some data analysis. Part of this analysis includes converting a team's match performances (win/loss) to a numeric rating (based off win percentage).

TLDR: I'm trying to get from DF 1 to DF 3.

DF 1

|   season  | opponent  |   outcome |
-------------------------------------
|   2020    |   A       |   w       |
|   2020    |   A       |   l       |
|   2020    |   B       |   w       |
|   2020    |   B       |   w       |
|   2020    |   C       |   l       |
|   2020    |   C       |   l       |
|   2021    |   A       |   w       |
|   2021    |   A       |   w       |
|   2021    |   B       |   w       |
|   2021    |   B       |   l       |
|   2021    |   C       |   w       |
|   2021    |   C       |   w       |

I need to calculate the winning percentage, grouped by season and opponent.

DF 2

|   season  | opponent  |  win %    |
-------------------------------------
|   2020    |   A       |   50      |
|   2020    |   B       |   100     |
|   2020    |   C       |   0       |
|   2021    |   A       |   100     |
|   2021    |   B       |   50      |
|   2021    |   C       |   100     |

After that, we need to calculate the rating per season. This is done by averaging the win % across teams in the same season, but with the caveat that the win % agains team A is worth twice that of other teams. This is just arbitrary formula, the actual calculation is more complex (different opponents have different weights - I need a way to pass that as part of a custom Lambda function or something) but I'm trying to simplify things for this question.

DF 3

|   season  |   rating  |
-------------------------
|   2020    |   50.0    |
|   2021    |   87.5    |

Rating calculations example: Season 2020 rating = team A % * 2 + team B win % + team C win % / (total no of teams + 1) = (50% * 2 + 100% + 0%) / (3 + 1) = 50.0

How can we get from the first to the last dataframe using pandas? I am able to get to a version of DF 2 by using the following

df2 = df1.groupby(["season", "opponent"])["outcome"].value_counts(normalize = True).to_frame()

This frame includes percentages for losses which are not needed, but it should not matter if I am able to filter/drop that as part of the "transformation" to DF 3.

I have been trying to do things like df2 = df2[df2["outcome"] != "w"], or df2 = df2.query('outcome != "w"') to drop the additional rows with the loss condition based on answers to another question, but to no avail. I suspect this is because outcome is a nested column. Have also noticed this question but what I think I need is a "wildcard" to access the nested outcome column regardless of opponent.

Note: if there are more efficient ways to get from DF 1 to DF 3 directly (this seems close but not quite), I'm happy to explore those methods too.

Alvin Teh
  • 778
  • 1
  • 10
  • 17

2 Answers2

2

You can get df2 as follows:

df2 = (df1.groupby(["season", "opponent"])["outcome"]
          .value_counts(normalize=True)
          .unstack(fill_value=0).stack(dropna=False)
          .mul(100)
          .reset_index(name='win %')
          .query('outcome == "w"')
      ).reset_index(drop=True)

Result

print(df2)

   season opponent outcome  win %
0    2020        A       w   50.0
1    2020        B       w  100.0
2    2020        C       w    0.0
3    2021        A       w  100.0
4    2021        B       w   50.0
5    2021        C       w  100.0

Then, to use the formula to get df3, you can use:

df2a = df2.set_index('season')

# Get: (team A % * 2 + team B win % + team C win %)
df3_x = (df2a.loc[df2a['opponent'] =='A', 'win %'] * 2 
             + df2a.loc[df2a['opponent'] =='B', 'win %'] 
             + df2a.loc[df2a['opponent'] =='C', 'win %']
        )

# Get (total no of teams + 1) for a particular season
df3_y = df2.groupby('season')['opponent'].count() + 1

df3 = (df3_x / df3_y).reset_index(name='rating')

Result

print(df3)

   season  rating
0    2020    50.0
1    2021    87.5

For your reference, here is the interim results during derivation of df3:

# team A % * 2 + team B win % + team C win % 
print(df3_x)

season
2020    200.0
2021    350.0
Name: win %, dtype: float64
# (total no of teams + 1) for a particular season
print(df3_y)

season
2020    4
2021    4
Name: opponent, dtype: int64
SeaBean
  • 22,547
  • 3
  • 13
  • 25
1
import pandas as pd

df_test = pd.DataFrame(data={'season':[2020]*6 + [2021]*6, 'opponent': ['A', 'A', 'B', 'B', 'C', 'C']*2,
                        'outcome': ['w', 'l', 'w', 'w', 'l', 'l', 'w', 'w', 'w', 'l', 'w', 'w']})

df_weightage = pd.DataFrame(data={'season':[2020]*3 + [2021]*3, 'opponent': ['A', 'B', 'C']*2,
                        'weightage': [0.2, 0.3, 0.5, 0.1, 0.2, 0.7]})

print(df_test)
print('='*30)
print(df_weightage)
print('='*35)

def get_pct(data):
    return len(data[data == 'w'])/len(data)

def get_rating(data):
    return sum(data['win_percentage']*data['weightage'])/len(data)

df_test = df_test.groupby(["season", "opponent"])["outcome"].apply(get_pct).rename('win_percentage').reset_index()
print(df_test)
print('='*45)

df_test = df_test.merge(df_weightage, how= 'left', on=['season', 'opponent'])
print(df_test)
print('='*45)

df_ratings = df_test.groupby(['season'])[['win_percentage', 'weightage']].apply(get_rating).rename('ratings').reset_index()
print(df_ratings)
Muhammad Hassan
  • 4,079
  • 1
  • 13
  • 27
  • Thanks for your answer! I've verified it works but it would be better if you could add some comments so that it's easier to understand. This is also considering you're using a weight column as well. – Alvin Teh Aug 31 '21 at 12:50