I am a primarily JS developer who is trying to pick up pandas and perform some data analysis. Part of this analysis includes converting a team's match performances (win/loss) to a numeric rating (based off win percentage).
TLDR: I'm trying to get from DF 1 to DF 3.
DF 1
| season | opponent | outcome |
-------------------------------------
| 2020 | A | w |
| 2020 | A | l |
| 2020 | B | w |
| 2020 | B | w |
| 2020 | C | l |
| 2020 | C | l |
| 2021 | A | w |
| 2021 | A | w |
| 2021 | B | w |
| 2021 | B | l |
| 2021 | C | w |
| 2021 | C | w |
I need to calculate the winning percentage, grouped by season and opponent.
DF 2
| season | opponent | win % |
-------------------------------------
| 2020 | A | 50 |
| 2020 | B | 100 |
| 2020 | C | 0 |
| 2021 | A | 100 |
| 2021 | B | 50 |
| 2021 | C | 100 |
After that, we need to calculate the rating per season. This is done by averaging the win % across teams in the same season, but with the caveat that the win % agains team A is worth twice that of other teams. This is just arbitrary formula, the actual calculation is more complex (different opponents have different weights - I need a way to pass that as part of a custom Lambda function or something) but I'm trying to simplify things for this question.
DF 3
| season | rating |
-------------------------
| 2020 | 50.0 |
| 2021 | 87.5 |
Rating calculations example: Season 2020 rating = team A % * 2 + team B win % + team C win % / (total no of teams + 1) = (50% * 2 + 100% + 0%) / (3 + 1) = 50.0
How can we get from the first to the last dataframe using pandas? I am able to get to a version of DF 2 by using the following
df2 = df1.groupby(["season", "opponent"])["outcome"].value_counts(normalize = True).to_frame()
This frame includes percentages for losses which are not needed, but it should not matter if I am able to filter/drop that as part of the "transformation" to DF 3.
I have been trying to do things like df2 = df2[df2["outcome"] != "w"]
, or df2 = df2.query('outcome != "w"')
to drop the additional rows with the loss condition based on answers to another question, but to no avail. I suspect this is because outcome
is a nested column. Have also noticed this question but what I think I need is a "wildcard" to access the nested outcome
column regardless of opponent
.
Note: if there are more efficient ways to get from DF 1 to DF 3 directly (this seems close but not quite), I'm happy to explore those methods too.