I have a pandas DataFrame of 70,000 tennis games (rows) with two issues:
Every game is duplicated: for each game between players A and B, there is one row where A is player A and B is player B, and another row with the roles reversed. This happens because I extracted all games played by each player, so I have all of Nadal's games and all of Federer's games. In the games extracted from Nadal's page, Nadal is player A and Federer is player B; in the games extracted from Federer's page, Federer is player A and Nadal is player B.
The second issue is that each row only has information about player A. Using the same example: in the rows where Nadal is player A facing Federer, I have Nadal's height, age and ranking but not Federer's, and in the rows where Federer is player A facing Nadal, I have Federer's height, age and ranking but not Nadal's.
Below is an example of the data for a better understanding:
Player A | Rank | Height | Age | Tourn. | Year | Round | Player B | Result |
---|---|---|---|---|---|---|---|---|
Nadal | 3 | 185 | 37 | US Open | 2019 | Finals | Federer | W |
Federer | 7 | 183 | 40 | US Open | 2019 | Finals | Nadal | L |
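For reference, this toy example can be built directly as a DataFrame like this (the real data comes from a CSV; the columns are exactly the ones shown above):

```python
import pandas as pd

# toy version of the data, with the same columns as the table above
games = pd.DataFrame(
    {
        "Player A": ["Nadal", "Federer"],
        "Rank": [3, 7],
        "Height": [185, 183],
        "Age": [37, 40],
        "Tourn.": ["US Open", "US Open"],
        "Year": [2019, 2019],
        "Round": ["Finals", "Finals"],
        "Player B": ["Federer", "Nadal"],
        "Result": ["W", "L"],
    }
)
```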
My objective is to put the information of both players in the same row, like this:
Player A | Rank | Height | Age | Tourn. | Year | Round | Player B | Rank_B | Height_B | Age_B | Result |
---|---|---|---|---|---|---|---|---|---|---|---|
Nadal | 3 | 185 | 37 | US Open | 2019 | Finals | Federer | 7 | 183 | 40 | W |
And then remove all the duplicate rows.
I have already solved the issue with a nested for loop that compares every row against every other row. Once the criteria I set are met, I update the rows. I consider a game a duplicate if the same players face each other in the same year, tournament and round.
```python
import pandas as pd
import numpy as np

games = pd.read_csv("games.csv")

# create the new columns that will hold the opponent's info:
games["Rank_B"] = np.nan
games["Height_B"] = np.nan
games["Age_B"] = np.nan

# loop through every row:
for i in range(len(games)):
    # if the row was already marked for deletion, skip it
    if games.loc[i, "p_name"] == "Delete":
        continue
    # compare this row against every other row:
    for j in range(len(games)):
        if (games.loc[i, "Tourn."] == games.loc[j, "Tourn."]
                and games.loc[i, "Year"] == games.loc[j, "Year"]
                and games.loc[i, "Round"] == games.loc[j, "Round"]
                and games.loc[i, "Player A"] == games.loc[j, "Player B"]):
            # copy the opponent's info into the current row:
            games.loc[i, "Height_B"] = games.loc[j, "Height"]
            games.loc[i, "Rank_B"] = games.loc[j, "Rank"]
            games.loc[i, "Age_B"] = games.loc[j, "Age"]
            # mark the mirrored row for deletion because it is a duplicate:
            games.loc[j, "p_name"] = "Delete"
            break

# drop the rows marked as duplicates:
games = games[games["p_name"].str.contains("Delete") == False]
```
The problem is that my solution is very slow: it takes a whopping 12 hours to run on 70,000 rows. If I want to run this code on a DataFrame with 1,000,000 rows, this solution is impractical.
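For what it's worth, I wondered whether a vectorized self-merge could replace the nested loops, but I am not sure it is the right approach. Here is a rough, untested sketch of what I mean, assuming the column names from the example above:

```python
import pandas as pd

games = pd.read_csv("games.csv")  # same file as above

key = ["Tourn.", "Year", "Round"]

# build a lookup of each player's own info, renamed so it can be joined
# onto the opponent's row
opponent_info = games[key + ["Player A", "Rank", "Height", "Age"]].rename(
    columns={
        "Player A": "Player B",
        "Rank": "Rank_B",
        "Height": "Height_B",
        "Age": "Age_B",
    }
)

# attach the opponent's info to every row by matching on tournament,
# year, round and opponent name
merged = games.merge(opponent_info, on=key + ["Player B"], how="left")

# keep only one row per game, e.g. the one whose Player A name sorts first;
# Result still reflects Player A's outcome in the kept row
deduped = merged[merged["Player A"] < merged["Player B"]]
```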
Can you think of a better way to accomplish my objective?