
I have a pandas DataFrame of 70,000 tennis games (rows) with two issues:

  1. Every game is duplicated: for every game between players A and B, there is one row where A is player A and B is player B, and another row where the roles are swapped. This happens because I extracted all games played for each player, so I have all the games Nadal played, and then all the games Federer played. In the games I extracted from Nadal's page, Nadal is player A and Federer is player B; in the games I extracted from Federer's page, Federer is player A and Nadal is player B.

  2. For every game, I only have info about player A. Using the example above: in the games I extracted where Nadal is player A, facing Federer, I have Nadal's height, age and ranking, but not Federer's; and in the games I extracted where Federer is player A, facing Nadal, I have Federer's height, age and ranking, but not Nadal's.

Below is an example of the data for a better understanding:

Player A   Rank   Height   Age   Tourn.    Year   Round    Player B   Result
Nadal      3      185      37    US Open   2019   Finals   Federer    W
Federer    7      183      40    US Open   2019   Finals   Nadal      L

My objective is to put the information of both players in the same row, like this:

Player A   Rank   Height   Age   Tourn.    Year   Round    Player B   Rank_B   Height_B   Age_B   Result
Nadal      3      185      37    US Open   2019   Finals   Federer    7        183        40      W

And then remove all duplicate rows.

I have already solved the issue with a for loop inside a for loop, comparing every pair of rows; once my criterion is met, I update the rows. I consider a game a duplicate if the same players face each other in the same year, tournament and round.


import pandas as pd
import numpy as np

games = pd.read_csv("games.csv")

# create the new columns that will hold the opponent's info:

games["Rank_B"] = np.nan
games["Height_B"] = np.nan
games["Age_B"] = np.nan


# loop through every row:

for i in range(len(games)):

    # if the row was already marked for deletion, skip it
    # (the p_name column from the CSV doubles as the delete marker)

    if games.loc[i, "p_name"] == "Delete":
        continue

    # compare this row to every other row:

    for j in range(len(games)):

        if (games.loc[i, "Tourn."] == games.loc[j, "Tourn."]
                and games.loc[i, "Year"] == games.loc[j, "Year"]
                and games.loc[i, "Round"] == games.loc[j, "Round"]
                and games.loc[i, "Player A"] == games.loc[j, "Player B"]):

            # copy the opponent's attributes onto row i:

            games.loc[i, "Height_B"] = games.loc[j, "Height"]
            games.loc[i, "Rank_B"] = games.loc[j, "Rank"]
            games.loc[i, "Age_B"] = games.loc[j, "Age"]

            # mark row j for deletion because it is the duplicate:

            games.loc[j, "p_name"] = "Delete"

            break

# drop every row that was marked for deletion:
games = games[games["p_name"] != "Delete"]

The problem is that my solution is very slow: it takes a whopping 12 hours to run on 70,000 rows. Because the nested loop compares every row against every other row, the run time grows quadratically with the number of rows, so for a DataFrame of 1,000,000 rows this solution is impractical.

Can you think of a better way to accomplish my objective?

Gil Ramos

2 Answers


Try a self-merge on the game identifiers:

df = pd.merge(left=df, right=df, on=['Tourn.','Round','Year'])

Then keep only the rows where the two sides are mirror images of the same game, i.e. player A on one side is player B on the other:

df = df[(df['Player A_x'] == df['Player B_y']) & (df['Player B_x'] == df['Player A_y'])]

Each game now appears twice, once from each direction, so drop one copy of every pair, for example by keeping the row whose Player A sorts first:

df = df[df['Player A_x'] < df['Player B_x']]

After that you just need to rename the columns.
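
For concreteness, here is a minimal sketch of that approach run end to end on the two sample rows from the question; the column names are taken from the question's example table, and the "Player A sorts first" rule is just one way to pick which copy to keep:

import pandas as pd

# the two mirrored sample rows from the question
df = pd.DataFrame({
    "Player A": ["Nadal", "Federer"],
    "Rank": [3, 7],
    "Height": [185, 183],
    "Age": [37, 40],
    "Tourn.": ["US Open"] * 2,
    "Year": [2019] * 2,
    "Round": ["Finals"] * 2,
    "Player B": ["Federer", "Nadal"],
    "Result": ["W", "L"],
})

# pair every row with every row of the same tournament, year and round
merged = pd.merge(left=df, right=df, on=["Tourn.", "Round", "Year"])

# keep only mirror pairs: A vs. B matched with B vs. A
mirror = merged[(merged["Player A_x"] == merged["Player B_y"])
                & (merged["Player B_x"] == merged["Player A_y"])]

# each game still appears twice (once from each direction);
# keep the copy whose Player A sorts first
deduped = mirror[mirror["Player A_x"] < mirror["Player B_x"]]

# rename the merged columns and drop the leftovers
result = (deduped
          .rename(columns={"Player A_x": "Player A", "Rank_x": "Rank",
                           "Height_x": "Height", "Age_x": "Age",
                           "Player B_x": "Player B", "Result_x": "Result",
                           "Rank_y": "Rank_B", "Height_y": "Height_B",
                           "Age_y": "Age_B"})
          .drop(columns=["Player A_y", "Player B_y", "Result_y"]))

print(result)

Because the pairing happens inside pandas instead of in a Python loop, this should run in seconds rather than hours on 70,000 rows.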
gtomer

Just a thought, but typically when I loop through a DataFrame I use the iterrows function.

Instead of:

for i in range(len(games)):

use something like:

for index, row in games.iterrows():

and then use row['Column'] to access the value you are interested in. I think this will speed up the loop a bit.
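
For illustration, a minimal sketch of the question's outer loop rewritten with iterrows; the games.csv file and the p_name marker column are the ones from the question's code:

import pandas as pd

games = pd.read_csv("games.csv")  # same file as in the question

# outer loop from the question, rewritten with iterrows:
# `row` is a Series, so scalar access becomes row["Column"]
# instead of repeated games.loc[i, "Column"] lookups
for index, row in games.iterrows():
    if row["p_name"] == "Delete":   # row already marked as a duplicate
        continue
    # the inner comparison over the other rows would go here,
    # reading row["Tourn."], row["Year"], row["Round"], row["Player A"]

This avoids the repeated games.loc[i, ...] scalar lookups, though the loop itself still runs in Python.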