
I have a pandas DataFrame of 70,000 tennis games (rows) with two issues:

  1. Every game is duplicated: for every game between players A and B, there is one row where A is player A and B is player B, and another row where the roles are swapped. This happens because I extracted all games played for each player, so I have all the games Nadal played, and then all the games Federer played. In the games I extracted from Nadal's page, Nadal is player A and Federer is player B; in the games I extracted from Federer's page, Federer is player A and Nadal is player B.

  2. For every game, I only have info about player A. Using the example above: in the games I extracted where Nadal is player A, facing Federer, I have Nadal's height, age and ranking, but not Federer's; and in the games I extracted where Federer is player A, facing Nadal, I have Federer's height, age and ranking, but not Nadal's.

Below is an example of the data for a better understanding:

Player A   Rank   Height   Age   Tourn.    Year   Round    Player B   Result
Nadal      3      185      37    US Open   2019   Finals   Federer    W
Federer    7      183      40    US Open   2019   Finals   Nadal      L

My objective is to put the information of both players in the same row, like this:

Player A   Rank   Height   Age   Tourn.    Year   Round    Player B   Rank_B   Height_B   Age_B   Result
Nadal      3      185      37    US Open   2019   Finals   Federer    7        183        40      W

And then remove all duplicate rows.

I have already solved the issue with a for loop inside a for loop, comparing every pair of rows; once my criterion is met, I update the rows. I consider a game a duplicate if the same players face each other in the same year, tournament and round.


import pandas as pd
import numpy as np

games = pd.read_csv("games.csv")

# create the new columns that will hold the opponent's info:

games["Rank_B"] = np.nan
games["Height_B"] = np.nan
games["Age_B"] = np.nan


# loop through every row:

for i in range(len(games)):

    # if the row was already marked for deletion, skip it
    # (the p_name column from the CSV doubles as the delete marker)

    if games.loc[i, "p_name"] == "Delete":
        continue

    # compare this row to every other row:

    for j in range(len(games)):

        if (games.loc[i, "Tourn."] == games.loc[j, "Tourn."]
                and games.loc[i, "Year"] == games.loc[j, "Year"]
                and games.loc[i, "Round"] == games.loc[j, "Round"]
                and games.loc[i, "Player A"] == games.loc[j, "Player B"]):

            # copy the opponent's attributes onto row i:

            games.loc[i, "Height_B"] = games.loc[j, "Height"]
            games.loc[i, "Rank_B"] = games.loc[j, "Rank"]
            games.loc[i, "Age_B"] = games.loc[j, "Age"]

            # mark row j for deletion because it is the duplicate:

            games.loc[j, "p_name"] = "Delete"

            break

# drop every row that was marked for deletion:
games = games[games["p_name"] != "Delete"]

The problem is that my solution is very slow: it takes a whopping 12 hours to run on 70,000 rows. Because the nested loop compares every row against every other row, the run time grows quadratically with the number of rows, so for a DataFrame of 1,000,000 rows this solution is impractical.

Can you think of a better way to accomplish my objective?

Gil Ramos

2 Answers


Try a self-merge on the game identifiers:

df = pd.merge(left=df, right=df, on=['Tourn.','Round','Year'])

Then keep only the rows where the two sides are mirror images of the same game, i.e. player A on one side is player B on the other:

df = df[(df['Player A_x'] == df['Player B_y']) & (df['Player B_x'] == df['Player A_y'])]

Each game now appears twice, once from each direction, so drop one copy of every pair, for example by keeping the row whose Player A sorts first:

df = df[df['Player A_x'] < df['Player B_x']]

After that you just need to rename the columns.
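
For concreteness, here is a minimal sketch of that approach run end to end on the two sample rows from the question; the column names are taken from the question's example table, and the "Player A sorts first" rule is just one way to pick which copy to keep:

import pandas as pd

# the two mirrored sample rows from the question
df = pd.DataFrame({
    "Player A": ["Nadal", "Federer"],
    "Rank": [3, 7],
    "Height": [185, 183],
    "Age": [37, 40],
    "Tourn.": ["US Open"] * 2,
    "Year": [2019] * 2,
    "Round": ["Finals"] * 2,
    "Player B": ["Federer", "Nadal"],
    "Result": ["W", "L"],
})

# pair every row with every row of the same tournament, year and round
merged = pd.merge(left=df, right=df, on=["Tourn.", "Round", "Year"])

# keep only mirror pairs: A vs. B matched with B vs. A
mirror = merged[(merged["Player A_x"] == merged["Player B_y"])
                & (merged["Player B_x"] == merged["Player A_y"])]

# each game still appears twice (once from each direction);
# keep the copy whose Player A sorts first
deduped = mirror[mirror["Player A_x"] < mirror["Player B_x"]]

# rename the merged columns and drop the leftovers
result = (deduped
          .rename(columns={"Player A_x": "Player A", "Rank_x": "Rank",
                           "Height_x": "Height", "Age_x": "Age",
                           "Player B_x": "Player B", "Result_x": "Result",
                           "Rank_y": "Rank_B", "Height_y": "Height_B",
                           "Age_y": "Age_B"})
          .drop(columns=["Player A_y", "Player B_y", "Result_y"]))

print(result)

Because the pairing happens inside pandas instead of in a Python loop, this should run in seconds rather than hours on 70,000 rows.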
gtomer

Just a thought, but typically when I loop through a DataFrame I use the iterrows function.

Instead of:

for i in range(len(games)):

use something like:

for index, row in games.iterrows():

and then use row['Column'] to access the value you are interested in. I think this will speed up the loop a bit.
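
For illustration, a minimal sketch of the question's outer loop rewritten with iterrows; the games.csv file and the p_name marker column are the ones from the question's code:

import pandas as pd

games = pd.read_csv("games.csv")  # same file as in the question

# outer loop from the question, rewritten with iterrows:
# `row` is a Series, so scalar access becomes row["Column"]
# instead of repeated games.loc[i, "Column"] lookups
for index, row in games.iterrows():
    if row["p_name"] == "Delete":   # row already marked as a duplicate
        continue
    # the inner comparison over the other rows would go here,
    # reading row["Tourn."], row["Year"], row["Round"], row["Player A"]

This avoids the repeated games.loc[i, ...] scalar lookups, though the loop itself still runs in Python.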