
I have two dataframes I need to map over and combine into one. The first dataframe contains NBA players; its headers are 'Date', 'Player', 'Team', 'Position', 'Salary', 'Pos_ID', 'Minutes', 'FPTS', 'USG' (plus the 'FGA', 'FTA', and 'TO' columns the code below uses). The second dataframe is the first one grouped by date and team, with headers 'Date', 'Team', 'Minutes', 'FGA', 'FTA', 'TO'. I'm trying to calculate the usage rate (USG) for every player in the first dataframe. To do that I need the total minutes, field goal attempts, free throw attempts, and turnovers for each team in each game on a given date; I then divide each player's stats by his team's totals. I have a working solution, but it's really slow and doesn't seem to be the most efficient way of doing this.
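To make the shapes concrete, here's a tiny made-up example of the two frames (the numbers are hypothetical, not my real data):

import pandas as pd

# Hypothetical player rows: two players, same team, same date
player_df = pd.DataFrame({
    'Date': ['2018-01-01', '2018-01-01'],
    'Player': ['A', 'B'],
    'Team': ['BKN', 'BKN'],
    'Minutes': [30, 20],
    'FGA': [15, 10],
    'FTA': [5, 2],
    'TO': [3, 1],
})

# The second frame is just the per-team totals for each date
team_totals = player_df.groupby(['Date', 'Team'], as_index=False)[['Minutes', 'FGA', 'FTA', 'TO']].sum()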

Here is the code:

import pandas as pd

player_df = pd.read_csv('Sample Data') # replace with sample data file

no_dups = player_df.drop_duplicates()
no_dups.loc[:, 'USG'] = pd.Series(dtype=float)  # blank out USG; it is recomputed below
no_dups = no_dups[no_dups.Minutes != 0]  # drop players who didn't play

# Total minutes, FGA, FTA, and TO for each team on each date
grouped_teams = no_dups.groupby(['Date', 'Team']).agg({'Minutes': ['sum'], 'FGA': ['sum'], 'FTA': ['sum'], 'TO': ['sum']})
grouped_teams.columns = ['Minutes', 'FGA', 'FTA', 'TO']
grouped_teams = grouped_teams.reset_index()

# For every player row, scan every (Date, Team) total until the match is
# found, then compute USG with the standard formula
for index, row in no_dups.iterrows():
    for i, r in grouped_teams.iterrows():
        if no_dups.at[index, 'Team'] == grouped_teams.at[i, 'Team'] and no_dups.at[index, 'Date'] == grouped_teams.at[i, 'Date']:
            no_dups.at[index, 'USG'] = (100*((no_dups.at[index, 'FGA'] + 0.44 * no_dups.at[index, 'FTA'] + no_dups.at[index, 'TO'])*(grouped_teams.at[i, 'Minutes']/5))) / (no_dups.at[index, 'Minutes']*(grouped_teams.at[i, 'FGA']+0.44*grouped_teams.at[i, 'FTA']+grouped_teams.at[i, 'TO']))
    
final_df = no_dups[['Date', 'Player', 'Team', 'Position', 'Salary',  'Minutes', 'FPTS', 'USG']]

print(final_df)

I drop duplicates first because the same player can appear in multiple contests on a single night, and I remove all players who didn't play. I then create a df called grouped_teams, which holds every team in the df grouped by date and team name. I then iterate over the first df with iterrows, and over the second df the same way, to find each player's team and specific date and divide his stats by the calculated totals to get the usage rate, stored in no_dups.at[index, 'USG']. There are 73k rows in my df, so iterating over each one takes a very long time.


1 Answer


It takes forever because you are iterating row by row. I can't seem to find the article, but I remember reading a comparison of ways to iterate through a dataframe in which itertuples was about 10x faster than iterrows, and zip was about 100x faster. The catch when switching from iterrows to itertuples is that you can no longer index into the row by column name the way you can with a Series, so you need to be careful about column access; itertuples does yield namedtuples, though, so columns whose names are valid Python identifiers are still reachable as attributes.
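For what it's worth, a rough sketch of the itertuples version (reusing no_dups and grouped_teams from the question) would look something like this:

# Build a (Date, Team) -> totals lookup once instead of rescanning grouped_teams
totals = grouped_teams.set_index(['Date', 'Team'])

for row in no_dups.itertuples():
    tot = totals.loc[(row.Date, row.Team)]
    no_dups.at[row.Index, 'USG'] = (
        100 * (row.FGA + 0.44 * row.FTA + row.TO) * (tot['Minutes'] / 5)
    ) / (row.Minutes * (tot['FGA'] + 0.44 * tot['FTA'] + tot['TO']))

That's still row-by-row, though.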

But the fastest way to do this is to run the calculation on all the rows at once instead of doing each one individually.

You already calculate the team totals in your second dataframe, so do a left join/merge on ['Date', 'Team'] to attach those totals to the no_dups dataframe. Then you can compute USG using whole columns instead of going one row at a time. I slightly changed the names of the total columns: when you merge frames that share column names, pandas adds _x and _y suffixes, and while there are ways to deal with that, it's simpler to rename them outright. I also changed how the columns get named; instead of hard-coding the list (which forces the columns to be in a fixed order), the names are now built from the aggregation result, which is more robust.

There is another issue too: the Date column has mixed formats (e.g. '1/1/18' and '2018-01-01'), so those rows won't be aggregated together in your groupby. We need to deal with that first. It appears to affect only the Brooklyn Nets, but there could be more in your full data set.
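You can check the normalization on a toy example:

import pandas as pd

# Both spellings parse to the same Timestamp
print(pd.to_datetime('1/1/18'))      # 2018-01-01 00:00:00
print(pd.to_datetime('2018-01-01'))  # 2018-01-01 00:00:00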

Code:

import pandas as pd

player_df = pd.read_csv('Sample Data.csv') # replace with sample data file

# Get the date column to be the same
player_df['Date'] = pd.to_datetime(player_df['Date'])

no_dups = player_df.drop_duplicates()
no_dups = no_dups[no_dups.Minutes != 0]

# Team totals per (Date, Team); the agg result has MultiIndex columns, so
# flatten them and prefix each name with 'tot_' to avoid a clash on the merge
grouped_teams = no_dups.groupby(['Date', 'Team']).agg({'Minutes': ['sum'], 'FGA': ['sum'], 'FTA': ['sum'], 'TO': ['sum']})
grouped_teams.columns = ['tot_' + col[0] for col in grouped_teams.columns]
grouped_teams = grouped_teams.reset_index()


# Merge grouped_teams to no_dups on Team and Date
no_dups = no_dups.merge(grouped_teams, how='left', on=['Team','Date'])

# Do the calculations
no_dups['USG'] = (
    100 * (no_dups['FGA'] + 0.44 * no_dups['FTA'] + no_dups['TO']) * (no_dups['tot_Minutes'] / 5)
) / (no_dups['Minutes'] * (no_dups['tot_FGA'] + 0.44 * no_dups['tot_FTA'] + no_dups['tot_TO']))
    
final_df = no_dups[['Date', 'Player', 'Team', 'Position', 'Salary',  'Minutes', 'FPTS', 'USG']]

print(final_df)
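As an aside, instead of the merge you could use groupby().transform('sum'), which broadcasts each group's totals back onto the original rows; a sketch of that alternative, starting from the de-duplicated no_dups:

# Alternative: transform aligns the group sums back to every player row,
# so no merge (and no column renaming) is needed
totals = no_dups.groupby(['Date', 'Team'])[['Minutes', 'FGA', 'FTA', 'TO']].transform('sum')

no_dups['USG'] = (
    100 * (no_dups['FGA'] + 0.44 * no_dups['FTA'] + no_dups['TO']) * (totals['Minutes'] / 5)
) / (no_dups['Minutes'] * (totals['FGA'] + 0.44 * totals['FTA'] + totals['TO']))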

Time:

I timed each way (without including the time to read in the csv).

On the sample data (4,493 rows), the iterrows version took approximately 3 minutes 46.66 seconds.
My code took approximately 0.0568 seconds, so it is almost 4000x faster.
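If you want to reproduce the comparison, a simple wall-clock harness is enough (with the CSV read kept outside the timed section, per the note above):

import time

start = time.perf_counter()
# ... run one approach (the loop version or the merge version) here ...
elapsed = time.perf_counter() - start
print(f'{elapsed:.4f} seconds')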
