I have two dataframes I need to map over and combine into one. The first data frame contains NBA players. The headers are 'Date', 'Player', 'Team', 'Position', 'Salary', 'Pos_ID', 'Minutes', 'FPTS', 'USG'. I then have a second dataframe that's the first dataframe but grouped by date and team. The headers for this df are 'Date', 'Team', 'Minutes', 'FGA', 'FTA', 'To'. I'm trying to calculate USG rate for every player in the first dataframe. To do this I need to know what the total minutes, field goal attempts, free throw attempts, and turnover are for each team in each game for a given date. I then divide the same player's stats by the team's total stats. I have a working solution, but it's really slow and doesn't seem to be the most efficient way of doing this.
Here is the code:
import pandas as pd
player_df = pd.read_csv('Sample Data') # replace with sample data file
no_dups = player_df.drop_duplicates()
no_dups.loc[:, 'USG'] = pd.Series(dtype=float)
no_dups = no_dups[no_dups.Minutes != 0]
grouped_teams = no_dups.groupby(['Date', 'Team']).agg({'Minutes':['sum'], 'FGA': ['sum'], 'FTA': ['sum'], 'TO': ['sum'] })
grouped_teams.columns = ['Minutes', 'FGA', 'FTA', 'TO']
grouped_teams = grouped_teams.reset_index()
for index, row in no_dups.iterrows():
for i, r in grouped_teams.iterrows():
if no_dups.at[index, 'Team'] == grouped_teams.at[i, 'Team'] and no_dups.at[index, 'Date'] == grouped_teams.at[i, 'Date']:
no_dups.at[index, 'USG'] = (100*((no_dups.at[index, 'FGA'] + 0.44 * no_dups.at[index, 'FTA'] + no_dups.at[index, 'TO'])*(grouped_teams.at[i, 'Minutes']/5))) / (no_dups.at[index, 'Minutes']*(grouped_teams.at[i, 'FGA']+0.44*grouped_teams.at[i, 'FTA']+grouped_teams.at[i, 'TO']))
final_df = no_dups[['Date', 'Player', 'Team', 'Position', 'Salary', 'Minutes', 'FPTS', 'USG']]
print(final_df)
I have removed all players who didn't play and there are duplicates because the same player can play in multiple contests in a single night so I removed those. I then create a df called grouped_teams
which is every single team in the df grouped by date and team name. I then iterate over the first df using iterrows
and the second df the same way. I need to find each players team and specific date and divide his stats by the calculated total to get the usage rate. The column for that is no_dups.at[index, 'USG']
. There are 73k rows in my df so iterating over each one is taking a very long time.