I have 2 dataframes for soccer info:
matches
: info on individual matches. Contains columns for the ID of the league in which the match was played, the date the match was played, the season the match was played, and has 22 columns of player IDs (home_player_1
,home_player_2
, ...home_player_11
,away_player_1
,away_player_2
, ...away_player11
) for each of the players that started the match.player_attributes
: info on the attributes of individual players recorded over time. Each player has multiple recordings each season for many seasons, so each row has adate
column that indicates when the data in that row was recorded. This is important because it means the player ID column is not a unique index.
I want to add a league_id
column to the player attributes dataframe so that I know which league the player was in at the time the specific attributes in that row were recorded.
What I've done so far:
- added a
season
column to the player attributes df based on the value in thedate
column for each row - created a new
player_matches
df with columnsmatch_id
,player_id
,season
,league_id
. Each match id in this df shows up in 22 rows (one for the id of each player that started the match).
My goal with this player_matches
df is to use it to find the league id for each row in the player attributes table by matching the player id and season to a player_matches
row and then using the value of league id for that row. The problem is that I haven't found an efficient way to do this. The player attributes df contains 121k rows and the player matches table contains 321k rows so normal easy indexing/masking/merging doesn't work. I tried using apply
like this:
def get_league_id(row):
player_api_id = row['player_api_id']
season = row['season']
match_row = player_matches[(player_matches['player_api_id'] == player_api_id) & (player_matches['season'] == season)].iloc[0]
league_id = match_row['league_id']
return league_id
player_attributes.apply(get_league_id, axis=1)
However, this gives the error: "IndexError: single positional indexer is out-of-bounds". I noticed that the error doesn't happen until the 13th row, so basically this code player_attributes.head(12).apply(get_league_id, axis=1)
works.
Any idea why this error is occurring? Or do you have an alternate method that will work?