0

I have 2 dataframes for soccer info:

  • matches: info on individual matches. Contains columns for the ID of the league in which the match was played, the date the match was played, the season the match was played, and has 22 columns of player IDs (home_player_1, home_player_2, ... home_player_11, away_player_1, away_player_2, ... away_player11) for each of the players that started the match.
  • player_attributes: info on the attributes of individual players recorded over time. Each player has multiple recordings each season for many seasons, so each row has a date column that indicates when the data in that row was recorded. This is important because it means the player ID column is not a unique index.

I want to add a league_id column to the player attributes dataframe so that I know which league the player was in at the time the specific attributes in that row were recorded.

What I've done so far:

  • added a season column to the player attributes df based on the value in the date column for each row
  • created a new player_matches df with columns match_id, player_id, season, league_id. Each match id in this df shows up in 22 rows (one for the id of each player that started the match).

My goal with this player_matches df is to use it to find the league id for each row in the player attributes table by matching the player id and season to a player_matches row and then using the value of league id for that row. The problem is that I haven't found an efficient way to do this. The player attributes df contains 121k rows and the player matches table contains 321k rows so normal easy indexing/masking/merging doesn't work. I tried using apply like this:

def get_league_id(row):
    player_api_id = row['player_api_id']
    season = row['season']

    match_row = player_matches[(player_matches['player_api_id'] == player_api_id) & (player_matches['season'] == season)].iloc[0]
    league_id = match_row['league_id']

    return league_id

player_attributes.apply(get_league_id, axis=1)

However, this gives the error: "IndexError: single positional indexer is out-of-bounds". I noticed that the error doesn't happen until the 13th row, so basically this code player_attributes.head(12).apply(get_league_id, axis=1) works.

Any idea why this error is occurring? Or do you have an alternate method that will work?

Uche Ozoemena
  • 816
  • 2
  • 10
  • 25
  • 2
    Please provide a meaningful [minimal reproducible example](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) of input/output. – mozway May 21 '23 at 05:48

1 Answers1

0

Turns out there is in fact a merge operation that works! :)

pd.merge(player_attributes, player_matches, left_on=['player_id', 'season'], right_on=['player_id', 'season'], how='left')

Basically this is preserving all the rows in player_attributes and joining the row in player_matches that shares the same player_id and season. That adds the league_id from the row in player_matches.

Uche Ozoemena
  • 816
  • 2
  • 10
  • 25