How can I add new columns efficiently using groupby in Pandas?

Question

I am using nba_py to get the scoreboard data for some NBA matches.

Below is an example of how the data are structured:

   SEASON  GAME_DATE_EST        GAME_SEQUENCE  GAME_ID   HOME_TEAM_ID  VISITOR_TEAM_ID  WINNER
0  2013    2013-10-05T00:00:00  1              11300001  12321         1610612760       V
1  2013    2013-10-05T00:00:00  2              11300002  1610612754    1610612741       V
2  2013    2013-10-05T00:00:00  3              11300003  1610612745    1610612740       V
3  2013    2013-10-05T00:00:00  4              11300004  1610612747    1610612744       H
4  2013    2013-10-06T00:00:00  1              11300005  12324         1610612755       V

You can find a part of the data here: NBA Games Data.

My aim is to create and add to the original data the following columns:

For the hometeam:

   1. Total wins/losses for hometeam if hometeam plays at home ("HOMETEAM_HOME_WINS"/"HOMETEAM_HOME_LOSSES")
   2. Total wins/losses for hometeam if hometeam is visiting ("HOMETEAM_VISITOR_WINS"/"HOMETEAM_VISITOR_LOSSES")

For the visitor_team:

   3. Total wins/losses for visitor_team if visitor_team plays at home ("VISITOR_TEAM_HOME_WINS"/"VISITOR_TEAM_HOME_LOSSES")
   4. Total wins/losses for visitor_team if visitor_team is visiting ("VISITOR_TEAM_VISITOR_WINS"/"VISITOR_TEAM_VISITOR_LOSSES")

My first simplistic approach is below:

def get_home_team_home_wins(x):
    # Home wins of this row's home team in earlier games of the same season
    hometeam = x.HOME_TEAM_ID
    season = x.SEASON
    gid = x.name
    season_hometeam_games = grouped_seasons_hometeams.get_group((season, hometeam))
    home_games = season_hometeam_games[season_hometeam_games.index < gid]

    # value_counts() omits labels that never occur, so default to 0
    return home_games.WINNER.value_counts().get("H", 0)

grouped_seasons_hometeams = df.groupby(["SEASON", "HOME_TEAM_ID"])

df["HOMETEAM_HOME_WINS"] = df.apply(get_home_team_home_wins, axis=1)

Another approach is iterating over the rows:

grouped_seasons = df.groupby("SEASON")
df["HOMETEAM_HOME_WINS"] = 0

current_season = None
for index, row in df.iterrows():
    season = row.SEASON
    if season != current_season:
        current_season = season
        season_games = grouped_seasons.get_group(current_season)

    hometeam = row.HOME_TEAM_ID
    games = season_games[season_games.index < index]
    home_games = games[games.HOME_TEAM_ID == hometeam]

    # value_counts() omits labels that never occur, so default to 0
    home_wins = home_games.WINNER.value_counts().get("H", 0)

    # Write straight into df; assigning to `row` would only change a copy
    df.loc[index, "HOMETEAM_HOME_WINS"] = home_wins

Update 1:

Below are functions that return the wins/losses for the hometeam when it plays at home and when it visits. The ones for the visitor_team would be similar.

def get_home_team_home_wins_losses(x):
    hometeam = x.HOME_TEAM_ID
    season = x.SEASON
    gid = x.name

    # Earlier games of the same season in which this team was the home side
    games = df[(df.SEASON == season) & (df.index < gid)]
    home_team_home_games = games[games.HOME_TEAM_ID == hometeam]

    # HOMETEAM plays at home: "H" is a win, "V" is a loss
    counts = home_team_home_games.WINNER.value_counts()
    home_team_home_wins = counts.get("H", 0)
    home_team_home_losses = counts.get("V", 0)

    return [home_team_home_wins, home_team_home_losses]

def get_home_team_visitor_wins_losses(x):
    hometeam = x.HOME_TEAM_ID
    season = x.SEASON
    gid = x.name

    # Earlier games of the same season in which this team was the visiting side
    games = df[(df.SEASON == season) & (df.index < gid)]
    home_team_visitor_games = games[games.VISITOR_TEAM_ID == hometeam]

    # HOMETEAM visits: "V" is a win, "H" is a loss
    counts = home_team_visitor_games.WINNER.value_counts()
    home_team_visitor_wins = counts.get("V", 0)
    home_team_visitor_losses = counts.get("H", 0)

    return [home_team_visitor_wins, home_team_visitor_losses]

df["HOME_TEAM_HOME_WINS"], df["HOME_TEAM_HOME_LOSSES"] = zip(*df.apply(lambda x: get_home_team_home_wins_losses(x), axis=1))
df["HOME_TEAM_VISITOR_WINS"], df["HOME_TEAM_VISITOR_LOSSES"] = zip(*df.apply(lambda x: get_home_team_visitor_wins_losses(x), axis=1))
df["HOME_TEAM_WINS"] = df["HOME_TEAM_HOME_WINS"] + df["HOME_TEAM_VISITOR_WINS"]
df["HOME_TEAM_LOSSES"] = df["HOME_TEAM_HOME_LOSSES"] + df["HOME_TEAM_VISITOR_LOSSES"]

The above approaches are not efficient, so I am thinking of using groupby, but it is not clear to me how.

I will add updates whenever I find something more efficient.

Any ideas? Thanks.

Can you add sample data with structure? There are some fields (name, season) referenced that you do not explicitly show in data structure. — Parfait, Feb 17 '16 at 04:59

Parfait · Answer 1 · 2016-02-17T18:11:20.930

Consider using transform(), but first create the integer indicator columns HOMEWINNER and VISITWINNER conditionally. The commented-out lines are easier-to-read equivalents of the if/else calculations using numpy.where(), in case you have NumPy imported as np.

Note that transform() keeps every row but aggregates by the IDs, so every record for a given HOME_TEAM_ID and SEASON repeats the same value in these aggregate columns:

nbadf['VISITWINNER'] =  [1 if x == 'V' else 0 for x in nbadf['WINNER']]
#nbadf['VISITWINNER'] = np.where(nbadf['WINNER']=='V', 1, 0)

nbadf['HOMEWINNER'] = [1 if x == 'H' else 0 for x in nbadf['WINNER']]    
#nbadf['HOMEWINNER'] = np.where(nbadf['WINNER']=='H', 1, 0)

nbadf['HOME_TEAM_WINS'] = nbadf.groupby(['HOME_TEAM_ID','SEASON'])\
                                        ['HOMEWINNER'].transform('sum')
nbadf['HOME_TEAM_LOSSES'] = nbadf.groupby(['HOME_TEAM_ID','SEASON'])\
                                          ['VISITWINNER'].transform('sum')

nbadf['VISIT_TEAM_WINS'] = nbadf.groupby(['VISITOR_TEAM_ID','SEASON'])\
                                         ['VISITWINNER'].transform('sum')
nbadf['VISIT_TEAM_LOSSES'] = nbadf.groupby(['VISITOR_TEAM_ID','SEASON'])\
                                           ['HOMEWINNER'].transform('sum')

nbadf.drop(['HOMEWINNER', 'VISITWINNER'],inplace=True,axis=1)

#   SEASON  ...  WINNER  HOME_TEAM_WINS  HOME_TEAM_LOSSES  VISIT_TEAM_WINS  VISIT_TEAM_LOSSES
#0    2013  ...      V               0                 1                1                  0
#1    2013  ...      V               0                 1                1                  0
#2    2013  ...      V               0                 1                1                  0
#3    2013  ...      H               1                 0                0                  1
#4    2013  ...      V               0                 1                1                  0
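
To make the difference concrete: transform('sum') puts the full-season total on every row, whereas the question's index < gid filter counts only earlier games. Below is a minimal sketch of both on a made-up frame. The team IDs 1 and 2 and the name toy are invented for illustration, and the running count assumes the rows are already in chronological order; treat it as one possible approach, not something checked against the full dataset.

```python
import pandas as pd

# Hypothetical toy data; team IDs 1 and 2 are invented for illustration
toy = pd.DataFrame({
    "SEASON": [2013, 2013, 2013, 2013, 2013],
    "HOME_TEAM_ID": [1, 2, 1, 2, 1],
    "WINNER": ["H", "V", "V", "H", "H"],
})
toy["HOMEWINNER"] = (toy["WINNER"] == "H").astype(int)

grouped = toy.groupby(["HOME_TEAM_ID", "SEASON"])["HOMEWINNER"]

# Season total, repeated on every row of the group
toy["HOME_TEAM_WINS"] = grouped.transform("sum")

# Running count of earlier home wins: cumsum() includes the current game,
# so subtract the current row's own result
toy["HOMETEAM_HOME_WINS"] = grouped.cumsum() - toy["HOMEWINNER"]
```

On this toy frame, team 1 ends the season with two home wins, but its first home game still shows zero earlier wins in the running column.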

For home teams that later play as visitors, and vice versa, consider merging two column subsets of the data frame on the team IDs (the columns are selected by position, so change the numbers if needed). The merge pairs each home game with the games in which the same team was the visitor. Then run the same kind of aggregates on mergedf, calculating HOMEWINNER from WINNER_x and VISITWINNER from WINNER_y this time:

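# MERGES HOME SUBSET DF AND VISITOR SUBSET DF
# Columns are selected by position; change the numbers if your layout differs
mergedf = pd.merge(nbadf.iloc[:, [0,1,2,3,4,6]], nbadf.iloc[:, [0,1,2,3,5,6]],
                   left_on=['HOME_TEAM_ID'], right_on=['VISITOR_TEAM_ID'], how='inner')

# Indicator columns from the suffixed WINNER columns (_x = home subset, _y = visitor subset)
mergedf['HOMEWINNER'] = [1 if x == 'H' else 0 for x in mergedf['WINNER_x']]
mergedf['VISITWINNER'] = [1 if x == 'V' else 0 for x in mergedf['WINNER_y']]

mergedf['HOMETEAM_AS_VISITOR_WINS'] = mergedf.groupby(['VISITOR_TEAM_ID','SEASON_y'])\
                                                      ['VISITWINNER'].transform('sum')

mergedf['VISITORTEAM_AS_HOME_WINS'] = mergedf.groupby(['HOME_TEAM_ID','SEASON_x'])\
                                                      ['HOMEWINNER'].transform('sum')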
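
If the goal is a per-row count of earlier games only (for example, how many games the current home team has already won as a visitor), a different route from the merge may be easier to follow: reshape to one row per team per game, take a running sum per team and season, and map the result back. The sketch below uses that alternative. The helper one_role, the columns GAME, TEAM, ROLE, WON, VISITOR_WIN and PRIOR_VISITOR_WINS, and the toy data are all hypothetical names made up for illustration, and the rows are assumed to be in chronological order.

```python
import pandas as pd

# Hypothetical toy data; team IDs 1 and 2 are invented for illustration
toy = pd.DataFrame({
    "SEASON": [2013] * 5,
    "HOME_TEAM_ID": [1, 2, 1, 2, 1],
    "VISITOR_TEAM_ID": [2, 1, 2, 1, 2],
    "WINNER": ["H", "V", "V", "H", "H"],
})

def one_role(df, team_col, role, win_flag):
    """One row per game, seen from the team in `team_col` (hypothetical helper)."""
    return pd.DataFrame({
        "GAME": df.index,
        "SEASON": df["SEASON"].to_numpy(),
        "TEAM": df[team_col].to_numpy(),
        "ROLE": role,
        "WON": (df["WINNER"] == win_flag).to_numpy().astype(int),
    })

# Stack the home view and the visitor view, then restore game order
long = pd.concat(
    [one_role(toy, "HOME_TEAM_ID", "HOME", "H"),
     one_role(toy, "VISITOR_TEAM_ID", "VISITOR", "V")],
    ignore_index=True,
).sort_values("GAME", kind="stable")

# 1 only where the team was visiting and won
long["VISITOR_WIN"] = long["WON"] * (long["ROLE"] == "VISITOR")

# Running total per team and season, minus the current game itself
long["PRIOR_VISITOR_WINS"] = (
    long.groupby(["SEASON", "TEAM"])["VISITOR_WIN"].cumsum() - long["VISITOR_WIN"]
)

# Map the home team's value back onto the original rows via the game index
home_view = long[long["ROLE"] == "HOME"].set_index("GAME")
toy["HOMETEAM_VISITOR_WINS"] = home_view["PRIOR_VISITOR_WINS"]
```

The same pattern, with the ROLE and WON conditions swapped, would give the other combinations listed in the question, such as the visitor team's earlier home wins.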

In my case I will use cumsum. The HOME_TEAM_WINS/LOSSES includes the games that the HOME_TEAM plays at home and respectively for VISIT_TEAM_WINS/LOSSES the VISITOR_TEAM visits, right — IordanouGiannis, Feb 17 '16 at 10:48
It should as the pair uses different IDs in groupby. Check results on larger data set. — Parfait, Feb 17 '16 at 14:34
What about when the HOME_TEAM visits and VISITOR_TEAM plays at home ? — IordanouGiannis, Feb 17 '16 at 17:11
Ah-hah! I almost added that part to original answer but did not want to confuse you. Consider a merge of home team and visitor team. See update. — Parfait, Feb 17 '16 at 18:05
The merge part is a bit confusing. How does it fit with the first part of your answer ? Also, I added a simple function to do what I need. — IordanouGiannis, Feb 17 '16 at 23:19
I think you should leave your answer, it was helpful. The code I have added with update 1 is just a step to finding something more efficient. — IordanouGiannis, Feb 18 '16 at 01:03
