How to label rows of DataFrame where the set of column values match?

Question

I have a DataFrame containing columns that overlap in a sense:

import pandas as pd

df = pd.DataFrame({
    'Date': ['2017-04-02', '2017-04-02', '2017-04-02', '2017-04-02', '2017-04-02', '2017-04-02'],
    'Team': ['CHC', 'ARI', 'NYY', 'TBR', 'STL', 'SFG'],
    'Home': [True, False, True, False, False, True],
    'Opp': ['STL', 'SFG', 'TBR', 'NYY', 'CHC', 'ARI'],
    'Rslt': ['L', 'W', 'L', 'W', 'W', 'L']
})

df['Date'] = pd.to_datetime(df['Date'])

print(df)

OUTPUT:

        Date Team   Home  Opp Rslt
0 2017-04-02  CHC   True  STL    L
1 2017-04-02  ARI  False  SFG    W
2 2017-04-02  NYY   True  TBR    L
3 2017-04-02  TBR  False  NYY    W
4 2017-04-02  STL  False  CHC    W
5 2017-04-02  SFG   True  ARI    L

For the date of 2017-04-01, there were 3 games played. The DataFrame contains the game results for each day for each team. This results in 6 results. Take row 2 and 3, this is a game between NYY and TBR:

Row 2 gives the NYY result of L, meaning they lost
Row 3 gives the TBR result of W, meaning they won

What I'm trying to do is group all row pairs that relate to the same game. My initial idea was to create a new column that would act as a label for the pair and then use that to group on or set MultiIndex. I thought about it and considered concatenating the three columns into a single string for each row and then, using sets, look through all rows for each date in Date and find the other row that contains the same characters:

df['Match'] = df['Date'].dt.strftime('%Y-%m-%d') + ',' + df['Team'] + ',' + df['Opp']

print(df)

OUTPUT:

        Date Team   Home  Opp Rslt               Match
0 2017-04-02  CHC   True  STL    L  2017-04-02,CHC,STL
1 2017-04-02  ARI  False  SFG    W  2017-04-02,ARI,SFG
2 2017-04-02  NYY   True  TBR    L  2017-04-02,NYY,TBR
3 2017-04-02  TBR  False  NYY    W  2017-04-02,TBR,NYY
4 2017-04-02  STL  False  CHC    W  2017-04-02,STL,CHC
5 2017-04-02  SFG   True  ARI    L  2017-04-02,SFG,ARI

From here, I'm not sure how to proceed. I have a method in mind using sets that I've used in the past. If we focus on row 2 and 3 again, subtracting the sets of the string, split using the ,, and taking the bool() will return False for two sets containing the same string elements and True for anything else (different sets):

print(
    bool(set('2017-04-02,NYY,TBR'.split(',')) - set('2017-04-02,TBR,NYY'.split(',')))
)
print(
    bool(set('2017-04-02,NYY,TBR'.split(',')) - set('2017-04-02,CHC,STL'.split(',')))
)

OUTPUT:

False
True

Is there a better way to take a row value in a column and lookup all other row values in that same column and label the rows where they are related? The kind of label I would like to have is creating a unique numbering of games. Since these three games happen on the same day, labelling the pairs as 1, 2, 3 would be great so that each game pair for each day has a unique ID.

P.S. I've also seen this post that kinda looks like what I'm trying to do... I've tried using .isin() but kept running into errors so scrapped that approach. I thought about pd.DataFrame.lookup but I'm not quite sure if that's the right approach either. Just need a way to group up each pair of rows.

Go easy on me if there is some formatting mistakes. It's been a while since I've needed to post a question. — MRT, Jun 29 '22 at 17:26
What is your ultimate expected output from these six records? — Scott Boston, Jun 29 '22 at 17:45

BeRT2me · Answer 1 · 2022-06-29T18:08:51.663

Merge the DataFrame on itself, swap values where it's not a home game, take the information you want, and then drop the duplicates:

df2 = df.merge(df, left_on=['Date', 'Team'], right_on=['Date', 'Opp'], suffixes=['_Home', '_Away'])
swap_cols = ['Team', 'Opp', 'Rslt', 'Home']
for col in swap_cols:
    df2[f'{col}_Home'], df2[f'{col}_Away'] = np.where(df2.Home_Home, [df2[f'{col}_Home'],df2[f'{col}_Away']], [df2[f'{col}_Away'],df2[f'{col}_Home']])

df2 = df2[['Date','Team_Home', 'Rslt_Home', 'Team_Away', 'Rslt_Away']].drop_duplicates()
print(df2)

Output:

        Date Team_Home Rslt_Home Team_Away Rslt_Away
0 2017-04-02       CHC         L       STL         W
1 2017-04-02       SFG         L       ARI         W
2 2017-04-02       NYY         L       TBR         W

score 1 · Answer 2 · answered Jun 29 '22 at 18:05

IIUC,

You can do it this way using merge and query.

df_match = df.merge(df, 
                    left_on=['Date', 'Team'], 
                    right_on=['Date', 'Opp'], 
                    suffixes=('','_opp'))

df_match.query('Home')

Output:

        Date Team  Home  Opp Rslt Team_opp  Home_opp Opp_opp Rslt_opp
0 2017-04-02  CHC  True  STL    L      STL     False     CHC        W
2 2017-04-02  NYY  True  TBR    L      TBR     False     NYY        W
5 2017-04-02  SFG  True  ARI    L      ARI     False     SFG        W

Extracting what one wants to keep, rather than working to remove what one doesn't want... A perfect simplification of what I did :') — BeRT2me, Jun 29 '22 at 18:08

mitoRibo · Answer 3 · 2022-06-29T18:19:00.790

To match the other answers better, you can deduplicate by selecting only the rows where the team was at home and then creating new columns for Result_home and Result_away.

I haven't tested, but I think this should be faster than approaches that merge the table onto itself

I'm not sure what you want your final output to look like but here's one option

rename_cols = {
    'Team':'Home_Team',
    'Opp':'Away_Team',
    'Rslt':'Result_home',
}

df = df[df['Home']]
df['Result_away'] = df['Rslt'].replace({'L':'W','W':'L'})
df = df.rename(columns=rename_cols).drop(columns=['Home'])
df

How to label rows of DataFrame where the set of column values match?

3 Answers3