3

I've actually already asked this question for SQL and got a great answer here: SQL - LAG to get previous value if condition using multiple previous columns satisfied

But now I need it for Pandas. Say we have a dataframe:

df = pd.DataFrame({'id':[1,2,3,4,5,6,7,8],
                   'EventName':['Team A vs Team B',
                                'Team A vs Team B',
                                'Team C vs Team D',
                                'Team Z vs Team A',
                                'Team A vs Team B',
                                'Team C vs Team D',
                                'Team C vs Team D',
                                'Team E vs Team F',],
                   'HomeTeam': ['Team A', 'Team A', 'Team C', 'Team Z',
                                'Team A', 'Team C', 'Team C', 'Team E'],
                   'Metric':[5,7,6,8,9,3,1,2]})

Which results in:

id  EventName           HomeTeam    Metric
------------------------------------------
1   Team A vs Team B    Team A      5
2   Team A vs Team B    Team A      7
3   Team C vs Team D    Team C      6
4   Team Z vs Team A    Team Z      8
5   Team A vs Team B    Team A      9
6   Team C vs Team D    Team C      3
7   Team C vs Team D    Team C      1
8   Team E vs Team F    Team E      2

I want to calculate a new column PreviousMetricN where N can be 1, 2, 3, ... which shows the previous value for Metric, but only if the HomeTeam was involved in the previous event. For example:

id  EventName           HomeTeam    Metric  PreviousMetric1 PreviousMetric2
------------------------------------------------------------------------
1   Team A vs Team B    Team A      5       NULL            NULL
2   Team A vs Team B    Team A      7       5               NULL
3   Team C vs Team D    Team C      6       NULL            NULL
4   Team Z vs Team A    Team Z      8       NULL            NULL
5   Team A vs Team B    Team A      9       8               7
6   Team C vs Team D    Team C      3       6               NULL
7   Team C vs Team D    Team C      1       3               6
8   Team E vs Team F    Team E      2       NULL            NULL

I guess it would be pretty easy with a for loop. But I need a solution that is vectorized or uses some combination of shift/groupby/np.where. Not even sure where to start with this?

John F
  • 994
  • 10
  • 26
  • I did something similar to this recently. After using the Pandas shift() function, I wrote a for loop with nested if statement. – windyvation Mar 29 '21 at 17:22

2 Answers2

3

First extract the AwayTeam and reshape the data so that there is a row with the Metric for each Home/Away Team. You can melt like I do here which gives much nicer column names but you need to sort so that rows are grouped together by the original index.

df['AwayTeam'] = df['EventName'].str.split(' vs ').str[1]

temp = (df.rename_axis(index='index').reset_index()
          .melt(id_vars=['index', 'id', 'EventName', 'Metric'], 
                value_vars=['HomeTeam', 'AwayTeam'],
                var_name='type', value_name='Team')
          .sort_values('index'))

#    index  id         EventName  Metric      type    Team
#0       0   1  Team A vs Team B       5  HomeTeam  Team A
#8       0   1  Team A vs Team B       5  AwayTeam  Team B
#1       1   2  Team A vs Team B       7  HomeTeam  Team A
#...
#14      6   7  Team C vs Team D       1  AwayTeam  Team D
#7       7   8  Team E vs Team F       2  HomeTeam  Team E
#15      7   8  Team E vs Team F       2  AwayTeam  Team F

Since a team cannot play itself we group to get the last value regardless of it was the Away or Home Team. Loop over the shifts, determine the value and then join back to the original DataFrame. We ensure alignment is on the original index.

shifts = [1,2]

for shift in shifts:
    temp[f'Prev_{shift}'] = temp.groupby('Team')['Metric'].shift(shift)

temp = temp[temp['type'].eq('HomeTeam')]
temp = temp.set_index('index')[[f'Prev_{shift}' for shift in shifts]]

df = pd.concat([df, temp], axis=1)

   id         EventName HomeTeam  Metric AwayTeam  Prev_1  Prev_2
0   1  Team A vs Team B   Team A       5   Team B     NaN     NaN
1   2  Team A vs Team B   Team A       7   Team B     5.0     NaN
2   3  Team C vs Team D   Team C       6   Team D     NaN     NaN
3   4  Team Z vs Team A   Team Z       8   Team A     NaN     NaN
4   5  Team A vs Team B   Team A       9   Team B     8.0     7.0
5   6  Team C vs Team D   Team C       3   Team D     6.0     NaN
6   7  Team C vs Team D   Team C       1   Team D     3.0     6.0
7   8  Team E vs Team F   Team E       2   Team F     NaN     NaN
ALollz
  • 57,915
  • 7
  • 66
  • 89
3

Use @Alollz structure:

df = pd.DataFrame({'id':[1,2,3,4,5,6,7,8],
                   'EventName':['Team A vs Team B',
                                'Team A vs Team B',
                                'Team C vs Team D',
                                'Team Z vs Team A',
                                'Team A vs Team B',
                                'Team C vs Team D',
                                'Team C vs Team D',
                                'Team E vs Team F',],
                   'HomeTeam': ['Team A', 'Team A', 'Team C', 'Team Z',
                                'Team A', 'Team C', 'Team C', 'Team E'],
                   'Metric':[5,7,6,8,9,3,1,2]})

dfe = df.assign(teams = df['EventName'].str.split(' vs ')).explode('teams')

shifts = [1, 2, 3]
for i in shifts:
    mapper = dfe.groupby('teams')['Metric'].shift(i).mask(dfe['teams'] != dfe['HomeTeam']).drop_duplicates()
    df[f'PreviousMetrics{i}'] = df.index.map(mapper)
df

Output:

   id         EventName HomeTeam  Metric  PreviousMetrics1  PreviousMetrics2  PreviousMetrics3
0   1  Team A vs Team B   Team A       5               NaN               NaN               NaN
1   2  Team A vs Team B   Team A       7               5.0               NaN               NaN
2   3  Team C vs Team D   Team C       6               NaN               NaN               NaN
3   4  Team Z vs Team A   Team Z       8               NaN               NaN               NaN
4   5  Team A vs Team B   Team A       9               8.0               7.0               5.0
5   6  Team C vs Team D   Team C       3               6.0               NaN               NaN
6   7  Team C vs Team D   Team C       1               3.0               6.0               NaN
7   8  Team E vs Team F   Team E       2               NaN               NaN               NaN
Scott Boston
  • 147,308
  • 15
  • 139
  • 187