
Sample Input:


    GAME_ID     TIME    PER EVENT
0   0022000394  12:00   1   12
1   0022000394  12:00   1   10
2   0022000394  11:36   1   1
3   0022000394  11:24   1   1
4   0022000394  11:04   1   1
5   0022000394  10:41   1   1
6   0022000394  10:30   1   2
7   0022000394  10:29   1   4
8   0022000394  10:17   1   1
9   0022000394  10:01   1   1
10  0022000394  9:48    1   2
11  0022000394  9:46    1   4
12  0022000394  9:42    1   6
13  0022000394  9:42    1   3
14  0022000394  9:42    1   3
15  0022000394  9:25    1   1
16  0022000394  9:15    1   1
17  0022000394  9:15    1   6
18  0022000394  9:15    1   3
19  0022000394  8:53    1   1
20  0022000394  8:33    1   1
21  0022000394  8:22    1   1
22  0022000394  8:16    1   2
23  0022000394  8:16    1   4
24  0022000394  8:12    1   2

I have a dataframe that I split into groups of rows using groupby.

If a group contains rows whose EVENT column (EVENTMSGTYPE in my actual data) includes all of 1, 6, and 3, I want to update the row in the original dataframe where EVENT == 1.

Current Working Solution (Slow)

# Group by
for _, data in df.groupby(['GAME_ID', 'TIME', 'PER']):

    # If EVENT in group contains 1, 6, and 3 then update original df
    if all(x in list(data.EVENT) for x in [1, 6, 3]):

        # Update original df row where EVENT equals 1, should only have one value
        index = data[data.EVENT == 1].index.values[0]

        # Set UPDATED to True
        df.at[index, 'UPDATED'] = True

Expected Output:

    GAME_ID     TIME    PER EVENT UPDATED
...
16  0022000394  9:15    1   1     True
...

My dataframe has 1,694,389 rows and this takes ~53 seconds to run on my machine. Can the performance of this be improved?

radio23
  • kindly provide sample dataframe, with expected output – sammywemmy May 29 '22 at 22:46
  • this guide has some good tips for creating dummy data that behaves like your problem: [how to make good reproducible pandas examples](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) – Michael Delgado May 29 '22 at 23:03
  • Apologies, I have added a sample dataframe @sammywemmy – radio23 May 29 '22 at 23:14

3 Answers

df['UPDATED'] = (
    df.groupby(['GAME_ID', 'TIME', 'PER'])['EVENT']
      .filter(lambda x: set(x) >= {1, 3, 6}, dropna=False)
      .eq(1)
)

Output:

     GAME_ID   TIME  PER  EVENT  UPDATED
0   22000394  12:00    1     12    False
1   22000394  12:00    1     10    False
2   22000394  11:36    1      1    False
3   22000394  11:24    1      1    False
4   22000394  11:04    1      1    False
5   22000394  10:41    1      1    False
6   22000394  10:30    1      2    False
7   22000394  10:29    1      4    False
8   22000394  10:17    1      1    False
9   22000394  10:01    1      1    False
10  22000394   9:48    1      2    False
11  22000394   9:46    1      4    False
12  22000394   9:42    1      6    False
13  22000394   9:42    1      3    False
14  22000394   9:42    1      3    False
15  22000394   9:25    1      1    False
16  22000394   9:15    1      1     True
17  22000394   9:15    1      6    False
18  22000394   9:15    1      3    False
19  22000394   8:53    1      1    False
20  22000394   8:33    1      1    False
21  22000394   8:22    1      1    False
22  22000394   8:16    1      2    False
23  22000394   8:16    1      4    False
24  22000394   8:12    1      2    False

Stole the set logic from sammywemmy~
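
For anyone puzzled by how the one-liner works: filter(..., dropna=False) keeps qualifying groups intact and replaces every row of a non-qualifying group with NaN, so the trailing .eq(1) can only be True for rows that sit in a qualifying group and have EVENT == 1. A minimal sketch of that behaviour (the demo frame below is made up for illustration):

import pandas as pd

# Made-up miniature of the problem: the 9:15 group covers {1, 3, 6},
# the 9:25 group does not.
demo = pd.DataFrame({
    'GAME_ID': ['0022000394'] * 5,
    'TIME':    ['9:15', '9:15', '9:15', '9:25', '9:25'],
    'PER':     [1, 1, 1, 1, 1],
    'EVENT':   [1, 6, 3, 1, 2],
})

filtered = (demo.groupby(['GAME_ID', 'TIME', 'PER'])['EVENT']
                .filter(lambda x: set(x) >= {1, 3, 6}, dropna=False))

print(filtered)        # 1.0, 6.0, 3.0, NaN, NaN -- the non-matching group becomes NaN
print(filtered.eq(1))  # True, False, False, False, False -- NaN never equals 1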

BeRT2me
  • doesn't match expected output – sammywemmy May 29 '22 at 23:39
  • Only row 16 should be `True` – radio23 May 29 '22 at 23:41
  • The original [version of this answer] without the set logic takes a lot longer than my original solution – radio23 May 29 '22 at 23:57
  • Using the set logic this is much faster, 16 seconds on my machine – radio23 May 29 '22 at 23:59
  • Why do you use ">=" instead of "=="? – Drakax May 30 '22 at 00:14
  • @Drakax Because, for example: `{1,3,6,8} >= {1,3,6}` is `True`, but `{1,3,6,8} == {1,3,6}` is `False` – BeRT2me May 30 '22 at 00:29
  • Ok thank you :) Could you please try if this one is faster "df['UPDATED'] = df.groupby(['GAME_ID', 'TIME', 'PER'])['EVENT'].transform(lambda x: set(x) >= {1,3,6} and x == 1)"? And I was wondering if we could have used "df.groupby(['GAME_ID', 'TIME', 'PER'])['EVENT'].unique().???" somehow? – Drakax May 30 '22 at 01:02
  • I couldn't tell you why, but the `transform` method is far slower than the `filter` method on the test dataset. Also, `unique()` and `set()` are very similar in the type of result they produce, but `set()` will have a lot less bloat. – BeRT2me May 30 '22 at 01:36
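
On the set() versus unique() point above: both reduce a group's EVENT values to their distinct members, and either form supports the superset check (a purely illustrative sketch):

import pandas as pd

# Illustrative only: set() and unique() expose the same distinct values.
events = pd.Series([1, 6, 3, 3])

print(set(events) >= {1, 3, 6})           # True -- plain Python set
print(set(events.unique()) >= {1, 3, 6})  # True -- same test via unique()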

One option is using set with transform; speed-wise I expect BeRT2me's solution to be faster:

grouper = ['GAME_ID', 'TIME', 'PER']

df.assign(UPDATED = df.groupby(grouper)
                      .EVENT
                      .transform(lambda x: set(x) >= {1, 3, 6})
                      & df.EVENT.eq(1))

     GAME_ID   TIME  PER  EVENT  UPDATED
0   22000394  12:00    1     12    False
1   22000394  12:00    1     10    False
2   22000394  11:36    1      1    False
3   22000394  11:24    1      1    False
4   22000394  11:04    1      1    False
5   22000394  10:41    1      1    False
6   22000394  10:30    1      2    False
7   22000394  10:29    1      4    False
8   22000394  10:17    1      1    False
9   22000394  10:01    1      1    False
10  22000394   9:48    1      2    False
11  22000394   9:46    1      4    False
12  22000394   9:42    1      6    False
13  22000394   9:42    1      3    False
14  22000394   9:42    1      3    False
15  22000394   9:25    1      1    False
16  22000394   9:15    1      1     True
17  22000394   9:15    1      6    False
18  22000394   9:15    1      3    False
19  22000394   8:53    1      1    False
20  22000394   8:33    1      1    False
21  22000394   8:22    1      1    False
22  22000394   8:16    1      2    False
23  22000394   8:16    1      4    False
24  22000394   8:12    1      2    False
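
Note that assign returns a new DataFrame rather than updating df in place, so to keep the column you would bind the result (same expression as above):

df = df.assign(UPDATED = df.groupby(grouper)
                           .EVENT
                           .transform(lambda x: set(x) >= {1, 3, 6})
                           & df.EVENT.eq(1))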
sammywemmy
idx_cols = ['GAME_ID', 'TIME', 'PER']

# Move the grouping columns into the index so the per-group result can be
# aligned back onto the full frame.
df = df.set_index(idx_cols)

# cond1: one boolean per group (do the group's EVENT values cover 1, 6 and 3?),
# broadcast back to every row of the group via reindex_like; set() ensures the
# membership test checks values rather than the Series index.
cond1 = (
    df.groupby(level=idx_cols)['EVENT']
      .agg(lambda event_group: all(x in set(event_group) for x in [1, 6, 3]))
      .reindex_like(df)
)

# cond2: row-level check for the EVENT == 1 row.
cond2 = df['EVENT'].eq(1)

df['UPDATED'] = cond1 & cond2

df = df.reset_index()
print(df)

Output:

     GAME_ID   TIME  PER  EVENT  UPDATED
0   22000394  12:00    1     12    False
1   22000394  12:00    1     10    False
2   22000394  11:36    1      1    False
3   22000394  11:24    1      1    False
4   22000394  11:04    1      1    False
5   22000394  10:41    1      1    False
6   22000394  10:30    1      2    False
7   22000394  10:29    1      4    False
8   22000394  10:17    1      1    False
9   22000394  10:01    1      1    False
10  22000394   9:48    1      2    False
11  22000394   9:46    1      4    False
12  22000394   9:42    1      6    False
13  22000394   9:42    1      3    False
14  22000394   9:42    1      3    False
15  22000394   9:25    1      1    False
16  22000394   9:15    1      1     True
17  22000394   9:15    1      6    False
18  22000394   9:15    1      3    False
19  22000394   8:53    1      1    False
20  22000394   8:33    1      1    False
21  22000394   8:22    1      1    False
22  22000394   8:16    1      2    False
23  22000394   8:16    1      4    False
24  22000394   8:12    1      2    False
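
The reindex_like step is what maps the one-value-per-group result back onto every row: after the agg, cond1 is indexed by the unique (GAME_ID, TIME, PER) keys, and reindexing it against df's (duplicated) MultiIndex repeats each group's boolean for all of that group's rows. A small, made-up illustration:

import pandas as pd

# Made-up miniature: three rows in the 9:15 group (which covers 1, 6 and 3)
# and one row in the 9:25 group (which does not).
idx = pd.MultiIndex.from_tuples(
    [('0022000394', '9:15', 1)] * 3 + [('0022000394', '9:25', 1)],
    names=['GAME_ID', 'TIME', 'PER'],
)
demo = pd.DataFrame({'EVENT': [1, 6, 3, 1]}, index=idx)

per_group = (demo.groupby(level=['GAME_ID', 'TIME', 'PER'])['EVENT']
                 .agg(lambda g: set(g) >= {1, 6, 3}))

print(per_group)                     # one entry per group key: True, False
print(per_group.reindex_like(demo))  # one entry per original row: True, True, True, False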
Rodalm