
Sample Input:


    GAME_ID     TIME    PER EVENT
0   0022000394  12:00   1   12
1   0022000394  12:00   1   10
2   0022000394  11:36   1   1
3   0022000394  11:24   1   1
4   0022000394  11:04   1   1
5   0022000394  10:41   1   1
6   0022000394  10:30   1   2
7   0022000394  10:29   1   4
8   0022000394  10:17   1   1
9   0022000394  10:01   1   1
10  0022000394  9:48    1   2
11  0022000394  9:46    1   4
12  0022000394  9:42    1   6
13  0022000394  9:42    1   3
14  0022000394  9:42    1   3
15  0022000394  9:25    1   1
16  0022000394  9:15    1   1
17  0022000394  9:15    1   6
18  0022000394  9:15    1   3
19  0022000394  8:53    1   1
20  0022000394  8:33    1   1
21  0022000394  8:22    1   1
22  0022000394  8:16    1   2
23  0022000394  8:16    1   4
24  0022000394  8:12    1   2

I have a dataframe that I split into groups of rows using groupby.

If a group contains rows whose EVENT column (EVENTMSGTYPE in my actual data) includes all of 1, 6, and 3, I want to update the row in the original dataframe where EVENT == 1.

Current Working Solution (Slow)

# Group by
for _, data in df.groupby(['GAME_ID', 'TIME', 'PER']):

    # If EVENT in group contains 1, 6, and 3 then update original df
    if all(x in list(data.EVENT) for x in [1, 6, 3]):

        # Update original df row where EVENT equals 1, should only have one value
        index = data[data.EVENT == 1].index.values[0]

        # Set UPDATED to True
        df.at[index, 'UPDATED'] = True

Expected Output:

    GAME_ID     TIME    PER EVENT UPDATED
...
16  0022000394  9:15    1   1     True
...

My dataframe has 1,694,389 rows and this takes ~53 seconds to run on my machine. Can the performance of this be improved?

radio23
  • kindly provide sample dataframe, with expected output – sammywemmy May 29 '22 at 22:46
  • this guide has some good tips for creating dummy data that behaves like your problem: [how to make good reproducible pandas examples](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) – Michael Delgado May 29 '22 at 23:03
  • Apologies, I have added a sample dataframe @sammywemmy – radio23 May 29 '22 at 23:14

3 Answers

df['UPDATED'] = (
    df.groupby(['GAME_ID', 'TIME', 'PER'])['EVENT']
      .filter(lambda x: set(x) >= {1, 3, 6}, dropna=False)
      .eq(1)
)

Output:

     GAME_ID   TIME  PER  EVENT  UPDATED
0   22000394  12:00    1     12    False
1   22000394  12:00    1     10    False
2   22000394  11:36    1      1    False
3   22000394  11:24    1      1    False
4   22000394  11:04    1      1    False
5   22000394  10:41    1      1    False
6   22000394  10:30    1      2    False
7   22000394  10:29    1      4    False
8   22000394  10:17    1      1    False
9   22000394  10:01    1      1    False
10  22000394   9:48    1      2    False
11  22000394   9:46    1      4    False
12  22000394   9:42    1      6    False
13  22000394   9:42    1      3    False
14  22000394   9:42    1      3    False
15  22000394   9:25    1      1    False
16  22000394   9:15    1      1     True
17  22000394   9:15    1      6    False
18  22000394   9:15    1      3    False
19  22000394   8:53    1      1    False
20  22000394   8:33    1      1    False
21  22000394   8:22    1      1    False
22  22000394   8:16    1      2    False
23  22000394   8:16    1      4    False
24  22000394   8:12    1      2    False

Stole the set logic from sammywemmy~
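
For anyone puzzled by how the one-liner works: filter(..., dropna=False) keeps qualifying groups intact and replaces every row of a non-qualifying group with NaN, so the trailing .eq(1) can only be True for rows that sit in a qualifying group and have EVENT == 1. A minimal sketch of that behaviour (the demo frame below is made up for illustration):

import pandas as pd

# Made-up miniature of the problem: the 9:15 group covers {1, 3, 6},
# the 9:25 group does not.
demo = pd.DataFrame({
    'GAME_ID': ['0022000394'] * 5,
    'TIME':    ['9:15', '9:15', '9:15', '9:25', '9:25'],
    'PER':     [1, 1, 1, 1, 1],
    'EVENT':   [1, 6, 3, 1, 2],
})

filtered = (demo.groupby(['GAME_ID', 'TIME', 'PER'])['EVENT']
                .filter(lambda x: set(x) >= {1, 3, 6}, dropna=False))

print(filtered)        # 1.0, 6.0, 3.0, NaN, NaN -- the non-matching group becomes NaN
print(filtered.eq(1))  # True, False, False, False, False -- NaN never equals 1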

BeRT2me
  • doesn't match expected output – sammywemmy May 29 '22 at 23:39
  • Only row 16 should be `True` – radio23 May 29 '22 at 23:41
  • The original [version of this answer] without the set logic takes a lot longer than my original solution – radio23 May 29 '22 at 23:57
  • Using the set logic this is much faster, 16 seconds on my machine – radio23 May 29 '22 at 23:59
  • Why do you use ">=" instead of "=="? – Drakax May 30 '22 at 00:14
  • @Drakax Because, for example: `{1,3,6,8} >= {1,3,6}` is `True`, but `{1,3,6,8} == {1,3,6}` is `False` – BeRT2me May 30 '22 at 00:29
  • Ok thank you :) Could you please try if this one is faster "df['UPDATED'] = df.groupby(['GAME_ID', 'TIME', 'PER'])['EVENT'].transform(lambda x: set(x) >= {1,3,6} and x == 1)"? And I was wondering if we could have used "df.groupby(['GAME_ID', 'TIME', 'PER'])['EVENT'].unique().???" somehow? – Drakax May 30 '22 at 01:02
  • I couldn't tell you why, but the `transform` method is far slower than the `filter` method on the test dataset. Also, `unique()` and `set()` are very similar in the type of result they produce, but `set()` will have a lot less bloat. – BeRT2me May 30 '22 at 01:36
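
On the set() versus unique() point above: both reduce a group's EVENT values to their distinct members, and either form supports the superset check (a purely illustrative sketch):

import pandas as pd

# Illustrative only: set() and unique() expose the same distinct values.
events = pd.Series([1, 6, 3, 3])

print(set(events) >= {1, 3, 6})           # True -- plain Python set
print(set(events.unique()) >= {1, 3, 6})  # True -- same test via unique()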

One option is using set with transform; speed-wise I expect BeRT2me's solution to be faster:

grouper = ['GAME_ID', 'TIME', 'PER']

df.assign(UPDATED = df.groupby(grouper)
                      .EVENT
                      .transform(lambda x: set(x) >= {1, 3, 6})
                      & df.EVENT.eq(1))

     GAME_ID   TIME  PER  EVENT  UPDATED
0   22000394  12:00    1     12    False
1   22000394  12:00    1     10    False
2   22000394  11:36    1      1    False
3   22000394  11:24    1      1    False
4   22000394  11:04    1      1    False
5   22000394  10:41    1      1    False
6   22000394  10:30    1      2    False
7   22000394  10:29    1      4    False
8   22000394  10:17    1      1    False
9   22000394  10:01    1      1    False
10  22000394   9:48    1      2    False
11  22000394   9:46    1      4    False
12  22000394   9:42    1      6    False
13  22000394   9:42    1      3    False
14  22000394   9:42    1      3    False
15  22000394   9:25    1      1    False
16  22000394   9:15    1      1     True
17  22000394   9:15    1      6    False
18  22000394   9:15    1      3    False
19  22000394   8:53    1      1    False
20  22000394   8:33    1      1    False
21  22000394   8:22    1      1    False
22  22000394   8:16    1      2    False
23  22000394   8:16    1      4    False
24  22000394   8:12    1      2    False
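
Note that assign returns a new DataFrame rather than updating df in place, so to keep the column you would bind the result (same expression as above):

df = df.assign(UPDATED = df.groupby(grouper)
                           .EVENT
                           .transform(lambda x: set(x) >= {1, 3, 6})
                           & df.EVENT.eq(1))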
sammywemmy
idx_cols = ['GAME_ID', 'TIME', 'PER']

# Move the grouping columns into the index so the per-group result can be
# aligned back onto the full frame.
df = df.set_index(idx_cols)

# cond1: one boolean per group (do the group's EVENT values cover 1, 6 and 3?),
# broadcast back to every row of the group via reindex_like; set() ensures the
# membership test checks values rather than the Series index.
cond1 = (
    df.groupby(level=idx_cols)['EVENT']
      .agg(lambda event_group: all(x in set(event_group) for x in [1, 6, 3]))
      .reindex_like(df)
)

# cond2: row-level check for the EVENT == 1 row.
cond2 = df['EVENT'].eq(1)

df['UPDATED'] = cond1 & cond2

df = df.reset_index()
print(df)

Output:

     GAME_ID   TIME  PER  EVENT  UPDATED
0   22000394  12:00    1     12    False
1   22000394  12:00    1     10    False
2   22000394  11:36    1      1    False
3   22000394  11:24    1      1    False
4   22000394  11:04    1      1    False
5   22000394  10:41    1      1    False
6   22000394  10:30    1      2    False
7   22000394  10:29    1      4    False
8   22000394  10:17    1      1    False
9   22000394  10:01    1      1    False
10  22000394   9:48    1      2    False
11  22000394   9:46    1      4    False
12  22000394   9:42    1      6    False
13  22000394   9:42    1      3    False
14  22000394   9:42    1      3    False
15  22000394   9:25    1      1    False
16  22000394   9:15    1      1     True
17  22000394   9:15    1      6    False
18  22000394   9:15    1      3    False
19  22000394   8:53    1      1    False
20  22000394   8:33    1      1    False
21  22000394   8:22    1      1    False
22  22000394   8:16    1      2    False
23  22000394   8:16    1      4    False
24  22000394   8:12    1      2    False
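
The reindex_like step is what maps the one-value-per-group result back onto every row: after the agg, cond1 is indexed by the unique (GAME_ID, TIME, PER) keys, and reindexing it against df's (duplicated) MultiIndex repeats each group's boolean for all of that group's rows. A small, made-up illustration:

import pandas as pd

# Made-up miniature: three rows in the 9:15 group (which covers 1, 6 and 3)
# and one row in the 9:25 group (which does not).
idx = pd.MultiIndex.from_tuples(
    [('0022000394', '9:15', 1)] * 3 + [('0022000394', '9:25', 1)],
    names=['GAME_ID', 'TIME', 'PER'],
)
demo = pd.DataFrame({'EVENT': [1, 6, 3, 1]}, index=idx)

per_group = (demo.groupby(level=['GAME_ID', 'TIME', 'PER'])['EVENT']
                 .agg(lambda g: set(g) >= {1, 6, 3}))

print(per_group)                     # one entry per group key: True, False
print(per_group.reindex_like(demo))  # one entry per original row: True, True, True, False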
Rodalm