
I have a df and I want to filter rows based on a grouping: for each combination of (cc, odd, tree1, tree2), if the group contains a day value greater than 4, keep the whole group; otherwise drop it.

import pandas as pd

df = pd.DataFrame()
df['cc'] = ['BB', 'BB', 'BB', 'BB','BB', 'BB','BB', 'BB', 'DD', 'DD', 'DD', 'DD', 'DD', 'DD', 'DD', 'DD', 'ZZ', 'ZZ', 'ZZ', 'ZZ', 'ZZ', 'ZZ', 'ZZ', 'ZZ']
df['odd'] = [3434, 3434, 3434, 3434, 3435, 3435, 3435, 3435, 3434, 3434, 3434, 3434, 3435, 3435, 3435, 3435, 3434, 3434, 3434, 3434, 3435, 3435, 3435, 3435]
df['tree1'] = ['ASP', 'ASP', 'ASP', 'ASP', 'SAP', 'SAP', 'SAP', 'SAP', 'ASP', 'ASP', 'ASP', 'ASP', 'SAP', 'SAP', 'SAP', 'SAP', 'ASP', 'ASP', 'ASP', 'ASP', 'SAP', 'SAP', 'SAP', 'SAP']
df['tree2'] = ['ATK', 'ATK','ATK','ATK','ATK','ATK','ATK','ATK', 'ATK', 'ATK','ATK','ATK','ATK','ATK','ATK','ATK', 'ATK', 'ATK','ATK','ATK','ATK','ATK','ATK','ATK']
df['day'] = [1, 2, 3, 4, 3, 4, 5, 6, 2, 3, 4, 5, 1, 3, 5, 7, 1, 2, 6, 8, 2, 4, 6, 8]
df

I tried this, but it drops any row with a day value smaller than 4:

df_grouped = df.groupby(['cc', 'odd', 'tree1', 'tree2']).filter(df['day'] > 4)

I get this error: `TypeError: 'Series' object is not callable`

And I tried this:

df_grouped = df.groupby(['cc', 'odd', 'tree1', 'tree2']).filter(lambda x: x['day'] > 4)

I get this error: `TypeError: filter function returned a Series, but expected a scalar bool.`

I searched and tried to solve these errors, but the proposed solutions did not work for me. I would like to get a df like the one below:

df1 = pd.DataFrame()
df1['cc'] = ['BB', 'BB','BB', 'BB', 'DD', 'DD', 'DD', 'DD', 'DD', 'DD', 'DD', 'DD', 'ZZ', 'ZZ', 'ZZ', 'ZZ', 'ZZ', 'ZZ', 'ZZ', 'ZZ']
df1['odd'] = [3435, 3435, 3435, 3435, 3434, 3434, 3434, 3434, 3435, 3435, 3435, 3435, 3434, 3434, 3434, 3434, 3435, 3435, 3435, 3435]
df1['tree1'] = ['SAP', 'SAP', 'SAP', 'SAP', 'ASP', 'ASP', 'ASP', 'ASP', 'SAP', 'SAP', 'SAP', 'SAP', 'ASP', 'ASP', 'ASP', 'ASP', 'SAP', 'SAP', 'SAP', 'SAP']
df1['tree2'] = ['ATK','ATK','ATK','ATK', 'ATK', 'ATK','ATK','ATK','ATK','ATK','ATK','ATK', 'ATK', 'ATK','ATK','ATK','ATK','ATK','ATK','ATK']
df1['day'] = [3, 4, 5, 6, 2, 3, 4, 5, 1, 3, 5, 7, 1, 2, 6, 8, 2, 4, 6, 8]
df1

I have also tried to use the logical function any, but I could not make it work; it returns only True or False to me instead of a filtered dataframe.

cs95
Zmnako Awrahman
Can't you just filter the df first, prior to grouping? Do you even need to `groupby` after filtering? `groupby` is generally for aggregating – EdChum Jun 05 '18 at 14:36
  • 1
    No, the filtering works on the groups, I want to keep the rows that has values bigger than 4 based on the groupby because the groupby defines the grouping of the day. If I filter then groupby it will drop all rows that are 4 or smaller which is something I don't want – Zmnako Awrahman Jun 05 '18 at 14:40
Your question is then ill-formed; you want to filter by group size, correct? – EdChum Jun 05 '18 at 14:41
Not group size. I want to check the max day value per groupby group: if it is bigger than 4, keep the rows of that group; if not, drop all rows of that group. I think my question cannot be clearer :) – Zmnako Awrahman Jun 05 '18 at 14:43
@coldspeed why is it a duplicate? I am not asking for simple group filtering; please read my question again. The duplicate question does not provide a solution to my problem at all. – Zmnako Awrahman Jun 05 '18 at 14:47
As EdChum said, you can filter before grouping. Why do it after? – cs95 Jun 05 '18 at 14:48
  • IIUC you want `df_grouped = df.groupby(['cc', 'odd', 'tree1', 'tree2']).filter(lambda x: (x['day'] > 4).any())` @coldspeed I think the OP wants to filter the groups based on membership, so filter the groups if none of the days are greater than 4 for that grouping, you could filter prior to the grouping – EdChum Jun 05 '18 at 14:49
@coldspeed because it removes all rows which have values of 4 and smaller. I do not want that. I want to remove only the groups whose max value is 4 or smaller. Please see the two dataframes to see the difference. Filtering first would remove group 1 and rows from other groups that are 4 or smaller. – Zmnako Awrahman Jun 05 '18 at 14:50
  • Okay, one second. – cs95 Jun 05 '18 at 14:51
  • Did you try `df_grouped = df.groupby(['cc', 'odd', 'tree1', 'tree2']).filter(lambda x: (x['day'] > 4).any())` I think this is what you want – EdChum Jun 05 '18 at 14:52
  • I've reopened the question. Pardon my misunderstanding. – cs95 Jun 05 '18 at 14:57
Many thanks, @EdChum, your solution works – Zmnako Awrahman Jun 05 '18 at 17:03

2 Answers


Now that I've understood what you want, let's try something like transform + any:

df[df.assign(key=df.day > 4)
     .groupby(['cc', 'odd', 'tree1', 'tree2']).key.transform('any')
]

Or,

df[df.day.gt(4).groupby([df.cc, df.odd, df.tree1, df.tree2]).transform('any')]

    cc   odd tree1 tree2  day
4   BB  3435   SAP   ATK    3
5   BB  3435   SAP   ATK    4
6   BB  3435   SAP   ATK    5
7   BB  3435   SAP   ATK    6
8   DD  3434   ASP   ATK    2
9   DD  3434   ASP   ATK    3
10  DD  3434   ASP   ATK    4
11  DD  3434   ASP   ATK    5
12  DD  3435   SAP   ATK    1
13  DD  3435   SAP   ATK    3
14  DD  3435   SAP   ATK    5
15  DD  3435   SAP   ATK    7
16  ZZ  3434   ASP   ATK    1
17  ZZ  3434   ASP   ATK    2
18  ZZ  3434   ASP   ATK    6
19  ZZ  3434   ASP   ATK    8
20  ZZ  3435   SAP   ATK    2
21  ZZ  3435   SAP   ATK    4
22  ZZ  3435   SAP   ATK    6
23  ZZ  3435   SAP   ATK    8
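An equivalent formulation, not taken from the answer itself but following the "check the max day value" phrasing from the comments, broadcasts each group's maximum back to the rows with `transform('max')` (a sketch on a cut-down frame):

```python
import pandas as pd

df = pd.DataFrame({
    'cc':    ['BB', 'BB', 'BB', 'BB'],
    'odd':   [3434, 3434, 3435, 3435],
    'tree1': ['ASP', 'ASP', 'SAP', 'SAP'],
    'tree2': ['ATK', 'ATK', 'ATK', 'ATK'],
    'day':   [1, 2, 5, 6],
})

# transform('max') returns a Series aligned with the original rows,
# holding each row's group maximum, so comparing against 4 yields a
# row-aligned boolean mask: whole groups survive or fall together.
mask = df.groupby(['cc', 'odd', 'tree1', 'tree2'])['day'].transform('max') > 4
print(df[mask])
```

Here the (BB, 3434, ASP, ATK) group has max day 2 and is dropped entirely, while (BB, 3435, SAP, ATK) has max day 6 and is kept whole, including its day-5 row.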
cs95
Your first method is the faster one here; I added timings. It looks like `transform` is faster at coercing back to the orig df than using `filter` – EdChum Jun 05 '18 at 14:59
  • @EdChum Thank you! – cs95 Jun 05 '18 at 14:59
@coldspeed both solutions work like a charm, thanks. As EdChum wrote, the first solution is much faster when I run it on a 25 million row df – Zmnako Awrahman Jun 05 '18 at 17:09

IIUC you want:

In[116]:
df_grouped = df.groupby(['cc', 'odd', 'tree1', 'tree2']).filter(lambda x: (x['day'] > 4).any())
df_grouped

Out[116]: 
    cc   odd tree1 tree2  day
4   BB  3435   SAP   ATK    3
5   BB  3435   SAP   ATK    4
6   BB  3435   SAP   ATK    5
7   BB  3435   SAP   ATK    6
8   DD  3434   ASP   ATK    2
9   DD  3434   ASP   ATK    3
10  DD  3434   ASP   ATK    4
11  DD  3434   ASP   ATK    5
12  DD  3435   SAP   ATK    1
13  DD  3435   SAP   ATK    3
14  DD  3435   SAP   ATK    5
15  DD  3435   SAP   ATK    7
16  ZZ  3434   ASP   ATK    1
17  ZZ  3434   ASP   ATK    2
18  ZZ  3434   ASP   ATK    6
19  ZZ  3434   ASP   ATK    8
20  ZZ  3435   SAP   ATK    2
21  ZZ  3435   SAP   ATK    4
22  ZZ  3435   SAP   ATK    6
23  ZZ  3435   SAP   ATK    8

So this will filter out the groups where, within the group, none of the 'day' values are greater than 4.
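To see why the attempts in the question failed (a minimal sketch with a hypothetical two-column frame): `filter` calls the function once per group and expects a single boolean back, so the per-row Series `x['day'] > 4` must be reduced with `.any()`:

```python
import pandas as pd

df = pd.DataFrame({'g': ['a', 'a', 'b', 'b'],
                   'day': [1, 2, 3, 6]})

# Returning the raw Series x['day'] > 4 raises:
#   TypeError: filter function returned a Series, but expected a scalar bool.
# Reducing with .any() gives one bool per group: group 'b' contains a 6,
# so it survives whole (including its day-3 row), while 'a' is dropped.
kept = df.groupby('g').filter(lambda x: (x['day'] > 4).any())
print(kept)
```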

timings:

%timeit df[df.day.gt(4).groupby([df.cc, df.odd, df.tree1, df.tree2]).transform('any')]
%timeit df.groupby(['cc', 'odd', 'tree1', 'tree2']).filter(lambda x: (x['day'] > 4).any())
%timeit df[df.assign(key=df.day > 4).groupby(['cc', 'odd', 'tree1', 'tree2']).key.transform('any')]
100 loops, best of 3: 5.9 ms per loop
100 loops, best of 3: 5.42 ms per loop
100 loops, best of 3: 3.62 ms per loop

So @coldspeed's first method is the fastest here.
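A sketch of how such a comparison might be reproduced on a larger frame (not from the thread; the key columns are reduced to two and the data is synthetic for brevity):

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical benchmark scaffold: build a larger synthetic frame,
# then time the transform-based and filter-based approaches.
n = 2_000
df = pd.DataFrame({
    'cc':  np.repeat(['BB', 'DD', 'ZZ'], n),
    'odd': np.tile([3434, 3435], 3 * n // 2),
    'day': np.random.randint(1, 9, 3 * n),
})

t_transform = timeit.timeit(
    lambda: df[df.day.gt(4).groupby([df.cc, df.odd]).transform('any')],
    number=3)
t_filter = timeit.timeit(
    lambda: df.groupby(['cc', 'odd']).filter(lambda x: (x['day'] > 4).any()),
    number=3)
print(f'transform: {t_transform:.4f}s  filter: {t_filter:.4f}s')
```

Both expressions return the same rows in the same order; only the wall-clock time differs, and the gap tends to grow with frame size, as the 25-million-row comment above reports.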

EdChum
  • If you can group on the boolean column, you can avoid the lambda totally. Nice answer tho. – cs95 Jun 05 '18 at 14:54
  • @coldspeed sure, will try that – EdChum Jun 05 '18 at 14:54
  • (oh, that's in my answer but sure you can try something similar) – cs95 Jun 05 '18 at 14:55
  • 1
    @coldspeed ah right, will leave it then, I'm a bit old school when it comes to `pandas` these days – EdChum Jun 05 '18 at 14:55
  • @EdChum - yes, I have similar experience, `filter` is obviously very slow - `https://stackoverflow.com/a/47415132/2901002` – jezrael Jun 05 '18 at 15:03
  • 2
    @jezrael I'd have to look at the source but I'm guessing that `filter` is multi-purpose in that it's looking to generate a boolean series for each group to determine membership and recreate the boolean mask for the groups and return a df, with `transform` you know you just have to generate a boolean mask for the original df with the same index so it's less guesswork and more scope for optimisation – EdChum Jun 05 '18 at 15:07
It seems @coldspeed's solutions are 4 times faster on a 25 million row df: `1 loop, best of 3: 21.7 s per loop` (EdChum), `1 loop, best of 3: 5.8 s per loop` (coldspeed1), `1 loop, best of 3: 5.52 s per loop` (coldspeed2) – Zmnako Awrahman Jun 05 '18 at 17:21