1

For the record, I've read these following threads but none of them seems to fulfill my need:

Say I have this following table df:

 user_id  is_manually  created_per_week
----------------------------------------
    10       True             59
    10      False             90
    33       True              0
    33      False             64
    50       True              0
    50      False              0

I want to exclude the users who have created nothing, i.e. created_per_week is 0 in both rows of is_manually True and False, which is user 50 in this case.

 user_id  is_manually  created_per_week
----------------------------------------
    10       True             59
    10      False             90
    33       True              0
    33      False             64

I learned that df.groupby doesn't have query method and should use apply instead.

The closest answer I've got is df.groupby("user_id").apply(lambda x: x[x["created_per_week"] > 0]), but it also excludes the row of user 33 manually True, which is undesirable. I've also tried df.groupby("user_id").apply(lambda x: x[any(x["created_per_week"] > 0)]) but it returns a KeyError.

In other words, I am searching the equivalence of df %>% group_by(user_id) %>% filter(any(created_per_week > 0)) in R. Thanks.

ytu
  • 1,822
  • 3
  • 19
  • 42

2 Answers2

1

transform + any

df[df.assign(New=df.created_per_week==0).groupby('user_id').created_per_week.transform('any')]
Out[425]: 
   user_id  is_manually  created_per_week
0       10         True                59
1       10        False                90
2       33         True                 0
3       33        False                64

Or simply by using loc+isin

df.loc[df.user_id.isin(df[df.created_per_week!=0].user_id)]
Out[426]: 
   user_id  is_manually  created_per_week
0       10         True                59
1       10        False                90
2       33         True                 0
3       33        False                64

From PiR

f, u = pd.factorize(df.user_id); df[np.bincount(f, df.created_per_week)[f] > 0]
BENY
  • 317,841
  • 20
  • 164
  • 234
0

You can apply groupby then filter command to get the output.

df.groupby('user_id').filter(lambda x: (x['created_per_week'] != 0).any())

    user_id is_manually created_per_week
0       10        True                59
1       10       False                90
2       33        True                 0
3       33       False                64
YOLO
  • 20,181
  • 5
  • 20
  • 40
  • So `DataFrameGroupBy.filter` does quite different thing with `DataFrame.filter`... How confusing. Thanks, anyway! – ytu Mar 21 '18 at 03:21
  • Yes, `DataFrame.filter` is for selecting column or indexes based on their names. This one is to select rows given a condition. It's confusing. – YOLO Mar 21 '18 at 10:51