Filter rows after groupby pandas

Question

I have a table in pandas:

import pandas as pd

df = pd.DataFrame({
    'LeafID':[1,1,2,1,3,3,1,6,3,5,1],
    'pidx':[10,10,300,10,30,40,20,10,30,45,20],
    'pidy':[20,20,400,20,15,20,12,43,54,112,23],
    'count':[10,20,30,40,80,10,20,50,30,10,70],
    'score':[10,10,10,22,22,3,4,5,9,0,1]
})

LeafID  count       pidx     pidy   score
0   1       10           10        20     10
1   1       20           10        20     10
2   2       30          300       400     10
3   1       40           10        20     22
4   3       80           30        15     22
5   3       10           40        20      3
6   1       20           20        12      4
7   6       50           10        43      5
8   3       30           20        54      9
9   5       10           45       112      0
10  1       70           20        23      1

I want to do a groupby and then filter the rows where occurrence of pidx is greater than 2.

That is, filter rows where pidx is 10 and 20.

I tried using df.groupby('pidx').count() but it didn't helped me. Also for those rows I have to do 0.4*count+0.6*score.

Desired output is:

LeafID    count       pidx     pidy    final_score
   1       10           10        20
   1       20           10        20
   1       40           10        20
   6       50           10        43
   1       20           20        12
   3       30           20        54
   1       70           20        23

@jezrael see edit , updated my desired output, i want all the rows where occurance of pidx is greater than 2, i'e 10 occurs 4 times and 2 occurs 3 times — Shubham R, Jan 24 '17 at 06:32
Pandas is a very powerful tool. Use indexing for cleaner code. See my answer. — V Shreyas, Jan 24 '17 at 07:01

score 10 · Answer 1 · answered Jan 24 '17 at 06:34

This is a straightforward application of filter after doing a groupby. In the data you provided, a value of 20 for pidx only occurred twice so it was filtered out.

df.groupby('pidx').filter(lambda x: len(x) > 2)

   LeafID  count  pidx  pidy
0       1     10    10    20
1       1     20    10    20
3       1     40    10    20
7       6     50    10    43

jezrael · Accepted Answer · 2017-01-24T06:57:36.880

You can use value_counts with boolean indexing and isin:

df = pd.DataFrame({
    'LeafID':[1,1,2,1,3,3,1,6,3,5,1],
    'pidx':[10,10,300,10,30,40,20,10,30,45,20],
    'pidy':[20,20,400,20,15,20,12,43,54,112,23],
    'count':[10,20,30,40,80,10,20,50,30,10,70],
    'score':[10,10,10,22,22,3,4,5,9,0,1]
})
print (df)
    LeafID  count  pidx  pidy  score
0        1     10    10    20     10
1        1     20    10    20     10
2        2     30   300   400     10
3        1     40    10    20     22
4        3     80    30    15     22
5        3     10    40    20      3
6        1     20    20    12      4
7        6     50    10    43      5
8        3     30    30    54      9
9        5     10    45   112      0
10       1     70    20    23      1

s = df.pidx.value_counts()
idx = s[s>2].index
print (df[df.pidx.isin(idx)])
   LeafID  count  pidx  pidy  score
0       1     10    10    20     10
1       1     20    10    20     10
3       1     40    10    20     22
7       6     50    10    43      5

Timings:

np.random.seed(123)
N = 1000000


L1 = list('abcdefghijklmnopqrstu')
L2 = list('efghijklmnopqrstuvwxyz')
df = pd.DataFrame({'LeafId':np.random.randint(1000, size=N),
                   'pidx': np.random.randint(10000, size=N),
                   'pidy': np.random.choice(L2, N),
                   'count':np.random.randint(1000, size=N)})
print (df)


print (df.groupby('pidx').filter(lambda x: len(x) > 120))

def jez(df):
    s = df.pidx.value_counts()
    return df[df.pidx.isin(s[s>120].index)]

print (jez(df))

In [55]: %timeit (df.groupby('pidx').filter(lambda x: len(x) > 120))
1 loop, best of 3: 1.17 s per loop

In [56]: %timeit (jez(df))
10 loops, best of 3: 141 ms per loop

In [62]: %timeit (df[df.groupby('pidx').pidx.transform('size') > 120])
10 loops, best of 3: 102 ms per loop

In [63]: %timeit (df[df.groupby('pidx').pidx.transform(len) > 120])
1 loop, best of 3: 685 ms per loop

In [64]: %timeit (df[df.groupby('pidx').pidx.transform('count') > 120])
10 loops, best of 3: 104 ms per loop

For final_score you can use:

df['final_score'] = df['count'].mul(.4).add(df.score.mul(.6))

Added a score column, where Also for those rows i have to do 0.4*count+0.6*score and make a new column final_score — Shubham R, Jan 24 '17 at 06:37
jezrael can we do this in a single step or do we have to do df.div('count').add(df.score) — Shubham R, Jan 24 '17 at 06:39
I think you need `df['count'].mul(.4).add(df.score.mul(.6))` — jezrael, Jan 24 '17 at 06:41
@SRingne - Solution above is more simplier, but slowier in large dataframe, please check timings. — jezrael, Jan 24 '17 at 06:50
@piRSquared - I test it and it return bad output, please check it. — jezrael, Jan 24 '17 at 07:13
so i guess df[df.groupby('pidx').pidx.transform('count') > 120] would be the best for larger dataframes right — Shubham R, Jan 24 '17 at 07:20
@SRingne jezrael put a lot of work into evaluating all the options and your choice of solution is made obvious because of it. I'd agree with this being the chosen answer. However, if my answer was helpful, you can upvote it without choosing it as the answer. — piRSquared, Jan 24 '17 at 07:36
@piRSquared - I test your numpy solution and it return `df` with `[1000000 rows x 4 columns]` and all other solution return `[27982 rows x 4 columns]`, so i think there has to be something wrong... :( — jezrael, Jan 24 '17 at 07:38

piRSquared · Answer 3 · 2017-01-24T07:40:38.710

3

pandas

df[df.groupby('pidx').pidx.transform('count') > 2]


   LeafID  count  pidx  pidy  score
0       1     10    10    20     10
1       1     20    10    20     10
3       1     40    10    20     22
7       6     50    10    43      5

edited Jan 24 '17 at 07:40

answered Jan 24 '17 at 06:33

piRSquared

285,575
57
475
624

V Shreyas · Answer 4 · 2017-01-24T06:59:51.390

First of all, your output shows you don't want to do a groupby. Read up on what groupby does. What you need is:

df2 = df[df['pidx']<=20]
df2.sort_index(by = 'pidx')

This will give you your exact result. Read up on pandas indexing and functions. In fact go and read the whole introduction on pandas. It will not take much time.

Row operations are also simple using indexing:

df2['final_score']= 0.4*df2['count'] + 0.6*df2['score']

Filter rows after groupby pandas

4 Answers4

Linked