7

I have a table in pandas:

import pandas as pd

df = pd.DataFrame({
    'LeafID':[1,1,2,1,3,3,1,6,3,5,1],
    'pidx':[10,10,300,10,30,40,20,10,30,45,20],
    'pidy':[20,20,400,20,15,20,12,43,54,112,23],
    'count':[10,20,30,40,80,10,20,50,30,10,70],
    'score':[10,10,10,22,22,3,4,5,9,0,1]
})

LeafID  count       pidx     pidy   score
0   1       10           10        20     10
1   1       20           10        20     10
2   2       30          300       400     10
3   1       40           10        20     22
4   3       80           30        15     22
5   3       10           40        20      3
6   1       20           20        12      4
7   6       50           10        43      5
8   3       30           20        54      9
9   5       10           45       112      0
10  1       70           20        23      1

I want to do a groupby and then filter the rows where occurrence of pidx is greater than 2.

That is, filter rows where pidx is 10 and 20.

I tried using df.groupby('pidx').count() but it didn't helped me. Also for those rows I have to do 0.4*count+0.6*score.

Desired output is:

LeafID    count       pidx     pidy    final_score
   1       10           10        20
   1       20           10        20
   1       40           10        20
   6       50           10        43
   1       20           20        12
   3       30           20        54
   1       70           20        23
Shaido
  • 27,497
  • 23
  • 70
  • 73
Shubham R
  • 7,382
  • 18
  • 53
  • 119

4 Answers4

10

This is a straightforward application of filter after doing a groupby. In the data you provided, a value of 20 for pidx only occurred twice so it was filtered out.

df.groupby('pidx').filter(lambda x: len(x) > 2)

   LeafID  count  pidx  pidy
0       1     10    10    20
1       1     20    10    20
3       1     40    10    20
7       6     50    10    43
Ted Petrou
  • 59,042
  • 19
  • 131
  • 136
7

You can use value_counts with boolean indexing and isin:

df = pd.DataFrame({
    'LeafID':[1,1,2,1,3,3,1,6,3,5,1],
    'pidx':[10,10,300,10,30,40,20,10,30,45,20],
    'pidy':[20,20,400,20,15,20,12,43,54,112,23],
    'count':[10,20,30,40,80,10,20,50,30,10,70],
    'score':[10,10,10,22,22,3,4,5,9,0,1]
})
print (df)
    LeafID  count  pidx  pidy  score
0        1     10    10    20     10
1        1     20    10    20     10
2        2     30   300   400     10
3        1     40    10    20     22
4        3     80    30    15     22
5        3     10    40    20      3
6        1     20    20    12      4
7        6     50    10    43      5
8        3     30    30    54      9
9        5     10    45   112      0
10       1     70    20    23      1

s = df.pidx.value_counts()
idx = s[s>2].index
print (df[df.pidx.isin(idx)])
   LeafID  count  pidx  pidy  score
0       1     10    10    20     10
1       1     20    10    20     10
3       1     40    10    20     22
7       6     50    10    43      5

Timings:

np.random.seed(123)
N = 1000000


L1 = list('abcdefghijklmnopqrstu')
L2 = list('efghijklmnopqrstuvwxyz')
df = pd.DataFrame({'LeafId':np.random.randint(1000, size=N),
                   'pidx': np.random.randint(10000, size=N),
                   'pidy': np.random.choice(L2, N),
                   'count':np.random.randint(1000, size=N)})
print (df)


print (df.groupby('pidx').filter(lambda x: len(x) > 120))

def jez(df):
    s = df.pidx.value_counts()
    return df[df.pidx.isin(s[s>120].index)]

print (jez(df))

In [55]: %timeit (df.groupby('pidx').filter(lambda x: len(x) > 120))
1 loop, best of 3: 1.17 s per loop

In [56]: %timeit (jez(df))
10 loops, best of 3: 141 ms per loop

In [62]: %timeit (df[df.groupby('pidx').pidx.transform('size') > 120])
10 loops, best of 3: 102 ms per loop

In [63]: %timeit (df[df.groupby('pidx').pidx.transform(len) > 120])
1 loop, best of 3: 685 ms per loop

In [64]: %timeit (df[df.groupby('pidx').pidx.transform('count') > 120])
10 loops, best of 3: 104 ms per loop

For final_score you can use:

df['final_score'] = df['count'].mul(.4).add(df.score.mul(.6))
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
  • Added a score column, where Also for those rows i have to do 0.4*count+0.6*score and make a new column final_score – Shubham R Jan 24 '17 at 06:37
  • jezrael can we do this in a single step or do we have to do df.div('count').add(df.score) – Shubham R Jan 24 '17 at 06:39
  • I think you need `df['count'].mul(.4).add(df.score.mul(.6))` – jezrael Jan 24 '17 at 06:41
  • @SRingne - Solution above is more simplier, but slowier in large dataframe, please check timings. – jezrael Jan 24 '17 at 06:50
  • @piRSquared - I test it and it return bad output, please check it. – jezrael Jan 24 '17 at 07:13
  • so i guess df[df.groupby('pidx').pidx.transform('count') > 120] would be the best for larger dataframes right – Shubham R Jan 24 '17 at 07:20
  • Yes - it is `piRSquared` solution. – jezrael Jan 24 '17 at 07:21
  • @SRingne jezrael put a lot of work into evaluating all the options and your choice of solution is made obvious because of it. I'd agree with this being the chosen answer. However, if my answer was helpful, you can upvote it without choosing it as the answer. – piRSquared Jan 24 '17 at 07:36
  • 1
    @piRSquared - I test your numpy solution and it return `df` with `[1000000 rows x 4 columns]` and all other solution return `[27982 rows x 4 columns]`, so i think there has to be something wrong... :( – jezrael Jan 24 '17 at 07:38
3

pandas

df[df.groupby('pidx').pidx.transform('count') > 2]


   LeafID  count  pidx  pidy  score
0       1     10    10    20     10
1       1     20    10    20     10
3       1     40    10    20     22
7       6     50    10    43      5
piRSquared
  • 285,575
  • 57
  • 475
  • 624
0

First of all, your output shows you don't want to do a groupby. Read up on what groupby does. What you need is:

df2 = df[df['pidx']<=20]
df2.sort_index(by = 'pidx')

This will give you your exact result. Read up on pandas indexing and functions. In fact go and read the whole introduction on pandas. It will not take much time.

Row operations are also simple using indexing:

df2['final_score']= 0.4*df2['count'] + 0.6*df2['score']
V Shreyas
  • 449
  • 5
  • 19