5

I have a pandas dataframe as follows:

df = pd.DataFrame()
df['Name'] = ['Abby', 'Abby', 'Abby', 'Abby', 'Abby', 'Daniel', 'Daniel', 'Daniel', 'Daniel', 'Daniel']
df['Marks'] = [100, 90, 76, 50, 10, 50, 45, 38, 25, 5]

I want to:

  1. Find the 40th percentile for each group
  2. Filter the dataframe such that all the values above the 40th percentile for that group are shown.

So, I have found the 40th percentile for each group using:

df.groupby('Name').quantile(0.4)

The Aim is to get to:

enter image description here

My main issue is that the values for each group are not standardized and so I cannot apply an overall percentile value for the entire dataset.

But all the help I saw regarding filtering a dataframe with a certain value does not do it separately for each group. I have seen the following questions:

Pandas, groupby where column value is greater than x

Pandas Groupby apply function to count values greater than zero

My question essentially builds on a variation of the following question: Calculate Arbitrary Percentile on Pandas GroupBy

Is there a way to do this in Pandas?

DaytaSigntist
  • 117
  • 2
  • 8

2 Answers2

4

You can using transform

df[df.Marks>df.groupby('Name').Marks.transform('quantile',0.4)]
Out[712]: 
     Name  Marks
0    Abby    100
1    Abby     90
2    Abby     76
5  Daniel     50
6  Daniel     45
7  Daniel     38
BENY
  • 317,841
  • 20
  • 164
  • 234
  • is it possible to use this approach for the expanding quantile instead of the full-sample quantile? – Yuca Sep 19 '18 at 20:36
  • 1
    I came up with same solution, only 4 mins late:) – Vaishali Sep 19 '18 at 20:37
  • @Yuca what you mean the expending quantile ? I think we can using lambda here – BENY Sep 19 '18 at 20:57
  • i'm sure you know about rolling vs expanding in pandas. The expanding quantile would be the quantile available at each rows. I was just curious because I do that with numba and assign – Yuca Sep 19 '18 at 20:59
  • @Yuca `df.groupby('Name').Marks.expanding().quantile(0.5) ` – BENY Sep 19 '18 at 21:07
1

Using your code for percentiles, and loc, ge for >= (or gt, for >) and index matching:

df = df.set_index('Name')
df.loc[df.Marks.ge(df.groupby('Name').quantile(0.4).Marks)]

    Name    Marks
0   Abby    100
1   Abby    90
2   Abby    76
5   Daniel  50
6   Daniel  45
7   Daniel  38
rafaelc
  • 57,686
  • 15
  • 58
  • 82