Pass percentiles to pandas agg function

Question

I want to pass the numpy percentile() function through pandas' agg() function as I do below with various other numpy statistics functions.

Right now I have a dataframe that looks like this:

AGGREGATE   MY_COLUMN
A           10
A           12
B           5
B           9
A           84
B           22

And my code looks like this:

grouped = dataframe.groupby('AGGREGATE')
column = grouped['MY_COLUMN']
column.agg([np.sum, np.mean, np.std, np.median, np.var, np.min, np.max])

The above code works, but I want to do something like

column.agg([np.sum, np.mean, np.percentile(50), np.percentile(95)])

I.e., specify various percentiles to return from agg().

How should this be done?

This is probably a newer aspect of Pandas but look at http://stackoverflow.com/questions/19894939/calculate-arbitrary-percentile-on-pandas-groupby. TLDR: df.groupby('C').quantile(.95) — sachinruk, Aug 25 '16 at 23:50

score 147 · Accepted Answer · edited Jul 05 '23 at 15:47

147

Perhaps not super efficient, but one way would be to create a function yourself:

def percentile(n):
    def percentile_(x):
        return x.quantile(n)
    percentile_.__name__ = 'percentile_{:02.0f}'.format(n*100)
    return percentile_

Then include this in your agg:

In [11]: column.agg([np.sum, np.mean, np.std, np.median,
                     np.var, np.min, np.max, percentile(50), percentile(95)])
Out[11]:
           sum       mean        std  median          var  amin  amax  percentile_50  percentile_95
AGGREGATE
A          106  35.333333  42.158431      12  1777.333333    10    84             12           76.8
B           36  12.000000   8.888194       9    79.000000     5    22             12           76.8

Note sure this is how it should be done though...

edited Jul 05 '23 at 15:47

Roelant

4,508
1
32
62

answered Jul 10 '13 at 18:51

Andy Hayden

359,921
101
625
535

5

This had multiple issues for me, see [my answer below](https://stackoverflow.com/a/54593214/4629950). – Thomas Feb 08 '19 at 13:14
1

Since this is the accepted answer, so many people will just blind copy it, I've updated the function @Thomas. – Roelant Jul 05 '23 at 15:48

score 76 · Answer 2 · edited Dec 19 '20 at 12:19

76

You can have agg() use a custom function to be executed on specified column:

# 50th Percentile
def q50(x):
    return x.quantile(0.5)

# 90th Percentile
def q90(x):
    return x.quantile(0.9)

my_DataFrame.groupby(['AGGREGATE']).agg({'MY_COLUMN': [q50, q90, 'max']})

edited Dec 19 '20 at 12:19

Jaroslav Bezděk

6,967
6
29
46

answered Jul 18 '19 at 06:02

Arun Nalpet

1,190
1
9
20

prl900 · Answer 3 · 2013-11-27T03:01:11.350

37

Being more specific, if you just want to aggregate your pandas groupby results using the percentile function, the python lambda function offers a pretty neat solution. Using the question's notation, aggregating by the percentile 95, should be:

dataframe.groupby('AGGREGATE').agg(lambda x: np.percentile(x['COL'], q = 95))

You can also assign this function to a variable and use it in conjunction with other aggregation functions.

edited Nov 27 '13 at 03:01

answered Nov 27 '13 at 02:55

prl900

4,029
4
33
40

I'm getting the error TypeError: Must provide 'func' or tuples of '(column, aggfunc). Any idea what might be happening? – Dumb ML Aug 19 '20 at 12:12
Although this looks pretty but def. efficient if you work with big data – Areza Mar 30 '21 at 09:02

jvans · Answer 4 · 2019-11-01T21:31:14.513

27

I believe the idiomatic way to do this in pandas is:

df.groupby("AGGREGATE").quantile([0, 0.25, 0.5, 0.75, 0.95, 1])

edited Nov 01 '19 at 21:31

answered Oct 29 '19 at 15:22

jvans

2,765
2
22
23

score 23 · Answer 5 · edited Jul 05 '23 at 15:23

23

I really like the solution Andy Hayden gave, however, this had multiple issues for me:

If the dataframe has multiple columns, it aggregated over the columns instead of over the rows?
For me, the row names were percentile_0.5 (dot instead of underscore). Not sure what caused this, probably that I am using Python 3.
Need to import numpy as well instead of staying in pandas (I know, numpy is imported implicitely in pandas...)

Here is an updated version that fixes these issues:

def percentile(n):
    def percentile_(x):
        return x.quantile(n)
    percentile_.__name__ = 'percentile_{:02.0f}'.format(n*100)
    return percentile_

edited Jul 05 '23 at 15:23

Roelant

4,508
1
32
62

answered Feb 08 '19 at 13:13

Thomas

4,696
5
36
71

2

Do you intend `return x.quantile(n)` in your version? – lighthouse65 Jun 02 '19 at 22:28
Nice catch! I definitely did, thanks for mentioning it. I'll edit it. – Thomas Jun 03 '19 at 13:26
4

I think the format `{:02.0f}` would be better to avoid spaces for single digit percent values. – ingomueller.net Sep 26 '19 at 13:37

score 20 · Answer 6 · edited Jul 06 '20 at 21:46

20

Try this for the 50% and 95% percentile:

column.describe(percentiles=[0.5, 0.95])

edited Jul 06 '20 at 21:46

Hugolmn

1,530
1
7
20

answered Oct 02 '17 at 19:39

scottlittle

18,866
8
51
70

score 14 · Answer 7 · answered May 30 '19 at 18:08

For situations where all you need is a subset of the describe (typically the most common needed statistics) you can just index the returned pandas series without needing any extra functions.

For example, I commonly find myself just needing to present the 25th, median, 75th and count. This can be done in just one line like so:

columns.agg('describe')[['25%', '50%', '75%', 'count']]

For specifying your own set of percentiles, the chosen answer is a good choice, but for simple use case, there is no need for extra functions.

score 12 · Answer 8 · edited Sep 23 '20 at 00:14

12

df.groupby("AGGREGATE").describe(percentiles=[0, 0.25, 0.5, 0.75, 0.95, 1])

by default describe function give us mean, count, std, min, max, and with percentiles array you can choose the needed percentiles.

edited Sep 23 '20 at 00:14

user2161065

1,826
1
18
18

answered Dec 18 '19 at 16:27

Agredalopez

121
1
2

score 10 · Answer 9 · answered Mar 16 '21 at 16:30

More efficient solution with pandas.Series.quantile method:

df.groupby("AGGREGATE").agg(("YOUR_COL_NAME", lambda x: x.quantile(0.5))

With several percentile values

percentiles = [0.5, 0.9, 0.99]
quantile_funcs = [(p, lambda x: x.quantile(p)) for p in percentiles]
df.groupby("AGGREGATE").agg(quantile_funcs)

score 6 · Answer 10 · answered Mar 17 '20 at 10:47

Just to throw a more general solution into the ring. Assume you have a DF with just one column to group:

df = pd.DataFrame((('A',10),('A',12),('B',5),('B',9),('A',84),('B',22)), 
                    columns=['My_KEY', 'MY_COL1'])

One can aggregate and calcualte basically any descriptive metric with a list of anonymous (lambda) functions like:

df.groupby(['My_KEY']).agg( [np.sum, np.mean, lambda x: np.percentile(x, q=25)] )

However, if you have multiple columns to aggregate, you have to call a non anonymous function or call the columns explicitly:

df = pd.DataFrame((('A',10,3),('A',12,4),('B',5,6),('B',9,3),('A',84,2),('B',22,1)), 
                    columns=['My_KEY', 'MY_COL1', 'MY_COL2'])

# non-anonymous function
def percentil25 (x): 
    return np.percentile(x, q=25)

# type 1: call for both columns 
df.groupby(['My_KEY']).agg( [np.sum, np.mean, percentil25 ]  )

# type 2: call each column separately
df.groupby(['My_KEY']).agg( {'MY_COL1': [np.sum, np.mean, lambda x: np.percentile(x, q=25)],
                             'MY_COL2': np.size})

score 6 · Answer 11 · answered Jun 05 '21 at 07:17

6

You can also perhaps use lambda to achieve the same. Some thing like below piece of code :

        agg(
            lambda x: [
                np.min(a=x), 
                np.percentile(q=25,a=x), 
                np.median(a=x), 
                np.percentile(q=75,a=x), 
                np.max(a=x)
    ]
)

answered Jun 05 '21 at 07:17

Ankit Khandelwal

71
1
2

how is this different from the accepted answer? – StupidWolf Jun 05 '21 at 08:21
Well, only difference is, you dont need to define a new function. Saves some lines of Code. – Ankit Khandelwal Jun 06 '21 at 04:19
1

How do you name those function headers? like np.min(a=x) how do you name the header for that function? – Shivam Sahil Jul 19 '21 at 08:02

Jose Rondon · Answer 12 · 2022-06-08T07:00:43.377

4

This can provide some customization:

list_statistics = ['count','mean','min',lambda x: np.percentile(x,q=25),'max',lambda x: np.percentile(x,q=75)]
cols_to_rename = {'<lambda_0>':'P25','<lambda_1>':'P75'}
df_out.groupby('Country').agg(list_statistics).rename(columns=cols_to_rename)

edited Jun 08 '22 at 07:00

answered Apr 29 '22 at 11:12

Jose Rondon

370
2
6
13

score 0 · Answer 13 · edited Nov 15 '18 at 08:10

0

Multiple function can be called as below:

import pandas as pd

import numpy as np

import random

C = ['Ram', 'Ram', 'Shyam', 'Shyam', 'Mahima', 'Ram', 'Ram', 'Shyam', 'Shyam', 'Mahima']

A = [ random.randint(0,100) for i in range(10) ]

B = [ random.randint(0,100) for i in range(10) ]

df = pd.DataFrame({ 'field_A': A, 'field_B': B, 'field_C': C })

print(df)

d = df.groupby('field_C')['field_A'].describe()[['mean', 'count', '25%', '50%', '75%']]
print(d)

I was unable to call median in this, but able to work other functions.

edited Nov 15 '18 at 08:10

Sangam Belose

4,262
8
26
48

answered Nov 15 '18 at 06:21

Fakira

9
2

1

this calls all of them, but selects a few. this is bad for performance, which is the reason why you would use `agg` over describe. – Sebastian Wozny May 10 '19 at 18:15
@SebastianWozny may be you can update your comment on which solution do you recommend when dealing with big data – Areza Mar 30 '21 at 09:03

רועי שמש · Answer 14 · 2023-03-04T19:16:14.033

You can define simple lambda function

my_quant = lambda _q: lambda _data:_data.quantile(_q)

the use it with the required quantiles

df.aggregate([my_quant(0.1),my_quant(0.3),'median',my_quant(0.7)]

** If you want to keep the aggregations name without renaming the indexes you can create lambda creator and set the function names

def lambda_creator(func,name):
    func.__name__ = name
    return func

my_quant = lambda _q: lambda_creator(lambda _data: _data.quantile(_q),f'quantile_{_q}')

score 0 · Answer 15 · answered May 18 '23 at 09:45

0

Get Q1 result using agg from dataframe groupby, recommend below also.

import numpy as np 
df.groupby(['col1','col2']).agg(quantile_columns = ('column_name', lambda x: np.percentile(x, q=25)))

answered May 18 '23 at 09:45

johanna

401
4
6

score -1 · Answer 16 · answered Jul 29 '22 at 10:37

In case you have a dataframe with several columns and only want the quantiles for one column:

df.groupby("AGGREGATE")['MY_COLUMN'].quantile([0, 0.25, 0.5, 0.75, 0.95, 1])

and in case you want a 1 level dataframe you can add:

df.groupby("AGGREGATE")['MY_COLUMN'].quantile([0, 0.25, 0.5, 0.75, 0.95, 1]).reset_index()

Pass percentiles to pandas agg function

16 Answers16

Linked

Related