109

I want to pass the numpy percentile() function through pandas' agg() function as I do below with various other numpy statistics functions.

Right now I have a dataframe that looks like this:

AGGREGATE   MY_COLUMN
A           10
A           12
B           5
B           9
A           84
B           22

And my code looks like this:

grouped = dataframe.groupby('AGGREGATE')
column = grouped['MY_COLUMN']
column.agg([np.sum, np.mean, np.std, np.median, np.var, np.min, np.max])

The above code works, but I want to do something like

column.agg([np.sum, np.mean, np.percentile(50), np.percentile(95)])

I.e., specify various percentiles to return from agg().

How should this be done?

Alex Waygood
  • 6,304
  • 3
  • 24
  • 46
slizb
  • 5,742
  • 4
  • 25
  • 22
  • 7
    This is probably a newer aspect of Pandas but look at http://stackoverflow.com/questions/19894939/calculate-arbitrary-percentile-on-pandas-groupby. TLDR: df.groupby('C').quantile(.95) – sachinruk Aug 25 '16 at 23:50

16 Answers16

147

Perhaps not super efficient, but one way would be to create a function yourself:

def percentile(n):
    def percentile_(x):
        return x.quantile(n)
    percentile_.__name__ = 'percentile_{:02.0f}'.format(n*100)
    return percentile_

Then include this in your agg:

In [11]: column.agg([np.sum, np.mean, np.std, np.median,
                     np.var, np.min, np.max, percentile(50), percentile(95)])
Out[11]:
           sum       mean        std  median          var  amin  amax  percentile_50  percentile_95
AGGREGATE
A          106  35.333333  42.158431      12  1777.333333    10    84             12           76.8
B           36  12.000000   8.888194       9    79.000000     5    22             12           76.8

Note sure this is how it should be done though...

Roelant
  • 4,508
  • 1
  • 32
  • 62
Andy Hayden
  • 359,921
  • 101
  • 625
  • 535
  • 5
    This had multiple issues for me, see [my answer below](https://stackoverflow.com/a/54593214/4629950). – Thomas Feb 08 '19 at 13:14
  • 1
    Since this is the accepted answer, so many people will just blind copy it, I've updated the function @Thomas. – Roelant Jul 05 '23 at 15:48
76

You can have agg() use a custom function to be executed on specified column:

# 50th Percentile
def q50(x):
    return x.quantile(0.5)

# 90th Percentile
def q90(x):
    return x.quantile(0.9)

my_DataFrame.groupby(['AGGREGATE']).agg({'MY_COLUMN': [q50, q90, 'max']})
Jaroslav Bezděk
  • 6,967
  • 6
  • 29
  • 46
Arun Nalpet
  • 1,190
  • 1
  • 9
  • 20
37

Being more specific, if you just want to aggregate your pandas groupby results using the percentile function, the python lambda function offers a pretty neat solution. Using the question's notation, aggregating by the percentile 95, should be:

dataframe.groupby('AGGREGATE').agg(lambda x: np.percentile(x['COL'], q = 95))

You can also assign this function to a variable and use it in conjunction with other aggregation functions.

prl900
  • 4,029
  • 4
  • 33
  • 40
  • I'm getting the error TypeError: Must provide 'func' or tuples of '(column, aggfunc). Any idea what might be happening? – Dumb ML Aug 19 '20 at 12:12
  • Although this looks pretty but def. efficient if you work with big data – Areza Mar 30 '21 at 09:02
27

I believe the idiomatic way to do this in pandas is:

df.groupby("AGGREGATE").quantile([0, 0.25, 0.5, 0.75, 0.95, 1])
jvans
  • 2,765
  • 2
  • 22
  • 23
23

I really like the solution Andy Hayden gave, however, this had multiple issues for me:

  • If the dataframe has multiple columns, it aggregated over the columns instead of over the rows?
  • For me, the row names were percentile_0.5 (dot instead of underscore). Not sure what caused this, probably that I am using Python 3.
  • Need to import numpy as well instead of staying in pandas (I know, numpy is imported implicitely in pandas...)

Here is an updated version that fixes these issues:

def percentile(n):
    def percentile_(x):
        return x.quantile(n)
    percentile_.__name__ = 'percentile_{:02.0f}'.format(n*100)
    return percentile_
Roelant
  • 4,508
  • 1
  • 32
  • 62
Thomas
  • 4,696
  • 5
  • 36
  • 71
20

Try this for the 50% and 95% percentile:

column.describe(percentiles=[0.5, 0.95])
Hugolmn
  • 1,530
  • 1
  • 7
  • 20
scottlittle
  • 18,866
  • 8
  • 51
  • 70
14

For situations where all you need is a subset of the describe (typically the most common needed statistics) you can just index the returned pandas series without needing any extra functions.

For example, I commonly find myself just needing to present the 25th, median, 75th and count. This can be done in just one line like so:

columns.agg('describe')[['25%', '50%', '75%', 'count']]

For specifying your own set of percentiles, the chosen answer is a good choice, but for simple use case, there is no need for extra functions.

Maxim
  • 725
  • 1
  • 8
  • 24
12
df.groupby("AGGREGATE").describe(percentiles=[0, 0.25, 0.5, 0.75, 0.95, 1])

by default describe function give us mean, count, std, min, max, and with percentiles array you can choose the needed percentiles.

user2161065
  • 1,826
  • 1
  • 18
  • 18
Agredalopez
  • 121
  • 1
  • 2
10

More efficient solution with pandas.Series.quantile method:

df.groupby("AGGREGATE").agg(("YOUR_COL_NAME", lambda x: x.quantile(0.5))

With several percentile values

percentiles = [0.5, 0.9, 0.99]
quantile_funcs = [(p, lambda x: x.quantile(p)) for p in percentiles]
df.groupby("AGGREGATE").agg(quantile_funcs)
Antiez
  • 679
  • 7
  • 11
6

Just to throw a more general solution into the ring. Assume you have a DF with just one column to group:

df = pd.DataFrame((('A',10),('A',12),('B',5),('B',9),('A',84),('B',22)), 
                    columns=['My_KEY', 'MY_COL1'])

One can aggregate and calcualte basically any descriptive metric with a list of anonymous (lambda) functions like:

df.groupby(['My_KEY']).agg( [np.sum, np.mean, lambda x: np.percentile(x, q=25)] )

However, if you have multiple columns to aggregate, you have to call a non anonymous function or call the columns explicitly:

df = pd.DataFrame((('A',10,3),('A',12,4),('B',5,6),('B',9,3),('A',84,2),('B',22,1)), 
                    columns=['My_KEY', 'MY_COL1', 'MY_COL2'])

# non-anonymous function
def percentil25 (x): 
    return np.percentile(x, q=25)

# type 1: call for both columns 
df.groupby(['My_KEY']).agg( [np.sum, np.mean, percentil25 ]  )

# type 2: call each column separately
df.groupby(['My_KEY']).agg( {'MY_COL1': [np.sum, np.mean, lambda x: np.percentile(x, q=25)],
                             'MY_COL2': np.size})
magraf
  • 420
  • 5
  • 8
6

You can also perhaps use lambda to achieve the same. Some thing like below piece of code :

        agg(
            lambda x: [
                np.min(a=x), 
                np.percentile(q=25,a=x), 
                np.median(a=x), 
                np.percentile(q=75,a=x), 
                np.max(a=x)
    ]
)
4

This can provide some customization:

list_statistics = ['count','mean','min',lambda x: np.percentile(x,q=25),'max',lambda x: np.percentile(x,q=75)]
cols_to_rename = {'<lambda_0>':'P25','<lambda_1>':'P75'}
df_out.groupby('Country').agg(list_statistics).rename(columns=cols_to_rename)
Jose Rondon
  • 370
  • 2
  • 6
  • 13
0

Multiple function can be called as below:

import pandas as pd

import numpy as np

import random

C = ['Ram', 'Ram', 'Shyam', 'Shyam', 'Mahima', 'Ram', 'Ram', 'Shyam', 'Shyam', 'Mahima']

A = [ random.randint(0,100) for i in range(10) ]

B = [ random.randint(0,100) for i in range(10) ]

df = pd.DataFrame({ 'field_A': A, 'field_B': B, 'field_C': C })

print(df)

d = df.groupby('field_C')['field_A'].describe()[['mean', 'count', '25%', '50%', '75%']]
print(d)

I was unable to call median in this, but able to work other functions.

Sangam Belose
  • 4,262
  • 8
  • 26
  • 48
Fakira
  • 9
  • 2
  • 1
    this calls all of them, but selects a few. this is bad for performance, which is the reason why you would use `agg` over describe. – Sebastian Wozny May 10 '19 at 18:15
  • @SebastianWozny may be you can update your comment on which solution do you recommend when dealing with big data – Areza Mar 30 '21 at 09:03
0

You can define simple lambda function

my_quant = lambda _q: lambda _data:_data.quantile(_q)

the use it with the required quantiles

df.aggregate([my_quant(0.1),my_quant(0.3),'median',my_quant(0.7)]

** If you want to keep the aggregations name without renaming the indexes you can create lambda creator and set the function names

def lambda_creator(func,name):
    func.__name__ = name
    return func

my_quant = lambda _q: lambda_creator(lambda _data: _data.quantile(_q),f'quantile_{_q}')
רועי שמש
  • 111
  • 1
  • 3
0

Get Q1 result using agg from dataframe groupby, recommend below also.

import numpy as np 
df.groupby(['col1','col2']).agg(quantile_columns = ('column_name', lambda x: np.percentile(x, q=25)))
johanna
  • 401
  • 4
  • 6
-1

In case you have a dataframe with several columns and only want the quantiles for one column:

df.groupby("AGGREGATE")['MY_COLUMN'].quantile([0, 0.25, 0.5, 0.75, 0.95, 1])

and in case you want a 1 level dataframe you can add:

df.groupby("AGGREGATE")['MY_COLUMN'].quantile([0, 0.25, 0.5, 0.75, 0.95, 1]).reset_index()
PV8
  • 5,799
  • 7
  • 43
  • 87