Pandas: Calculate Median of Group over Columns

Question

Given the following data frame:

import pandas as pd

df = pd.DataFrame({'COL1': ['A', 'A','A','A','B','B'], 
                   'COL2' : ['AA','AA','BB','BB','BB','BB'],
                   'COL3' : [2,3,4,5,4,2],
                   'COL4' : [0,1,2,3,4,2]})
df
    COL1    COL2    COL3    COL4
0    A       AA      2       0
1    A       AA      3       1
2    A       BB      4       2
3    A       BB      5       3
4    B       BB      4       4
5    B       BB      2       2

I would like, as efficiently as possible (e.g. via groupby and a lambda, or better), to find the median of the COL3 and COL4 values taken together for each distinct combination of COL1 and COL2.

The desired result is as follows:

    COL1    COL2    COL3    COL4  MEDIAN
0    A       AA      2       0    1.5
1    A       AA      3       1    1.5
2    A       BB      4       2    3.5
3    A       BB      5       3    3.5
4    B       BB      4       4    3
5    B       BB      2       2    3

Thanks in advance!

So far, just this: df['MEDIAN']=df.groupby(['COL1','COL2'])[['COL3','COL4']].transform(lambda x: x.median()) — Dance Party, Feb 08 '16 at 03:40
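A quick check of what that attempt returns may help explain why it cannot be assigned to a single column. This is only a sketch on the question's sample frame; `per_col` is a name introduced here for illustration.

```python
import pandas as pd

df = pd.DataFrame({'COL1': ['A', 'A', 'A', 'A', 'B', 'B'],
                   'COL2': ['AA', 'AA', 'BB', 'BB', 'BB', 'BB'],
                   'COL3': [2, 3, 4, 5, 4, 2],
                   'COL4': [0, 1, 2, 3, 4, 2]})

# transform calls the lambda once per column within each group, so the
# result has one column per selected input column, aligned to df's rows.
per_col = df.groupby(['COL1', 'COL2'])[['COL3', 'COL4']].transform(lambda x: x.median())

print(per_col.shape)  # (6, 2)
print(per_col)
```

The result holds two separate per-column medians for every row, not one median pooled over COL3 and COL4, so it does not fit into a single MEDIAN column.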

score 9 · Accepted Answer · answered Feb 08 '16 at 03:40

You already had the idea: group by COL1 and COL2 and calculate the median.

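import numpy as np

# np.median flattens each group's COL3/COL4 values, so every group gets one pooled median
m = df.groupby(['COL1', 'COL2'])[['COL3','COL4']].apply(np.median)
m.name = 'MEDIAN'

# join the per-group medians back onto the original rows
print(df.join(m, on=['COL1', 'COL2']))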

  COL1 COL2  COL3  COL4  MEDIAN
0    A   AA     2     0     1.5
1    A   AA     3     1     1.5
2    A   BB     4     2     3.5
3    A   BB     5     3     3.5
4    B   BB     4     4     3.0
5    B   BB     2     2     3.0
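An alternative sketch that stays within pandas, assuming it is acceptable to reshape the frame first: melt COL3 and COL4 into one long column, take the group median of that column, and join it back. The names `pooled` and `result` are introduced here for illustration only.

```python
import pandas as pd

df = pd.DataFrame({'COL1': ['A', 'A', 'A', 'A', 'B', 'B'],
                   'COL2': ['AA', 'AA', 'BB', 'BB', 'BB', 'BB'],
                   'COL3': [2, 3, 4, 5, 4, 2],
                   'COL4': [0, 1, 2, 3, 4, 2]})

# Stack COL3 and COL4 into a single 'value' column so that each
# (COL1, COL2) group holds all of its numbers in one place.
pooled = df.melt(id_vars=['COL1', 'COL2'], value_vars=['COL3', 'COL4'])

# One median per group, named so it becomes the new column on join.
m = pooled.groupby(['COL1', 'COL2'])['value'].median().rename('MEDIAN')

result = df.join(m, on=['COL1', 'COL2'])
print(result)
```

Because pandas' own `median` skips missing values by default, this variant should also tolerate NaN entries without extra handling.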

Happy001

Thanks! What if I have some NaN values? How can I get it to ignore those without resulting in NaN results (as is the case with current NaN values with your solution applied)? – Dance Party Feb 08 '16 at 04:02
use `np.nanmedian` instead of `np.median` – Happy001 Feb 08 '16 at 13:46
@Happy001 `df.groupby(['COL1', 'COL2'])[['COL3','COL4']].median()` would work as well. – Qaswed Sep 24 '19 at 10:53
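To illustrate the `np.nanmedian` suggestion, here is a minimal sketch. The NaN placed in COL3 is hypothetical data added for the example, and the lambda with `to_numpy()` is just one way to hand each group's values to NumPy as a plain array.

```python
import numpy as np
import pandas as pd

# Same frame as in the question, with one value replaced by NaN (hypothetical).
df = pd.DataFrame({'COL1': ['A', 'A', 'A', 'A', 'B', 'B'],
                   'COL2': ['AA', 'AA', 'BB', 'BB', 'BB', 'BB'],
                   'COL3': [2, np.nan, 4, 5, 4, 2],
                   'COL4': [0, 1, 2, 3, 4, 2]})

# np.nanmedian ignores NaN entries when computing each group's pooled median.
m = (df.groupby(['COL1', 'COL2'])[['COL3', 'COL4']]
       .apply(lambda g: np.nanmedian(g.to_numpy()))
       .rename('MEDIAN'))

result = df.join(m, on=['COL1', 'COL2'])
print(result)
```

For the (A, AA) group the remaining values are 2, 0 and 1, so its median becomes 1.0 rather than NaN.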

score 1 · Answer 2 · answered Feb 04 '21 at 10:09

df.groupby(['COL1', 'COL2']).median()[['COL3','COL4']]
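One thing worth checking before relying on this one-liner (or the `.median()` variant from the comments): it computes a separate median for each column, which is not the same as one median pooled over COL3 and COL4. The sketch below compares the two on the question's data; `per_column` and `pooled` are illustrative names.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'COL1': ['A', 'A', 'A', 'A', 'B', 'B'],
                   'COL2': ['AA', 'AA', 'BB', 'BB', 'BB', 'BB'],
                   'COL3': [2, 3, 4, 5, 4, 2],
                   'COL4': [0, 1, 2, 3, 4, 2]})

grouped = df.groupby(['COL1', 'COL2'])[['COL3', 'COL4']]

# One median per column per group (two numbers for each group).
per_column = grouped.median()

# One median per group over all COL3 and COL4 values together.
pooled = grouped.apply(lambda g: np.median(g.to_numpy()))

print(per_column)
print(pooled)
```

For the (A, AA) group the per-column medians are 2.5 and 0.5, while the pooled median is 1.5, which is the value the question's desired output shows.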

donDrey
