Pandas: Weighted median of grouped observations

Question

I have a dataframe that contains number of observations per group of income:

INCAGG
1         6.561681e+08
3         9.712955e+08
5         1.658043e+09
7         1.710781e+09
9         2.356979e+09

I would like to compute the median income group. What do I mean? Let's start with a simpler series:

INCAGG
1          6
3          9
5         16
7         17
9         23

It represents this set of numbers:

1 1 1 1 1 1
3 3 3 3 3 3 3 3 3
5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 
7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7
9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9

Which I can reorder to

1 1 1 1 1 1 3 3 3 3 3 3 3 3 3 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 7 7 7 7 7
7 7 7 7 7 7 7 7 7 7 7 7 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9

which visually is what I mean - the median here would be 7.

It seems like you are ignoring the scientific notation...you see that 5 has a value in the right hand column almost twice as large as 3, right? — Brionius, Apr 16 '15 at 15:15
Well the median value of your incomes is the one at row 5 is what pandas tells me — EdChum, Apr 16 '15 at 15:16
Sorry folks, there might have been multiple sources of confusion; one of which: I didn't notice that groups 5 onwards were `e+09` instead of `e+08`. — FooBar, Apr 16 '15 at 15:46
Oh, second column are weights. You just want a weighted median. Maybe something like this http://stackoverflow.com/questions/20601872/numpy-or-scipy-to-calculate-weighted-median But in general, searching for "weighted" whatever should give a lot of results. This is really easy in some other stats packages (like stata), but I don't know if it is as easy in pandas. Maybe scipy or statsmodels has something? — JohnE, Apr 16 '15 at 15:55
The is a package for weighted quantiles apparently: ```wquantiles```, mentioned here: http://stackoverflow.com/questions/26102867/python-weighted-median-algorithm-with-pandas/26105703#26105703 — JohnE, Apr 16 '15 at 19:23

score 1 · Answer 1 · edited May 23 '17 at 12:22

After glancing at a numpy example here, I think cumsum() provides a good approach. Assuming your column of counts is called 'wt', here's a simple solution that will work most of the time (and see below for a more general solution):

df = df.sort('incagg')

df['tmp'] = df.wt.cumsum() < ( df.wt.sum() / 2. )

df['med_grp'] = (df.tmp==False) & (df.tmp.shift()==True)

The second code line above is dividing into rows above and below the median. The median observation will be in the first False group.

   incagg          wt    tmp med_grp
0       1   656168100   True   False
1       3   971295500   True   False
2       5  1658043000   True   False
3       7  1710781000  False    True
4       9  2356979000  False   False

df.ix[df.med_grp,'incagg']

3    7
Name: incagg, dtype: int64

This will work fine when the median is unique and often when it isn't. The problem can only occur if the median is non-unique AND it falls on the edge of a group. In this case (with 5 groups and weights in the millions/billions), it's really not a concern but nevertheless here's a more general solution:

df['tmp1']    = df.wt.cumsum() == (df.wt.sum() / 2.)
df['tmp2']    = df.wt.cumsum() < (df.wt.sum() / 2.)
df['med_grp'] = (df.tmp2==False) & (df.tmp2.shift()==True)
df['med_grp'] = df.med_grp | df.tmp1.shift()

   incagg  wt   tmp1   tmp2 med_grp
0       1   1  False   True   False
1       3   1  False   True   False
2       5   1   True  False    True
3       7   2  False  False    True
4       9   1  False  False   False

df.ix[df.med_grp,'incagg']
2    5
3    7

df.ix[df.med_grp,'incagg'].mean()
6.0

Alexander · Answer 2 · 2015-04-16T16:33:27.187

0

You can use chain from itertools. I used list comprehension to get a list of the aggregation group repeated the appropriate number of times, and then used chain to put it into a single list. Finally, I converted it to a Series and calculated the median:

from itertools import chain

df = pd.DataFrame([6, 9, 16, 17, 23], index=[1, 3, 5, 7, 9], columns=['counts'])

median = pd.Series([i for i in chain(*[[k] * v for k, v in zip(df.index, df.counts)])]).median()

>>> median
7.0

edited Apr 16 '15 at 16:33

answered Apr 16 '15 at 16:30

Alexander

105,104
32
201
196

1

Well, I have weights in the billions. – FooBar Apr 16 '15 at 17:03

Pandas: Weighted median of grouped observations

2 Answers2