3

I have a dataframe that contains number of observations per group of income:

INCAGG
1         6.561681e+08
3         9.712955e+08
5         1.658043e+09
7         1.710781e+09
9         2.356979e+09

I would like to compute the median income group. What do I mean? Let's start with a simpler series:

INCAGG
1          6
3          9
5         16
7         17
9         23

It represents this set of numbers:

1 1 1 1 1 1
3 3 3 3 3 3 3 3 3
5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 
7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7
9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9

Which I can reorder to

1 1 1 1 1 1 3 3 3 3 3 3 3 3 3 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 7 7 7 7 7
7 7 7 7 7 7 7 7 7 7 7 7 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9

which visually is what I mean - the median here would be 7.

FooBar
  • 15,724
  • 19
  • 82
  • 171
  • Sorry you want the index of the median value of INCAGG? – EdChum Apr 16 '15 at 14:49
  • 1
    It seems like you are ignoring the scientific notation...you see that 5 has a value in the right hand column almost twice as large as 3, right? – Brionius Apr 16 '15 at 15:15
  • Well the median value of your incomes is the one at row 5 is what pandas tells me – EdChum Apr 16 '15 at 15:16
  • Sorry folks, there might have been multiple sources of confusion; one of which: I didn't notice that groups 5 onwards were `e+09` instead of `e+08`. – FooBar Apr 16 '15 at 15:46
  • Oh, second column are weights. You just want a weighted median. Maybe something like this http://stackoverflow.com/questions/20601872/numpy-or-scipy-to-calculate-weighted-median But in general, searching for "weighted" whatever should give a lot of results. This is really easy in some other stats packages (like stata), but I don't know if it is as easy in pandas. Maybe scipy or statsmodels has something? – JohnE Apr 16 '15 at 15:55
  • The is a package for weighted quantiles apparently: ```wquantiles```, mentioned here: http://stackoverflow.com/questions/26102867/python-weighted-median-algorithm-with-pandas/26105703#26105703 – JohnE Apr 16 '15 at 19:23

2 Answers2

1

After glancing at a numpy example here, I think cumsum() provides a good approach. Assuming your column of counts is called 'wt', here's a simple solution that will work most of the time (and see below for a more general solution):

df = df.sort('incagg')

df['tmp'] = df.wt.cumsum() < ( df.wt.sum() / 2. )

df['med_grp'] = (df.tmp==False) & (df.tmp.shift()==True)

The second code line above is dividing into rows above and below the median. The median observation will be in the first False group.

   incagg          wt    tmp med_grp
0       1   656168100   True   False
1       3   971295500   True   False
2       5  1658043000   True   False
3       7  1710781000  False    True
4       9  2356979000  False   False

df.ix[df.med_grp,'incagg']

3    7
Name: incagg, dtype: int64

This will work fine when the median is unique and often when it isn't. The problem can only occur if the median is non-unique AND it falls on the edge of a group. In this case (with 5 groups and weights in the millions/billions), it's really not a concern but nevertheless here's a more general solution:

df['tmp1']    = df.wt.cumsum() == (df.wt.sum() / 2.)
df['tmp2']    = df.wt.cumsum() < (df.wt.sum() / 2.)
df['med_grp'] = (df.tmp2==False) & (df.tmp2.shift()==True)
df['med_grp'] = df.med_grp | df.tmp1.shift()

   incagg  wt   tmp1   tmp2 med_grp
0       1   1  False   True   False
1       3   1  False   True   False
2       5   1   True  False    True
3       7   2  False  False    True
4       9   1  False  False   False

df.ix[df.med_grp,'incagg']
2    5
3    7

df.ix[df.med_grp,'incagg'].mean()
6.0
Community
  • 1
  • 1
JohnE
  • 29,156
  • 8
  • 79
  • 109
0

You can use chain from itertools. I used list comprehension to get a list of the aggregation group repeated the appropriate number of times, and then used chain to put it into a single list. Finally, I converted it to a Series and calculated the median:

from itertools import chain

df = pd.DataFrame([6, 9, 16, 17, 23], index=[1, 3, 5, 7, 9], columns=['counts'])

median = pd.Series([i for i in chain(*[[k] * v for k, v in zip(df.index, df.counts)])]).median()

>>> median
7.0
Alexander
  • 105,104
  • 32
  • 201
  • 196