23

I have a dataframe that looks like this:

Out[14]:
    impwealth  indweight
16     180000     34.200
21     384000     37.800
26     342000     39.715
30    1154000     44.375
31     421300     44.375
32    1210000     45.295
33    1062500     45.295
34    1878000     46.653
35     876000     46.653
36     925000     53.476

I want to calculate the weighted median of the column impwealth using the frequency weights in indweight. My pseudocode looks like this:

# Sort `impwealth` in ascending order 
df.sort_values('impwealth', inplace=True)

# Find the 50th percentile weight, P
P = df['indweight'].sum() * (.5)

# Search for the first occurrence of `indweight` that is greater than P 
i = df.loc[df['indweight'] > P, 'indweight'].last_valid_index()

# The value of `impwealth` associated with this index will be the weighted median
w_median = df.loc[i, 'impwealth']

This method seems clunky, and I'm not sure it's correct. I didn't find a built-in way to do this in the pandas reference. What is the best way to go about finding the weighted median?

svenkatesh
  • 1,152
  • 2
  • 10
  • 25
  • Are you sure your pseudo code is correct? `df['indweight'].sum() * (.5)` will give a value of ~`219` which none of your `indweight` values exceed. Calling `df['indweight'].median()` gives 44.835 and `mean()` gives 43.783 – EdChum Sep 29 '14 at 15:11
  • I think so.. `df['indweight'].sum() * (.5)` should calculate the number of observations that fall under the 50th percentile in the data, since `indweight` is a frequency weight. So P isn't meant to be compared against individual `indweight` values like their mean or median. – svenkatesh Sep 29 '14 at 15:20
  • @svenkatesh, you need to use a ``.cumsum()`` of ``indweight``, not ``indweight`` itself. See my answer below, perhaps. – prooffreader Feb 11 '16 at 20:20

7 Answers

27

If you want to do this in pure pandas, here's a way. It does not interpolate, though. (@svenkatesh, you were missing the cumulative sum in your pseudocode.)

df.sort_values('impwealth', inplace=True)
cumsum = df.indweight.cumsum()
cutoff = df.indweight.sum() / 2.0
# first value whose cumulative weight reaches half the total weight
median = df.impwealth[cumsum >= cutoff].iloc[0]

This gives a median of 925000.
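
For reference, here's a self-contained version of the same approach on the data from the question (a sketch; the values are copied from the example frame above):

import pandas as pd

df = pd.DataFrame({
    'impwealth': [180000, 384000, 342000, 1154000, 421300,
                  1210000, 1062500, 1878000, 876000, 925000],
    'indweight': [34.200, 37.800, 39.715, 44.375, 44.375,
                  45.295, 45.295, 46.653, 46.653, 53.476],
})

df.sort_values('impwealth', inplace=True)
cumsum = df.indweight.cumsum()
cutoff = df.indweight.sum() / 2.0
print(df.impwealth[cumsum >= cutoff].iloc[0])  # 925000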

prooffreader
  • 2,333
  • 4
  • 21
  • 32
  • the sorting should be done on the ```indweight``` column... so first line should be ```df.sort_values('indweight', inplace=True)``` – Manuel F Aug 12 '22 at 11:22
  • @ManuelF, the original code is correct. Data has to be sorted, not weights. – xkudsraw Oct 19 '22 at 14:18
10

Have you tried the wquantiles package (installed as wquantiles, imported as weighted)? I had never used it before, but it has a weighted median function that seems to give at least a reasonable answer (you'll probably want to double-check that it's using the approach you expect).

In [12]: import weighted

In [13]: weighted.median(df['impwealth'], df['indweight'])
Out[13]: 914662.0859091772
chrisb
  • 49,833
  • 8
  • 70
  • 70
  • Personally, I'm a little wary of installing a package where a few lines of code will do, but if you need interpolated weighted medians, perhaps this is the best approach. – prooffreader Feb 11 '16 at 20:21
8

This function generalizes prooffreader's solution:

def weighted_median(df, val, weight):
    df_sorted = df.sort_values(val)
    cumsum = df_sorted[weight].cumsum()
    cutoff = df_sorted[weight].sum() / 2.
    return df_sorted[cumsum >= cutoff][val].iloc[0]

In this example it'd be weighted_median(df, 'impwealth', 'indweight').
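
As a quick check (a sketch, assuming df holds the example data from the question, e.g. as built in the snippet under the accepted answer):

weighted_median(df, 'impwealth', 'indweight')  # 925000, matching the accepted answer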

kadee
  • 8,067
  • 1
  • 39
  • 31
5

You can use this solution from Weighted percentile using numpy:

import numpy as np

def weighted_quantile(values, quantiles, sample_weight=None, 
                      values_sorted=False, old_style=False):
    """ Very close to numpy.percentile, but supports weights.
    NOTE: quantiles should be in [0, 1]!
    :param values: numpy.array with data
    :param quantiles: array-like with the quantiles needed
    :param sample_weight: array-like of the same length as `values`
    :param values_sorted: bool, if True, then will avoid sorting of
        initial array
    :param old_style: if True, will correct output to be consistent
        with numpy.percentile.
    :return: numpy.array with computed quantiles.
    """
    values = np.array(values)
    quantiles = np.array(quantiles)
    if sample_weight is None:
        sample_weight = np.ones(len(values))
    sample_weight = np.array(sample_weight)
    assert np.all(quantiles >= 0) and np.all(quantiles <= 1), \
        'quantiles should be in [0, 1]'

    if not values_sorted:
        sorter = np.argsort(values)
        values = values[sorter]
        sample_weight = sample_weight[sorter]

    weighted_quantiles = np.cumsum(sample_weight) - 0.5 * sample_weight
    if old_style:
        # To be convenient with numpy.percentile
        weighted_quantiles -= weighted_quantiles[0]
        weighted_quantiles /= weighted_quantiles[-1]
    else:
        weighted_quantiles /= np.sum(sample_weight)
    return np.interp(quantiles, weighted_quantiles, values)

Call it as weighted_quantile(df.impwealth, 0.5, sample_weight=df.indweight).

Max Ghenis
  • 14,783
  • 16
  • 84
  • 132
  • Do you mind explaining what the following line is doing: weighted_quantiles = np.cumsum(sample_weight) - 0.5 * sample_weight – Jake Jan 12 '20 at 23:38
  • Suppose we have values `[3, 10, 12]` with associated weights `[0.2, 0.5, 0.3]` (sorted from prior lines). `np.cumsum` would yield `[0.2, 0.7, 1.0]`, but those are actually the right edges of the associated quantiles. To center them, we subtract half the weight from each bucket, yielding `[0.1, 0.45, 0.85]`. That's then what we interpolate from to get the weighted quantiles. – Max Ghenis Jan 14 '20 at 05:47
  • Thanks so much. One more question (sorry if this is dumb) why do we want to center the quantiles? – Jake Jan 14 '20 at 07:34
  • Suppose you only have two values `(3, 4)`; should their respective quantiles be `(0, 0.5)`, `(0.5, 1)`, `(0.25, 0.75)`, or `(0, 1)`? The first two are problematic because they're asymmetric. The third is what this function does by default, and the fourth is what `numpy.percentile` does, which can be activated here with the `old_style=True` arg. An advantage to the default is that, if sampling from the quantiles, you have a nonzero chance of getting observed values, e.g. it'd be `3` for quantiles 0-0.25. However, that trapezoidal distribution can be less intuitive than `old_style` flat dist. – Max Ghenis Jan 18 '20 at 16:14
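
To make the centering step from the comments concrete, here's a minimal numeric check (a sketch using the example values from the comment thread):

import numpy as np

values = np.array([3, 10, 12])
weights = np.array([0.2, 0.5, 0.3])  # already sorted by value

right_edges = np.cumsum(weights)        # [0.2, 0.7, 1.0]
centered = right_edges - 0.5 * weights  # [0.1, 0.45, 0.85]

# the 0.5 point falls between 10 (at 0.45) and 12 (at 0.85)
print(np.interp(0.5, centered, values))  # 10.25

On the question's data, weighted_quantile(df.impwealth, 0.5, sample_weight=df.indweight) gives 914662.09, matching the wquantiles result above, since both use this centered formula.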
4

You can also use this function that I've written for the same purpose.

Note: the weighted module used above interpolates at the end to select the 0.5 quantile (you can look at its code yourself); the function below simply returns the value whose cumulative weight first bounds 0.5.

import numpy as np

def weighted_median(values, weights):
    '''Compute the weighted median of a list of values. The
    weighted median is computed as follows:
    1- sort both lists (values and weights) based on values.
    2- select the 0.5 point from the weights and return the
       corresponding value as the result.
    e.g. values = [1, 3, 0] and weights = [0.1, 0.3, 0.6],
    assuming the weights are probabilities.
    sorted values = [0, 1, 3] with corresponding sorted weights
    [0.6, 0.1, 0.3]; the 0.5 point on the weights corresponds to
    the first item, which is 0, so the weighted median is 0.'''

    #convert the weights into probabilities
    sum_weights = sum(weights)
    weights = np.array([(w*1.0)/sum_weights for w in weights])
    #sort values and weights based on values
    values = np.array(values)
    sorted_indices = np.argsort(values)
    values_sorted  = values[sorted_indices]
    weights_sorted = weights[sorted_indices]
    #select the median point
    it = np.nditer(weights_sorted, flags=['f_index'])
    accumulative_probability = 0
    median_index = -1
    while not it.finished:
        accumulative_probability += it[0]
        if accumulative_probability > 0.5:
            median_index = it.index
            return values_sorted[median_index]
        elif accumulative_probability == 0.5:
            median_index = it.index
            it.iternext()
            next_median_index = it.index
            return np.mean(values_sorted[[median_index, next_median_index]])
        it.iternext()

    return values_sorted[median_index]
# compare weighted_median function and np.median
print(weighted_median([1, 3, 0, 7], [2, 3, 3, 9]))
print(np.median([1, 1, 0, 0, 0, 3, 3, 3, 7, 7, 7, 7, 7, 7, 7, 7, 7]))
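
Run on the question's data (a sketch; values copied from the example frame), this returns 925000, agreeing with the non-interpolating accepted answer rather than the interpolated wquantiles figure:

impwealth = [180000, 384000, 342000, 1154000, 421300,
             1210000, 1062500, 1878000, 876000, 925000]
indweight = [34.200, 37.800, 39.715, 44.375, 44.375,
             45.295, 45.295, 46.653, 46.653, 53.476]
print(weighted_median(impwealth, indweight))  # 925000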
Ash
  • 3,428
  • 1
  • 34
  • 44
  • the weighted median function is very similar to the accepted answer if you look at the code but doesn't interpolate at the end. – Ash Oct 03 '15 at 10:23
2

You can also calculate a weighted median using the robustats library:

import numpy as np
import robustats # pip install robustats


# Weighted Median
x = np.array([1.1, 5.3, 3.7, 2.1, 7.0, 9.9])
weights = np.array([1.1, 0.4, 2.1, 3.5, 1.2, 0.8])

weighted_median = robustats.weighted_median(x, weights)

print("The weighted median is {}".format(weighted_median))
Matheus Torquato
  • 1,293
  • 18
  • 25
0

There is a weightedstats package, available through both conda and pip, which provides weighted_median().

Assuming you're using conda, from a terminal (Mac/Linux) or Anaconda prompt (Win):

conda activate YOURENVIRONMENT
conda install -c conda-forge -y weightedstats

(the -y means "don't ask me for confirmation of changes, just do it")

Then in your Python code:

import pandas as pd
import weightedstats as ws

df = pd.read_csv('/your/data/file.csv')
ws.weighted_median(df['values_col'], df['weights_col'])

I'm not sure whether it will work in all cases, but I just did a comparison for some simple data against the function weightedMedian() from the R package matrixStats, and I got the same result with both.


P.S.: Incidentally, with weightedstats you can calculate the weighted_mean() as well, although that's also possible with NumPy:

np.average(df['values_col'], weights=df['weights_col'])
Vicks
  • 141
  • 2
  • 8