
I would like to calculate the weighted median of each row of a pandas dataframe.

I found this nice function (https://stackoverflow.com/a/29677616/10588967), but I don't seem to be able to pass a 2d array.

def weighted_quantile(values, quantiles, sample_weight=None, values_sorted=False, old_style=False):
    """ Very close to numpy.percentile, but supports weights.
    NOTE: quantiles should be in [0, 1]!
    :param values: numpy.array with data
    :param quantiles: array-like with many quantiles needed
    :param sample_weight: array-like of the same length as `values`
    :param values_sorted: bool, if True, then will avoid sorting of initial array
    :param old_style: if True, will correct output to be consistent with numpy.percentile.
    :return: numpy.array with computed quantiles.
    """
    values = numpy.array(values)
    quantiles = numpy.array(quantiles)
    if sample_weight is None:
        sample_weight = numpy.ones(len(values))
    sample_weight = numpy.array(sample_weight)
    assert numpy.all(quantiles >= 0) and numpy.all(quantiles <= 1), 'quantiles should be in [0, 1]'

    if not values_sorted:
        sorter = numpy.argsort(values)
        values = values[sorter]
        sample_weight = sample_weight[sorter]

    weighted_quantiles = numpy.cumsum(sample_weight) - 0.5 * sample_weight
    if old_style:
        # To be consistent with numpy.percentile
        weighted_quantiles -= weighted_quantiles[0]
        weighted_quantiles /= weighted_quantiles[-1]
    else:
        weighted_quantiles /= numpy.sum(sample_weight)
    return numpy.interp(quantiles, weighted_quantiles, values)

Using the code from the link, the following works:

weighted_quantile([1, 2, 9, 3.2, 4], [0.0, 0.5, 1.])

However, this does not work:

values = numpy.random.randn(10,5)
quantiles = [0.0, 0.5, 1.]
sample_weight = numpy.random.randn(10,5)
weighted_quantile(values, quantiles, sample_weight)

I receive the following error:

weighted_quantiles = np.cumsum(sample_weight) - 0.5 * sample_weight

ValueError: operands could not be broadcast together with shapes (250,) (10,5,5)
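As far as I can tell, the root cause is that numpy.cumsum flattens its input when no axis is passed, so the cumulative weights lose their 2-D shape and can no longer broadcast against sample_weight (the exact shapes in my traceback differ from the snippet above, but the mismatch comes from the same flattening):

```python
import numpy as np

w = np.random.randn(10, 5)

# Without an axis, cumsum flattens the 2-D array into 1-D
print(np.cumsum(w).shape)          # (50,)

# With axis=1, the cumulative sum runs along each row and keeps the shape
print(np.cumsum(w, axis=1).shape)  # (10, 5)
```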

Question: Is it possible to apply this weighted quantile function in a vectorized manner to a dataframe, or can I only achieve this using .apply()?

Many thanks for your time!

MC_Doc

2 Answers

np.cumsum(sample_weight)

returns a flattened 1-D array, so you would need to reshape it to (10, 5, 5) using

np.cumsum(sample_weight).reshape(10, 5, 5)
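For the row-wise weighted median itself, the .apply() route the asker mentions also works with the 1-D function from the question left unchanged. A minimal sketch (using positive weights, since the cumulative-weight interpretation assumes them; the question's randn weights can be negative):

```python
import numpy as np
import pandas as pd

def weighted_quantile_1d(values, q, sample_weight):
    # Condensed 1-D version of the function from the question
    sorter = np.argsort(values)
    values = np.asarray(values)[sorter]
    sample_weight = np.asarray(sample_weight)[sorter]
    wq = np.cumsum(sample_weight) - 0.5 * sample_weight
    wq /= np.sum(sample_weight)
    return np.interp(q, wq, values)

df = pd.DataFrame(np.random.rand(10, 5))
weights = np.random.rand(10, 5)  # positive weights

# Weighted median of each row; row.name is the row's index label
medians = df.apply(
    lambda row: weighted_quantile_1d(row.values, 0.5, weights[row.name]),
    axis=1,
)
print(medians.shape)  # (10,)
```

This is not vectorized, but it avoids any reshaping and keeps the 1-D logic intact.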

Try the quantile function in my handy repo: https://github.com/syrte/handy/blob/773a1500a9e10dd28eb0704fded94d6105a84374/stats.py#L239

I copy the docstring here so you can see what it can do. Please go to the link for the complete function (which is pretty long...)

def quantile(a, weights=None, q=None, nsig=None, origin='middle',
             axis=None, keepdims=False, sorted=False, nmin=0,
             nanas=None, shape='stats'):
    '''Compute the quantile of the data.
    Be careful when q is very small or many numbers repeat in a.
    Parameters
    ----------
    a : array_like
        Input array.
    weights : array_like, optional
        Weighting of a.
    q : float or float array in range of [0,1], optional
        Quantile to compute. One of `q` and `nsig` must be specified.
    nsig : float, optional
        Quantile in unit of standard deviation.
        Ignored when `q` is given.
    origin : ['middle'| 'high'| 'low'], optional
        Control how to interpret `nsig` to `q`.
    axis : int, optional
        Axis along which the quantiles are computed. The default is to
        compute the quantiles of the flattened array.
    sorted : bool
        If True, the input array is assumed to be in increasing order.
    nmin : int or None
        Return `nan` when the tail probability is less than `nmin/a.size`.
        Set `nmin` if you want to make result more reliable.
        - nmin = None will turn off the check.
        - nmin = 0 will return NaN for q not in [0, 1].
        - nmin >= 3 is recommended for statistical use.
        It is *not* well defined when `weights` is given.
    nanas : None, float, 'ignore'
        - None : do nothing. Note default sorting puts `nan` after `inf`.
        - float : `nan`s will be replaced by given value.
        - 'ignore' : `nan`s will be excluded before any calculation.
    shape : 'data' | 'stats'
        Put which axes first in the result:
            'data' - the shape of data
            'stats' - the shape of `q` or `nsig`
        Only works for case where axis is not None.
    Returns
    -------
    quantile : scalar or ndarray
        The first axes of the result corresponds to the quantiles,
        the rest are the axes that remain after the reduction of `a`.
    See Also
    --------
    numpy.percentile
    conflevel
    Examples
    --------
    >>> np.random.seed(0)
    >>> x = np.random.randn(3, 100)
    >>> quantile(x, q=0.5)
    0.024654858649703838
    >>> quantile(x, nsig=0)
    0.024654858649703838
    >>> quantile(x, nsig=1)
    1.0161711040272021
    >>> quantile(x, nsig=[0, 1])
    array([ 0.02465486,  1.0161711 ])
    >>> quantile(np.abs(x), nsig=1, origin='low')
    1.024490097937702
    >>> quantile(-np.abs(x), nsig=1, origin='high')
    -1.0244900979377023
    >>> quantile(x, q=0.5, axis=1)
    array([ 0.09409612,  0.02465486, -0.07535884])
    >>> quantile(x, q=0.5, axis=1).shape
    (3,)
    >>> quantile(x, q=0.5, axis=1, keepdims=True).shape
    (3, 1)
    >>> quantile(x, q=[0.2, 0.8], axis=1).shape
    (2, 3)
    >>> quantile(x, q=[0.2, 0.8], axis=1, shape='stats').shape
    (3, 2)
    '''
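Since the full implementation is long, here is a minimal self-contained sketch (not the repo code) of what a vectorized row-wise weighted median can look like, assuming strictly positive weights: sort each row, build per-row cumulative weights with the same convention as the 1-D function in the question, then linearly interpolate at 0.5.

```python
import numpy as np

def weighted_median_rows(values, weights):
    """Weighted median of each row; assumes strictly positive weights."""
    values = np.asarray(values, dtype=float)
    weights = np.asarray(weights, dtype=float)

    # Sort each row and carry the weights along
    order = np.argsort(values, axis=1)
    v = np.take_along_axis(values, order, axis=1)
    w = np.take_along_axis(weights, order, axis=1)

    # Per-row cumulative weights, same convention as the 1-D function
    cw = (np.cumsum(w, axis=1) - 0.5 * w) / w.sum(axis=1, keepdims=True)

    # Index of the first cumulative weight >= 0.5 in each row
    j = np.clip((cw < 0.5).sum(axis=1), 1, v.shape[1] - 1)
    i = j - 1
    rows = np.arange(v.shape[0])

    # Linear interpolation between the bracketing points, clamped at the edges
    x0, x1 = cw[rows, i], cw[rows, j]
    denom = np.where(x1 > x0, x1 - x0, 1.0)
    t = np.clip((0.5 - x0) / denom, 0.0, 1.0)
    return v[rows, i] + t * (v[rows, j] - v[rows, i])

vals = np.random.rand(10, 5)
wts = np.random.rand(10, 5) + 0.1  # keep weights positive
print(weighted_median_rows(vals, wts).shape)  # (10,)
```

With uniform weights and an odd number of columns this reduces to np.median(vals, axis=1), which is a handy sanity check.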
Syrtis Major