116

numpy.average() has a weights option, but numpy.std() does not. Does anyone have suggestions for a workaround?

Asclepius
YGA
  • Btw, calculation of weighted std dev is actually a rather complex subject -- there's more than one way to do it. See here for a great discussion: https://www.stata.com/support/faqs/statistics/weights-and-summary-statistics/ – JohnE Nov 18 '17 at 17:09
  • http://www.ccgalberta.com/pygeostat/statistics.html#weighted-statistics – e271p314 Dec 23 '20 at 22:57
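
Following up on the comment above about there being more than one way: a minimal sketch (the variable names and data here are mine) contrasting the two common "unbiased" conventions, frequency weights versus reliability weights:

import numpy as np

values = np.array([2.0, 3.0, 5.0])
weights = np.array([2.0, 1.0, 3.0])

mean = np.average(values, weights=weights)
biased_var = np.average((values - mean)**2, weights=weights)

V1 = weights.sum()        # sum of the weights
V2 = (weights**2).sum()   # sum of the squared weights

# Frequency weights (each weight counts repeated observations):
freq_var = biased_var * V1 / (V1 - 1)
# Reliability weights (each weight reflects measurement precision):
rel_var = biased_var * V1**2 / (V1**2 - V2)

print(np.sqrt([biased_var, freq_var, rel_var]))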

7 Answers

183

How about the following short "manual calculation"?

import math
import numpy

def weighted_avg_and_std(values, weights):
    """
    Return the weighted average and standard deviation.

    values, weights -- NumPy ndarrays with the same shape.
    """
    average = numpy.average(values, weights=weights)
    # Fast and numerically precise:
    variance = numpy.average((values-average)**2, weights=weights)
    return (average, math.sqrt(variance))
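
For example, with some arbitrary data of my own choosing:

values = numpy.array([1.0, 2.0, 3.0])
weights = numpy.array([1.0, 1.0, 4.0])
print(weighted_avg_and_std(values, weights))
# (2.5, 0.7637626158259734), i.e. the std is sqrt(7/12)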

    
Eric O. Lebigot
  • Why not use `numpy.average` again for the variance? – user2357112 Aug 07 '13 at 01:26
  • Just wanted to point out that this will give the biased variance. For small sample sizes, you may want to re-scale the variance (before sqrt) to get the unbiased variance. See https://en.wikipedia.org/wiki/Weighted_variance#Weighted_sample_variance – Corey Mar 07 '14 at 05:17
  • Yeah, the unbiased variance estimator would be slightly different. This answer gives the standard deviation, since the question asks for a weighted version of `numpy.std()`. – Eric O. Lebigot Sep 12 '14 at 09:58
  • Thanks for this solution... but why do you use `math.sqrt` instead of `np.sqrt` at the end? – raphael Oct 10 '18 at 08:58
  • `np.sqrt()` would work, but because `variance` is a simple (NumPy) float (and not a NumPy array), `math.sqrt()` is more explicit and appropriate (and therefore in general faster, if this matters). – Eric O. Lebigot Oct 11 '18 at 10:29
  • I wanted to object to @EricO.Lebigot, but the difference is indeed staggering: `%timeit math.sqrt(1.5)` 61.6 ns ± 0.661 ns per loop, `%timeit np.sqrt(1.5)` 554 ns ± 14.5 ns per loop. `math.sqrt` is almost a factor of 10 faster (Python 3.10.9, NumPy v1.23.3). – olq_plo Mar 12 '23 at 08:45
54

There is a class in statsmodels that makes it easy to calculate weighted statistics: statsmodels.stats.weightstats.DescrStatsW.

Assuming this dataset and weights:

import numpy as np
from statsmodels.stats.weightstats import DescrStatsW

array = np.array([1,2,1,2,1,2,1,3])
weights = np.ones_like(array)
weights[3] = 100

You initialize the class (note that you have to pass in the correction factor, the delta degrees of freedom, at this point):

weighted_stats = DescrStatsW(array, weights=weights, ddof=0)

Then you can calculate:

  • .mean the weighted mean:

    >>> weighted_stats.mean      
    1.97196261682243
    
  • .std the weighted standard deviation:

    >>> weighted_stats.std       
    0.21434289609681711
    
  • .var the weighted variance:

    >>> weighted_stats.var       
    0.045942877107170932
    
  • .std_mean the standard error of weighted mean:

    >>> weighted_stats.std_mean  
    0.020818822467555047
    

    Just in case you're interested in the relation between the standard error and the standard deviation: The standard error is (for ddof == 0) calculated as the weighted standard deviation divided by the square root of the sum of the weights minus 1 (corresponding source for statsmodels version 0.9 on GitHub):

    standard_error = standard_deviation / sqrt(sum(weights) - 1)
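
    A quick check of this relation with the data above (this snippet is mine, not part of the original answer):

    >>> np.isclose(weighted_stats.std_mean,
    ...            weighted_stats.std / np.sqrt(weights.sum() - 1))
    True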
    
MSeifert
  • To use this approach to easily calculate the weighted coefficient of variation, see [this answer](https://stackoverflow.com/a/53748541/832230). – Asclepius Dec 12 '18 at 17:42
40

Here's one more option:

np.sqrt(np.cov(values, aweights=weights))
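
As the comments below note, `np.cov` defaults to the unbiased normalization (`ddof=1`); a small sketch (the example data are mine) of how to reproduce the accepted answer's plain weighted standard deviation instead:

import numpy as np

values = np.array([1.0, 2.0, 3.0])
weights = np.array([1.0, 1.0, 4.0])

# ddof=0 gives the biased weighted variance, matching the accepted answer:
print(np.sqrt(np.cov(values, aweights=weights, ddof=0)))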
Leo
  • This is the shortest and most elegant of the given solutions – Itamar Mushkin Dec 13 '22 at 11:39
    This is indeed clean. It is worth noting, though, that this solution is restricted to 1D weights and values and is therefore less general than the accepted answer. The two calculations agree, on 1D data, if `np.cov` is called with `ddof=0`. – Eric O. Lebigot Mar 12 '23 at 09:26
6

There doesn't appear to be such a function in numpy/scipy yet, but there is a ticket proposing this added functionality. Included there you will find Statistics.py, which implements weighted standard deviations.

Mad Physicist
unutbu
2

There is a very good example proposed by gaborous:

import pandas as pd
import numpy as np

# X is the dataset, as a Pandas DataFrame;
# weights is a 1D array with one weight per row of X.

# Compute the weighted sample mean (fast, efficient and precise):
mean = np.ma.average(X, axis=0, weights=weights)

# Convert to a Pandas Series (it's just aesthetic and more
# ergonomic; no difference in computed values):
mean = pd.Series(mean, index=list(X.keys()))

xm = X - mean  # xm = X diff to mean
# Fill NaN with 0: a variance of 0 is just void, but at least it
# keeps the other covariance values computed correctly.
xm = xm.fillna(0)

# Compute the unbiased weighted sample covariance:
sigma2 = 1. / (weights.sum() - 1) * xm.mul(weights, axis=0).T.dot(xm)

Correct equation for weighted unbiased sample covariance, URL (version: 2016-06-28)
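
A minimal, hypothetical way to drive the snippet above (the data are made up for illustration):

X = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0],
                  "b": [2.0, 1.0, 4.0, 3.0]})
weights = np.array([1.0, 1.0, 1.0, 2.0])

# After running the snippet, sigma2 is a 2x2 weighted covariance
# DataFrame; the weighted sample standard deviations are then:
# np.sqrt(np.diag(sigma2))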

Paidoo
abah
1

A follow-up on the "sample" or "unbiased" standard deviation in the "frequency weights" sense, since a Google search for "weighted sample standard deviation python" leads to this post:

from math import sqrt

def frequency_sample_std_dev(X, n):
    """
    Sample standard deviation for X and n,
    where X[i] is the quantity each person in group i has,
    and n[i] is the number of people in group i.
    See Equation 6.4 of:
    Montgomery, Douglas C., and George C. Runger. Applied Statistics
    and Probability for Engineers, Enhanced eText (7th Edition).
    Wiley Global Education US, 2018. Available from WileyPLUS.
    """
    n_people = sum(n)
    lhs_numerator = sum(ni*Xi**2 for Xi, ni in zip(X, n))
    rhs_numerator = sum(Xi*ni for Xi, ni in zip(X, n))**2 / n_people
    denominator = n_people - 1
    var = (lhs_numerator - rhs_numerator) / denominator
    return sqrt(var)

Or modifying the answer by @Eric as follows:

import numpy as np
from math import sqrt

def weighted_sample_avg_std(values, weights):
    """
    Return the weighted average and weighted sample standard deviation.

    values, weights -- NumPy ndarrays with the same shape.

    Assumes that weights contains only integers (e.g. how many samples are in each group).

    See also https://en.wikipedia.org/wiki/Weighted_arithmetic_mean#Frequency_weights
    """
    average = np.average(values, weights=weights)
    variance = np.average((values-average)**2, weights=weights)
    variance = variance*sum(weights)/(sum(weights)-1)
    return (average, sqrt(variance))

print(weighted_sample_avg_std(X, n))  # X and n as in the example below
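
For instance, with made-up grouped data (the numbers here are mine, not the answer's):

X = np.array([2.0, 3.0, 5.0])   # value held by each person in group i
n = np.array([10, 20, 30])      # number of people in group i

# The two implementations agree on the frequency-weighted sample std:
print(frequency_sample_std_dev(X, n))
print(weighted_sample_avg_std(X, n)[1])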
Sterling
    Thanks for this nice answer! However, for your second function `weighted_sample_avg_std()`, on the third line where you have the second part of the variance equation, the variance is not supposed to be multiplied by a ratio of the sums but by a ratio of the number of non-zeros weights (https://www.itl.nist.gov/div898/software/dataplot/refman2/ch2/weightsd.pdf). – DouglasCoenen Oct 15 '21 at 12:24
  • Hmm.. that's a good point. Would you mind suggesting an edit? I looked into this previously (but after you made the comment), but the actual change wasn't obvious to me. – Sterling Dec 09 '21 at 22:49
0

I was just searching for an API equivalent of the numpy np.std function that also allows the axis parameter to be set:

(I have only tested it with two dimensions, so feel free to suggest improvements if something is incorrect.)

import numpy as np

def std(values, weights=None, axis=None):
    """
    Return the weighted standard deviation.

    axis -- the axis along which the std is computed
    values, weights -- NumPy ndarrays with the same shape along the given axis.
    """
    average = np.average(values, weights=weights, axis=axis)
    if axis is not None:
        # Re-insert the reduced axis so the average broadcasts against values:
        average = np.expand_dims(average, axis=axis)
    # Fast and numerically precise:
    variance = np.average((values - average)**2, weights=weights, axis=axis)
    return np.sqrt(variance)

Thanks to Eric O Lebigot for the original answer.
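
A usage sketch with made-up 2-D data:

rng = np.random.default_rng(0)
values = rng.random((4, 3))
weights = rng.random((4, 3))

print(std(values, weights=weights, axis=0))  # one std per column
print(std(values, weights=weights))          # single std over all elements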

Jan