15

Update: Weighted samples are now supported by scipy.stats.gaussian_kde. See here and here for details.

It is currently not possible to use scipy.stats.gaussian_kde to estimate the density of a random variable based on weighted samples. What methods are available to estimate densities of continuous random variables based on weighted samples?

Till Hoffmann
  • 9,479
  • 6
  • 46
  • 64
  • 2
    As of scipy version 1.2, there appears to be a 'weights' option in sp.stats.gaussian_kde which seems to do what you are after now. – wmsmith Feb 04 '19 at 18:46

3 Answers3

26

Neither sklearn.neighbors.KernelDensity nor statsmodels.nonparametric seem to support weighted samples. I modified scipy.stats.gaussian_kde to allow for heterogeneous sampling weights and thought the results might be useful for others. An example is shown below.

example

An ipython notebook can be found here: http://nbviewer.ipython.org/gist/tillahoffmann/f844bce2ec264c1c8cb5

Implementation details

The weighted arithmetic mean is

weighted arithmetic mean

The unbiased data covariance matrix is then given by unbiased covariance matrix

The bandwidth can be chosen by scott or silverman rules as in scipy. However, the number of samples used to calculate the bandwidth is Kish's approximation for the effective sample size.

Till Hoffmann
  • 9,479
  • 6
  • 46
  • 64
  • 2
    Have you considered asking the `scipy` devs to integrate your code into `scipy` or `statsmodels`? – cel Dec 23 '14 at 17:02
  • 3
    Yes, but I have not yet gotten around to implementing resampling and integration. I will issue a pull request once that's done. – Till Hoffmann Dec 23 '14 at 17:13
  • I've been working on a similar problem, but using my own framework rather than modifying scipy. I hadn't thought to use Kish's approximation. Do you think it's the best bandwidth estimator? It reweights every point in the dataset with the same effective sample size. I wonder whether a variable bandwidth might make more sense. – Gabriel Apr 24 '15 at 11:11
  • to resample would it be correct to replace in the resample function randint with np.random.choice(xrange(0, self.n), p=self.weights/np.sum(self.weights), size=size) ? – stefano Jul 08 '16 at 18:02
  • Yes, that should be fine. However, you may be better of using `np.arange` rather than `xrange` because numpy often converts to arrays internally. But testing would be better than my speculation. – Till Hoffmann Jul 11 '16 at 14:58
  • Very nice code, it works quite well for me. I did find a bug when setting `bw_method` with a scalar: `NameError: name 'string_types' is not defined` – xyzzyqed Jul 06 '18 at 09:21
  • @xyzzyqed, yes, there is a `from six import string_types` missing. – Till Hoffmann Jul 06 '18 at 10:21
  • Note that as @Ramon noted below, `statsmodels.nonparametric.kde.KDEUnivariate` supports `weights` (now). – Lei Sep 19 '18 at 04:42
  • Now sklearn.neighbors.KernelDensity does support weighted samples, as per this [pull request](https://github.com/scikit-learn/scikit-learn/pull/10803) and the [manual](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KernelDensity.html#sklearn.neighbors.KernelDensity.fit). If this is true, should this comment be more visible as an edit on the answer? – toliveira Nov 29 '19 at 15:12
  • Incase anyone check this now, there is an implimentation on `scipy` now. Refer: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.gaussian_kde.html – S.Perera Sep 21 '21 at 18:37
2

For univariate distributions you can use KDEUnivariate from statsmodels. It is not well documented, but the fit methods accepts a weights argument. Then you cannot use FFT. Here is an example:

import matplotlib.pyplot as plt
from statsmodels.nonparametric.kde import KDEUnivariate

kde1= KDEUnivariate(np.array([10.,10.,10.,5.]))
kde1.fit(bw=0.5)
plt.plot(kde1.support, [kde1.evaluate(xi) for xi in kde1.support],'x-')

kde1= KDEUnivariate(np.array([10.,5.]))
kde1.fit(weights=np.array([3.,1.]), 
         bw=0.5,
         fft=False)
plt.plot(kde1.support, [kde1.evaluate(xi) for xi in kde1.support], 'o-')

which produces this figure: enter image description here

Ramon Crehuet
  • 3,679
  • 1
  • 22
  • 37
1

Check out the packages PyQT-Fit and statistics for Python. They seem to have kernel density estimation with weighted observations.

dikdirk
  • 11
  • 3