Is there a way to plot densities using data that has observation weights?

I have a vector of observations x and a vector of integer weights y, such that y_i indicates how many times the value x_i was observed. That is, the density of

   x    y 
   1    2
   2    2
   2    3 

is equal to the density of 1, 1, 2, 2, 2, 2, 2 (2x1, 5x2). As far as I understand it, matplotlib.pyplot.hist(weights=y) allows for observation weights when plotting a histogram. Is there an equivalent for computing and plotting the density?
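
To make the equivalence concrete, here is a minimal sketch, assuming NumPy is available. The bin edges are arbitrary and chosen only so that each distinct value falls in its own bin; `np.repeat` builds the expanded array that becomes expensive for large data.

```python
import numpy as np

# Observations and integer weights from the table above.
x = np.array([1, 2, 2])
y = np.array([2, 2, 3])

# Expanding the data: repeat each x[i] exactly y[i] times.
expanded = np.repeat(x, y)

# Illustrative bin edges (an assumption, not part of the question).
bins = [0.5, 1.5, 2.5]

# A weighted histogram of x gives the same counts as an
# unweighted histogram of the expanded data.
counts_weighted, _ = np.histogram(x, bins=bins, weights=y)
counts_expanded, _ = np.histogram(expanded, bins=bins)

print(expanded)
print(counts_weighted, counts_expanded)
```

What I am after is the same shortcut for a density estimate, without ever building `expanded`.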

I want the package to handle the weights directly because my data is very large, and I'm looking for a more efficient alternative to expanding the observations into one long array.

Alternatively, I'm open to other packages.

FooBar
  • You only need to generate the densities from the observations? – Reut Sharabani Nov 12 '14 at 22:32
  • 1
    Sorry for the confusion, I want to plot the densities as in http://stackoverflow.com/questions/4150171/how-to-create-a-density-plot-in-matplotlib – FooBar Nov 12 '14 at 22:36
  • So as I understand it, you only need to create a list that you call a `histogram` and send it to one of the packages suggested. Is your trouble creating that list from observations, or do you have a list and you're having trouble with the package? Or both? – Reut Sharabani Nov 12 '14 at 22:41
  • 1
    I say that I know functions that allow plotting histograms using observation weights. On the other hand, I'm not aware of functions that allow plotting densities using these weights. I bring the comparison given that densities are somewhat limit cases of histograms. I am not aware of being able to plot densities using histograms. – FooBar Nov 12 '14 at 22:44
  • Ahhh now I get it...! Sorry, can't help you too much there :) – Reut Sharabani Nov 12 '14 at 22:45
  • see the violin plot in mpl 1.4 and the KDE estimators from scipy. – tacaswell Nov 13 '14 at 15:16

1 Answer

Statsmodels' KDEUnivariate accepts weights in its fit method. See the output of the following code.

import matplotlib.pyplot as plt
import statsmodels.api as sm
import pandas as pd

# Two distinct observations; the second is counted four times, the first twice.
df = pd.DataFrame({'x': [1., 2.], 'weight': [2, 4]})
weighted = sm.nonparametric.KDEUnivariate(df.x)
noweight = sm.nonparametric.KDEUnivariate(df.x)

# Weights are passed to fit() together with fft=False.
weighted.fit(fft=False, weights=df.weight)
noweight.fit()

# Plot both densities side by side on a shared y-axis.
f, (ax1, ax2) = plt.subplots(1, 2, sharey=True)
ax1.plot(noweight.support, noweight.density)
ax2.plot(weighted.support, weighted.density)

ax1.set_title('No Weight')
ax2.set_title('Weighted')
plt.show()

Output: No Weight vs Weighted Densities

Note: this will probably not resolve your concern about the time spent creating large arrays, because the weighted fit above uses fft=False and, as noted in the source code:

If FFT is False, then a ‘number_of_obs’ x ‘gridsize’ intermediate array is created
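
To show where such an intermediate array comes from, here is a rough NumPy sketch of a weighted Gaussian KDE. It is not the statsmodels implementation: `weighted_gaussian_kde` is a hypothetical helper, and the bandwidth and grid are arbitrary values picked for illustration.

```python
import numpy as np

def weighted_gaussian_kde(x, weights, grid, bandwidth):
    """Evaluate a weighted Gaussian KDE on `grid` (hypothetical helper)."""
    x = np.asarray(x, dtype=float)
    w = np.asarray(weights, dtype=float)
    grid = np.asarray(grid, dtype=float)
    # Intermediate array of shape (number of observations, grid size):
    # one row per observation, one column per grid point.
    z = (x[:, None] - grid[None, :]) / bandwidth
    kernel = np.exp(-0.5 * z ** 2) / np.sqrt(2.0 * np.pi)
    # Weighted average of the per-observation kernels.
    return w @ kernel / (w.sum() * bandwidth)

x = np.array([1.0, 2.0])
w = np.array([2, 5])
grid = np.linspace(-2.0, 5.0, 200)
h = 0.5  # bandwidth, chosen arbitrarily for this sketch

# Weighted estimate on the distinct values only.
dens_weighted = weighted_gaussian_kde(x, w, grid, h)

# Unweighted estimate on the fully expanded data, for comparison.
x_expanded = np.repeat(x, w)
dens_expanded = weighted_gaussian_kde(x_expanded, np.ones(w.sum()), grid, h)

print(np.allclose(dens_weighted, dens_expanded))
```

In this sketch the intermediate array has one row per distinct value when weights are used, rather than one row per raw observation, so the weighted form is cheaper whenever the data contains many repeats.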

tozCSS
  • Use `ax1.plot(noweight.support, noweight.density)` to have correct x-axis values. Also, note that the weights need to be a numpy array (or a column in pandas), or the code will complain that it cannot call `weights.sum()` – fuyas May 30 '18 at 11:42