
I have a pandas Series and I'd like to fit a density to its histogram. Question: is there a slick way to use the values from np.histogram() to achieve this? (See the Update below.)

My current problem is that the kde fit I perform has (seemingly) unwanted kinks, as depicted in the second plot below. I was hoping for a kde fit that is monotonically decreasing, matching the histogram shown in the first figure. My current code is included below. Thanks in advance.

import numpy as np
from matplotlib import pyplot as plt
from scipy.stats import gaussian_kde as kde

df[var].hist()
plt.show()  # shows the original histogram
density = kde(df[var])
xs = np.arange(0, df[var].max(), 0.1)
ys = density(xs)
plt.plot(xs, ys)  # a pdf with kinks
plt.show()

Alternatively, is there a slick way to use

count, div = np.histogram(df[var])

and then scale the count array to apply kde() to it?
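One possible sketch of that idea (it assumes SciPy >= 1.2, whose gaussian_kde accepts a weights argument, and reuses the imports above):

# Rough sketch only: feed the histogram's bin centers to the kde,
# weighting each center by its count (weights needs SciPy >= 1.2).
count, div = np.histogram(df[var], bins=100)
centers = 0.5 * (div[:-1] + div[1:])   # midpoints of the bins
density = kde(centers, weights=count)  # kde over weighted bin centers
xs = np.arange(0, df[var].max(), 0.1)
plt.plot(xs, density(xs))
plt.show()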

original histogram

kde_fit

Update

Based on cel's comment below (should've been obvious, but I missed it!), I was implicitly under-binning here by using the default params in pandas.DataFrame.hist(). In the updated plot I used

df[var].hist(bins=100)

I'll leave this post up in case others find it useful but won't mind if it gets taken down as 'too localized' etc.

updated histogram with bins=100
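For anyone who wants the histogram and the kde on the same axes, here is a minimal sketch (it assumes df[var] and the imports above, plus Matplotlib 2.1+ for density=True):

# Density-normalised histogram (100 bins) with the kde curve overlaid,
# so both are on the same scale.
plt.hist(df[var], bins=100, density=True, alpha=0.5)
density = kde(df[var])
xs = np.arange(0, df[var].max(), 0.1)
plt.plot(xs, density(xs))
plt.show()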

Quetzalcoatl
  • It's a bad idea to use a histogram with very few bins to approximate the density. Try using more bins and you will probably see that your density looks more like the KDE estimate... – cel Mar 16 '15 at 17:48
  • Thank you; feel free to post as an answer if you'd like me to 'formally' accept it – Quetzalcoatl Mar 16 '15 at 18:03
  • This was just a minor issue. I think it's a good way to understand the benefits and limitations of the histogram approach, though: the number of bins is critical, and too low or too high a number of bins can give a distorted view of the data. Heads up, now comes the fun part of data analysis: understanding what process could have created such a fancy distribution :) – cel Mar 16 '15 at 18:19

2 Answers


If you increase the bandwidth using the bw_method parameter, the kde will look smoother. This example comes from Justin Peel's answer; the code has been modified to take advantage of bw_method:

import matplotlib.pyplot as plt
import numpy as np
from scipy.stats import gaussian_kde

data = [1.5]*7 + [2.5]*2 + [3.5]*8 + [4.5]*3 + [5.5]*1 + [6.5]*8
density1 = gaussian_kde(data)
bandwidth = 1.5
density2 = gaussian_kde(data, bw_method=bandwidth)
xs = np.linspace(0, 8, 200)
plt.plot(xs, density1(xs), label='bw_method=None')
plt.plot(xs, density2(xs), label='bw_method={}'.format(bandwidth))
plt.legend(loc='best')
plt.show()

yields

plot comparing bw_method=None with bw_method=1.5

unutbu
  • In my opinion, in this case increasing the bandwidth is a very bad suggestion. See how many counts are in the peak. This is clearly not caused by random noise... Increasing the bandwidth is to some extent similar to displaying the data with a histogram with a very low number of bins. You are basically cheating :) – cel Mar 16 '15 at 18:09
  • Thank you, though in my specific case smoothing an under-binned sample would've further hidden my basic problem; see cel's comment above and my update. I'll certainly keep bandwidth in mind for future tasks – Quetzalcoatl Mar 16 '15 at 18:10
  • @cel: I take it you disapprove of [spherical cows](http://en.wikipedia.org/wiki/Spherical_cow) :) – unutbu Mar 16 '15 at 18:51

The problem was under-binning, as cel mentioned in the comments above. Setting bins=100 in pd.DataFrame.hist() (which defaults to bins=10) made this clear.

See also: http://en.wikipedia.org/wiki/Histogram#Number_of_bins_and_width
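If you'd rather not hard-code a bin count at all, here is a small sketch (it assumes df[var] from the question and NumPy >= 1.11 for the 'auto' rule) that lets numpy pick the bin edges:

import numpy as np
from matplotlib import pyplot as plt

# Let numpy's 'auto' rule choose the bin edges, then reuse them for the plot.
counts, edges = np.histogram(df[var], bins='auto')
df[var].hist(bins=edges)
plt.show()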

Quetzalcoatl