Python: Overlap between two functions (PDF of kde and normal)

Question

Short summary: Im trying to figure out how to calculate overlap between two functions. One is a gaussian, the other is a kernel density, based on data. Then, I would like to make a small algorithm that choose the mean and variance for the gaussian, which maximises overlap

Firstly, needed imports:

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats.kde import gaussian_kde
import scipy

I have some data which is approximately normal (somewhat heavy right-tail). I calculate the kernel density, cdf and pdf of this data (in this example, the data is drawn from a uniform, as i can't supply the real data) like this:

def survivalFunction():

    data = np.random.normal(7,1,100) #Random data 

    p = sns.kdeplot(data, shade=False, lw = 3)
    x,y = p.get_lines()[0].get_data()
    cdf = scipy.integrate.cumtrapz(y, x, initial=0)

    plt.hist(data,50,normed = 1,facecolor='b',alpha = 0.3)

Then i have another function, which is just a simple gaussian:

def surpriseFunction(mu,variance):

    hStates = np.linspace(0,20,100)
    sigma = math.sqrt(variance)

    plt.plot(hStates,scipy.stats.norm.pdf(hStates, mu, sigma))

calling the functions

surpriseFunction(5,1)
survivalFunction()

gives this plot

enter image description here

As you might have noticed, exchanging different values of mu, moves around the normal to overlap more or less with the kernel estimation. Now, my question is twofold:

1) How to I calculate the overlap between the two functions?

2) How would I go about making a small algorithm, that choose a mean and variance for the gaussian, in such a way, as to maximise this overlap?

will · Accepted Answer · 2015-03-25T15:06:41.093

Okay, so i did a fairly major reshuffle, i think it separates out the major parts and will make it easy to make modular / in various functions. The orriginal code for the previous answer i gave is here.

Here's the new stuff, hopefully it's pretty self explanatory.

# Setup our various global variables
population_mean = 7
population_std_dev = 1
samples = 100
histogram_bins = 50

# And setup our figure.
from matplotlib import pyplot
fig = pyplot.figure()
ax = fig.add_subplot(1,1,1)


from numpy.random import normal  
hist_data = normal(population_mean, population_std_dev, samples)
ax.hist(hist_data, bins=histogram_bins, normed=True, color="blue", alpha=0.3)


from statsmodels.nonparametric.kde import KDEUnivariate
kde = KDEUnivariate(hist_data)
kde.fit()

#kde.supprt and kde.density hold the x and y values of the KDE fit.
ax.plot(kde.support, kde.density, color="red", lw=4)


#Gaussian function - though you can replace this with something of your choosing later.
from numpy import sqrt, exp, pi
r2pi = sqrt(2*pi)
def gaussian(x, mu, sigma):
  return exp(-0.5 * ( (x-mu) / sigma)**2) / (sigma * r2pi)

#interpolation of KDE to produce a function.
from scipy.interpolate import interp1d
kde_func = interp1d(kde.support, kde.density, kind="cubic", fill_value=0)

What you want to do is just standard curve fitting - there are numerous ways to do it, and you say you want to fit the curve by maximizing the overlap of the two functions (why?). the curve_fir scipy routine is a least squares fit, which is trying to minimize the difference between the two functions - the difference is subtle: maximizing the overlap does not punish the fitting function for being larger than the data, whereas curve_fit does.

I've included solutions using both techniques, as well as profiled them:

#We need to *maximise* the overlap integral
from scipy.integrate import quad as integrate
def overlap(func1, func2, limits, func1_args=[], func2_args=[]):

  def product_func(x):
    return min(func1(x, *func1_args),func2(x, *func2_args))

  return integrate(product_func, *limits)[0] # we only care about the absolute result for now.

limits = hist_data.min(), hist_data.max()
def gaussian_overlap(args):
  mu, sigma = args
  return -overlap(kde_func, gaussian, limits, func2_args=[mu, sigma])

And now the two different methods, the overlap metric:

import cProfile, pstats, StringIO
pr1 = cProfile.Profile()
pr1.enable()

from scipy.optimize import fmin_powell as minimize
mu_overlap_fit, sigma_overlap_fit = minimize(gaussian_overlap, (population_mean, population_std_dev))

pr1.disable()
s = StringIO.StringIO()
sortby = 'cumulative'
ps = pstats.Stats(pr1, stream=s).sort_stats(sortby)
ps.print_stats()
print s.getvalue()



   3122462 function calls in 6.298 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    6.298    6.298 C:\Python27\lib\site-packages\scipy\optimize\optimize.py:2120(fmin_powell)
        1    0.000    0.000    6.298    6.298 C:\Python27\lib\site-packages\scipy\optimize\optimize.py:2237(_minimize_powell)
       57    0.000    0.000    6.296    0.110 C:\Python27\lib\site-packages\scipy\optimize\optimize.py:279(function_wrapper)
       57    0.000    0.000    6.296    0.110 C:\Users\Will\Documents\Python_scripts\hist_fit.py:47(gaussian_overlap)
       57    0.000    0.000    6.296    0.110 C:\Users\Will\Documents\Python_scripts\hist_fit.py:39(overlap)
       57    0.000    0.000    6.296    0.110 C:\Python27\lib\site-packages\scipy\integrate\quadpack.py:42(quad)
       57    0.000    0.000    6.295    0.110 C:\Python27\lib\site-packages\scipy\integrate\quadpack.py:327(_quad)
       57    0.069    0.001    6.295    0.110 {scipy.integrate._quadpack._qagse}
    66423    0.154    0.000    6.226    0.000 C:\Users\Will\Documents\Python_scripts\hist_fit.py:41(product_func)
        4    0.000    0.000    6.167    1.542 C:\Python27\lib\site-packages\scipy\optimize\optimize.py:2107(_linesearch_powell)
        4    0.000    0.000    6.166    1.542 C:\Python27\lib\site-packages\scipy\optimize\optimize.py:1830(brent)
        4    0.000    0.000    6.166    1.542 C:\Python27\lib\site-packages\scipy\optimize\optimize.py:1887(_minimize_scalar_brent)
        4    0.001    0.000    6.166    1.542 C:\Python27\lib\site-packages\scipy\optimize\optimize.py:1717(optimize)

and the scipy method curve_fit:

pr2 = cProfile.Profile()
pr2.enable()

from scipy.optimize import curve_fit
(mu_curve_fit, sigma_curve_fit), _ = curve_fit(gaussian, kde.support, kde.density, p0=(population_mean, population_std_dev))

pr2.disable()
s = StringIO.StringIO()
sortby = 'cumulative'
ps = pstats.Stats(pr2, stream=s).sort_stats(sortby)
ps.print_stats()
print s.getvalue()




   122 function calls in 0.001 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.001    0.001 C:\Python27\lib\site-packages\scipy\optimize\minpack.py:452(curve_fit)
        1    0.000    0.000    0.001    0.001 C:\Python27\lib\site-packages\scipy\optimize\minpack.py:256(leastsq)
        1    0.000    0.000    0.001    0.001 {scipy.optimize._minpack._lmdif}
       19    0.000    0.000    0.001    0.000 C:\Python27\lib\site-packages\scipy\optimize\minpack.py:444(_general_function)
       19    0.000    0.000    0.000    0.000 C:\Users\Will\Documents\Python_scripts\hist_fit.py:29(gaussian)
        1    0.000    0.000    0.000    0.000 C:\Python27\lib\site-packages\scipy\linalg\basic.py:314(inv)
        1    0.000    0.000    0.000    0.000 C:\Python27\lib\site-packages\scipy\optimize\minpack.py:18(_check_func)

You can see the curve_fit method is much faster, and the results:

from numpy import linspace
xs = linspace(-1, 1, num=1000) * sigma_overlap_fit * 6 + mu_overlap_fit
ax.plot(xs, gaussian(xs, mu_overlap_fit, sigma_overlap_fit), color="orange", lw=2)

xs = linspace(-1, 1, num=1000) * sigma_curve_fit * 6 + mu_curve_fit
ax.plot(xs, gaussian(xs, mu_curve_fit, sigma_curve_fit), color="purple", lw=2)

pyplot.show()

enter image description here

are very similar. I would recommend curve_fit. In this case it's 6000x faster. The difference is a little bit more when the underlying data is more complex, but not by much, and you still get a huge speed up. Here's an example for 6 uniformly distributed normal distributions being fit:

enter image description here

Go with curve_fit!

Hey Will, and thanks for the answer. It does sort of work. I get a bunch of accuracy errors `/Library/Enthought/Canopy_64bit/User/lib/python2.7/site-packages/scipy/integrate/quadrature.py:699: AccuracyWarning: divmax (10) exceeded.` and something does not work when plotting. I agree; it does not look nice with all those globals. Normally I quite like plotting in functions, what do you find problematic about this, and whats the alternative? Also, being new to Python coding (coming from matlab) is there any alternative to defining globals in a case like this? — tmo, Mar 25 '15 at 09:07
I should add, that the optimisation does work. Its very slow though. I would like to make this as clean as possible. Also with plotting. Any ideas? — tmo, Mar 25 '15 at 09:14
I'll have a look at refactoring it, it can definitely be improved and made nicer. — will, Mar 25 '15 at 11:31
important question - does the fitting metric have to be that the overlap integral is maximized? Or would fitting a least squares gaussian fit be enough? I suspect that that integral is the slow part. — will, Mar 25 '15 at 11:44
**1)** Thanks a bunch, its much appreciated. A minimum working example would be fine, as I'm planning to implement the method on several functions (gaussian just being a easy first). **2)** Yes. It is crucial that it is the overlap that is maximised. I plan to generalise this, so I can fit different functions to the kernel PDF and evaluate performance (e.g. how much overlap is there?). **3)** What I have a hard time understanding right now, is the general use of `f(x)` in the code. Seems overly complicated with functions in functions? — tmo, Mar 25 '15 at 12:04
Ah the overlap is important, i'll make it do an overlap integral instead then. — will, Mar 25 '15 at 13:55
1. I think you should be able to extract what you need from what I've provided. 2. Why exactly do you need to evaluate the overlap? If you can define your fitting curve with variables you can use curve fit to obtain the variables 3. That was just to make partial functions that can easily be input into the integration routine. — will, Mar 26 '15 at 02:12
Hey Will, and thanks for the time you've put into this. I will look into the curve_fir func. It sounds like a better solution than just maximising overlap. — tmo, Mar 26 '15 at 09:32

Python: Overlap between two functions (PDF of kde and normal)

1 Answers1

Linked