I want to fit a histogram by the sum of two gaussians, both with different amplitude, mean and deviation. To do that, I have used scipy's curve_fit, but the KS-test afterwards was awful. That was mostly because the first few (as in the most negativ x values) values were not very accurate, and therefor the cumulative function was way off. I also noted that the cumulative function was off by 20%, and therefor an accurate outcome of the KS-test is impossible.
Then I tried to make a constraint to the integrand, following this question. The relevant code I got is the following (without importing and plotting part):
def residuals(p, x,y):
integral = quad( gauss2, -300, 300, args= (p[0],p[1],p[2],p[3],p[4],p[5]))[0]
penalization = abs(1-integral)*10000
print penalization
return y - gauss2(x, p[0],p[1],p[2],p[3],p[4],p[5] ) - penalization
def gauss2(x,A, mu, sigma, A2, mu2, sigma2):
if A2<0:
return 1000
return A*np.exp(-(x-mu)**2/(2.*sigma**2))+ A2*np.exp(-(x-mu2)**2/(2.*sigma2**2))
hist, bin_edges = np.histogram(data, normed=True, bins=bins)
hist_cm=np.cumsum(hist)
bin_centres = (bin_edges[:-1] + bin_edges[1:])/2
coeff, pcov2 = leastsq(residuals, x0=(0.01,0.,60.,0.01,150.,40.) ,args=(bin_centres, hist)
hist_fit = gauss2(bin_centres, *coeff)
hist_fit_cm=np.cumsum(hist_fit)
KStest= stats.ks_2samp(hist_cm,hist_fit_cm)
This results in a pretty good estimate, and a P-value of 0.629. As far as I know, this means that the histogram and the fit have a 62.9% change of coming from the same data, is this correct?
Now I thought that I could improve the answer by not penalising for the integrand, but for the P-value. For this I changed the def residuals
with the following:
def residuals(p, x,y):
global bin_centres #its global defined, so should be good
iets = np.cumsum(gauss2(bin_centres,p[0],p[1],p[2],p[3],p[4],p[5]))
pizza=stats.ks_2samp(np.cumsum(y),iets)[1]
penalization = 1000*(1-pizza)
return y - gauss2(x, p[0],p[1],p[2],p[3],p[4],p[5] ) - penalization
Since the P-value (which I call pizza
) should reach as close to 1 as possible, the penalization
becomes smaller with a higher P-value. But this gives results which make less sense: the P-value turns out to be 0.160. When plotting it's even worse: two spikes, instead of the smooth fit I obtained with the first method.
Is a KS-test a good penalisation method, instead of the integrand? How can implement it in a good way then?