
I want to fit a histogram with the sum of two Gaussians, each with its own amplitude, mean and standard deviation. To do that I used scipy's curve_fit, but the KS-test afterwards was awful. That was mostly because the first few values (i.e. the most negative x values) were not very accurate, and therefore the cumulative function was way off. I also noticed that the cumulative function was off by 20%, so an accurate outcome of the KS-test is impossible.
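The exact curve_fit call is not shown here; a minimal sketch of what it could have looked like, using the gauss2 function and the bin variables defined further down (the starting guess p0 is an assumption, not the original code):

from scipy.optimize import curve_fit

# Hypothetical reconstruction of the curve_fit attempt described above;
# gauss2, bin_centres and hist are defined in the code below.
coeff0, pcov0 = curve_fit(gauss2, bin_centres, hist,
                          p0=(0.01, 0., 60., 0.01, 150., 40.))
hist_fit0 = gauss2(bin_centres, *coeff0)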

Then I tried to add a constraint on the integral, following this question. The relevant code I got is the following (plotting part omitted):

import numpy as np
from scipy import stats
from scipy.integrate import quad
from scipy.optimize import leastsq

def residuals(p, x, y):
    # Penalize deviations of the mixture's total integral from 1
    integral = quad(gauss2, -300, 300, args=(p[0], p[1], p[2], p[3], p[4], p[5]))[0]
    penalization = abs(1 - integral) * 10000
    print(penalization)
    return y - gauss2(x, p[0], p[1], p[2], p[3], p[4], p[5]) - penalization

def gauss2(x, A, mu, sigma, A2, mu2, sigma2):
    if A2 < 0:  # crude penalty to keep the second amplitude positive
        return 1000
    return A*np.exp(-(x-mu)**2/(2.*sigma**2)) + A2*np.exp(-(x-mu2)**2/(2.*sigma2**2))

# data and bins are defined elsewhere; in newer numpy use density=True instead of normed=True
hist, bin_edges = np.histogram(data, normed=True, bins=bins)
hist_cm = np.cumsum(hist)
bin_centres = (bin_edges[:-1] + bin_edges[1:]) / 2
coeff, pcov2 = leastsq(residuals, x0=(0.01, 0., 60., 0.01, 150., 40.), args=(bin_centres, hist))
hist_fit = gauss2(bin_centres, *coeff)
hist_fit_cm = np.cumsum(hist_fit)
KStest = stats.ks_2samp(hist_cm, hist_fit_cm)

This results in a pretty good estimate, and a P-value of 0.629. As far as I know, this means that the histogram and the fit have a 62.9% chance of coming from the same data. Is this correct?

Now I thought I could improve the result by penalizing not on the integral, but on the P-value. For this I replaced the residuals function with the following:

def residuals(p, x, y):
    global bin_centres  # defined globally, so this should be fine
    iets = np.cumsum(gauss2(bin_centres, p[0], p[1], p[2], p[3], p[4], p[5]))
    pizza = stats.ks_2samp(np.cumsum(y), iets)[1]  # KS p-value
    penalization = 1000 * (1 - pizza)
    return y - gauss2(x, p[0], p[1], p[2], p[3], p[4], p[5]) - penalization

Since the P-value (which I call pizza) should get as close to 1 as possible, the penalization becomes smaller as the P-value increases. But this gives results that make less sense: the P-value turns out to be 0.160. The plot is even worse: two spikes instead of the smooth fit I obtained with the first method.

Is a KS-test a good penalization method, instead of the integral? If so, how can I implement it properly?

Mathias711

1 Answer


(A brief answer, as far as I understand from reading the code.)

The first penalization, penalization = abs(1-integral)*10000, is a constraint on the total integral. I think this amounts to imposing that the mixture in gauss2 integrates to one (for normalized component densities this would be A + A2 == 1). An alternative without constraints would be to impose this directly by, for example, using a logit function for the mixing probability.
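One way to read that suggestion (a sketch, not code from the question): write the mixture with normalized component densities and a single mixing weight w = sigmoid(theta), so the density integrates to one by construction and no penalty term is needed.

import numpy as np
from scipy.stats import norm

# Sketch: unconstrained parameterization via a logit/sigmoid mixing weight.
# The parameter vector p = (theta, mu, sigma, mu2, sigma2) is illustrative.
def gauss2_mix(x, theta, mu, sigma, mu2, sigma2):
    w = 1.0 / (1.0 + np.exp(-theta))   # mixing probability in (0, 1)
    return w * norm.pdf(x, mu, sigma) + (1 - w) * norm.pdf(x, mu2, sigma2)

def residuals_mix(p, x, y):
    # No integral penalty needed: the density integrates to one by construction.
    return y - gauss2_mix(x, *p)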

The Kolmogorov-Smirnov penalization uses the supremum distance: it penalizes the largest deviation between the empirical and the parametric cdf, approximately (*)

ks_stat = np.max(np.abs(np.cumsum(y) - iets))

The p-value is just a monotonic transformation of this distance, but it has a different curvature and will therefore penalize differently.

(*) The actual calculation looks at all the step points directly.
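If the goal is to penalize the fit by this distance rather than by the p-value, a sketch along those lines (the weight 100 and the function name are arbitrary choices; gauss2 is the question's mixture) could be:

# Sketch: penalize the maximum cdf deviation (the KS statistic) directly,
# instead of 1000*(1 - pvalue) as in the question's second residuals function.
def residuals_ks(p, x, y):
    model = gauss2(x, *p)
    ks_stat = np.max(np.abs(np.cumsum(y) - np.cumsum(model)))
    return y - model - 100 * ks_stat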

As an aside: the Kolmogorov-Smirnov test is designed for continuous, not for discrete or binned, variables. The appropriate distance measure would be based on a chi-square test or a power divergence. However, this only affects ks_2samp as a hypothesis test, not its use purely as a distance measure.
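A sketch of what such a binned comparison could look like, assuming raw counts are available (the rescaling line keeps the observed and expected totals equal, which the chi-square statistic implicitly assumes; the p-value ignores the degrees of freedom used up by the fit):

from scipy import stats

# Sketch: compare observed bin counts with expected counts from the fitted
# mixture using a power-divergence statistic (lambda_=1 is Pearson chi-square).
# gauss2, coeff, bin_centres, data and bins come from the question's code.
counts, edges = np.histogram(data, bins=bins)
expected = gauss2(bin_centres, *coeff) * np.diff(edges) * counts.sum()
expected *= counts.sum() / expected.sum()      # match totals exactly
chi2_stat, p_value = stats.power_divergence(counts, expected, lambda_=1)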

Another aside: the use of integrate.quad could be replaced by using norm.cdf directly.
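For example, a sketch of that replacement (each unnormalized Gaussian term integrates to A*sigma*sqrt(2*pi) times a difference of normal cdfs):

from scipy.stats import norm

# Sketch: integral of the unnormalized two-Gaussian mixture over [a, b]
# via norm.cdf instead of numerical quadrature with integrate.quad.
def gauss2_integral(a, b, A, mu, sigma, A2, mu2, sigma2):
    return (A * sigma * np.sqrt(2 * np.pi)
            * (norm.cdf(b, mu, sigma) - norm.cdf(a, mu, sigma))
            + A2 * sigma2 * np.sqrt(2 * np.pi)
            * (norm.cdf(b, mu2, sigma2) - norm.cdf(a, mu2, sigma2)))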

Josef