
I have a histogram of sorted random numbers and a Gaussian overlay. The histogram represents the observed values per bin (this is a base case that I will apply to a much larger dataset), and the Gaussian is an attempt to fit the data. Clearly, this Gaussian does not represent the best fit to the histogram. The code below defines the Gaussian:

import numpy as np

normc, mu, sigma = 30.845, 50.5, 7  # normalization constant, mean, standard deviation
gauss = lambda x: normc * np.exp(-(x - mu)**2 / (2 * sigma**2))
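To get the expectation value per bin, I integrate this curve over each bin. A minimal sketch of that step (assuming scipy is available, and using hypothetical bin_edges standing in for my histogram's actual bins) looks like this:

from scipy.integrate import quad

# hypothetical bin edges for the histogram (20 equal-width bins as an example)
bin_edges = np.linspace(20, 80, 21)

# expected value per bin = area under gauss between the bin's lower and upper edge
expvals = [quad(gauss, lo, hi)[0] for lo, hi in zip(bin_edges[:-1], bin_edges[1:])]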

I calculated the expectation value per bin (the area under the curve over the bin) and the number of observed values per bin. There are several methods to find the 'best' fit; I am concerned with the best fit obtained by minimizing Chi-Squared. In this formula for Chi-Squared, the expectation value is the area under the curve per bin and the observed value is the number of sorted data values falling in that bin. So I want to vary normc, mu, and sigma near their given values to find the combination that produces the smallest Chi-Squared, as these are the parameters I can then plug into the code above to overlay the best-fit Gaussian on my histogram. I am trying to use the scipy module to minimize my Chi-Squared as done in this example. Since I need to vary the parameters, I will use the function gauss (defined above) to plot the Gaussian overlay, and will define a new function to find the minimum Chi-Squared.

def gaussmin(var, data):
    # var[0] = normc
    # var[1] = mu
    # var[2] = sigma
    # data is the sorted random numbers, i.e. the unbinned observed values
    return var[0] * np.exp(-(np.asarray(data) - var[1])**2 / (2 * var[2]**2))
    # I realize this returns one Gaussian value per element of data rather than
    # a single scalar Chi-Squared -- any guidelines on how to fix that?

After this, I am stuck. How can I vary the parameters to find the normc, mu, and sigma that produce the best fit? My last attempt at a solution is below:

var = [normc, mu, sigma]
result = opt.minimize(chi2, var)  # assumes scipy.optimize is imported as opt
# chi2 is the chi-squared value obtained via scipy
# chisquare takes input (a, b),
# where a is the number of occurrences per bin and b is the expected value per bin
# b is dependent upon normc, mu, sigma
print(result)
# data is a list; can I keep it constant and only vary the parameters in var?
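To check that I understand the call shape: minimize needs a function that maps the parameter vector to a single scalar, and fixed inputs like my data can apparently be passed through the args keyword. A toy sketch (not my chi-squared, just the signature) would be:

import numpy as np
import scipy.optimize as opt

def toy_objective(var, data):
    # var is the parameter vector being varied; data is held fixed via args
    return np.sum((np.asarray(data) - var[0])**2)

demo = opt.minimize(toy_objective, x0=[0.0], args=([1.0, 2.0, 3.0],))
# demo.x is approximately [2.0], the mean of the fixed data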

There are plenty of examples online for minimizing simple scalar functions, but I cannot find any for functions of several varying parameters.

PS - I can post my full code so far but it's a bit lengthy. If you would like to see it, just ask and I can post it here or provide a Google Drive link.

  • 1) A normal has only 2 parameters (mu, sigma) and the integration constant (`normc`) is `1 / sqrt(2 * PI * (sigma ** 2))`, so your `normc` is a) unnecessary and b) wrong. 2) What is `gaussmin`? It is not related to the `((o - e) ** 2) / e` terms whose sum you want to minimize. 3) Your first job is to calculate the `e` for each bin. Let `PHI(x)` be the cumulative distribution function of the normal distribution. Then `e = (PHI(xu) - PHI(xl)) * N` where `xu` and `xl` are the upper / lower bounds of the bin and `N` is the total number of observations. Now derive the function to be minimized. – Stefan Zobel Feb 26 '17 at 16:47
  • I made a few changes to the code to fix errors, such as the gaussmin from your second suggestion. Also, I have e per bin via scipy integration of the gaussian (lambda) function of x over bin bounds in my histogram. I was hoping to generalize my code in such a way that normalizing would be optional. If I understand you correctly, you are saying that I should normalize my distribution, compare the sum(expectation values per bin) from my unnormalized data to the sum(expectation values per bin) from my normalized data, and accordingly rescale my normalization constant. Is this correct? –  Feb 26 '17 at 20:24
  • The chi-square function you want to minimize is `Sum over all ((oi - ei) ** 2) / ei` where `i` is the bin index and `oi` is the number of observations in bin `i`. `ei` is a function of `xu` and `xl` for bin `i` and of `mu` and `sigma` via `PHI(x | mu, sigma)`. Throw that at a nonlinear optimization routine and pray that it finds a solution that obeys boundary conditions (especially, `sigma > 0`). – Stefan Zobel Feb 26 '17 at 21:52
  • To be more explicit: You are comparing distances of **counts** per bin (either observed or expected). So, no "normalizing" or "rescaling" is going on. The only "rescaling" that is involved is the multiplication by `N` in the computation of `ei` to get an expected **count** from the probability that the normal assigns to the bin under consideration. – Stefan Zobel Feb 26 '17 at 22:08
  • Thanks for the suggestions. Since I am not 100% sure of the method you describe, I went a different route. Using an initial guess for each varying parameter, I create a linspace from parameter - num to parameter + num. Then I created a function that outputs a list of all combinations of the varying parameters; this output is fed into a different function that integrates the expectation value per bin for each combination of varied parameters. Then, I use chisquare via scipy (much more efficient than hard-coding, thanks Jacobian) to generate a list of chi-squared values, the minimum of which represents ... –  Feb 27 '17 at 10:12
  • ... my optimized parameters, obtainable via their index. It works for a simple model of random numbers. All that is left is to adapt the code to my datasets. This solves my issue because the integrations per bin (via for loop) convert my variable function to a scalar function. –  Feb 27 '17 at 10:13
  • I believe we are talking at cross-purposes. Your [chi-square](https://i.stack.imgur.com/o7kGN.gif) objective function **is** a scalar function taking only `mu` and `sigma` as parameters. You could access everything else (bin boundaries and observed counts) from inside that function. There's simply no need to vary the parameters yourself - the minimization routine does that for you. – Stefan Zobel Feb 27 '17 at 13:56
  • If I understand you correctly: `from scipy.optimize import minimize`, `from scipy.stats import chisquare`, `from scipy.integrate import quad`. I can get the observed values per bin. I need to perform an integral over a function (say, G(x) is the Gaussian function above) to obtain the expected value per bin. `result = minimize( chisquare( obsvals, expvals ) )` I don't think that would work. The input expvals are dependent upon integration of the Gaussian function. What am I missing? –  Feb 27 '17 at 14:09
  • Disclaimer: I don't know Python or scipy. I think you are getting closer. `chisquare( obsvals, expvals )` would yield the return value of the function that you pass to `minimize`. Before that, your function has to calculate `expvals` from the `[mu, sigma]` parameters using an integration routine. The function that gets passed to minimize has only `[mu, sigma]` as parameters. That's the basic setup. – Stefan Zobel Feb 27 '17 at 14:14
  • I take this to mean that the input expvals must be the output of a different function that integrates over the different combinations of parameters and spits them out after each iteration. Yes? –  Feb 27 '17 at 14:17
  • In pseudo-code: `def f(mu, sigma): #1: compute expvals from (mu, sigma) then #2: compute and return chisquare(obsvals, expvals)`. Then call `minimize` passing `f` and reasonable initial values for `mu`and `sigma`. – Stefan Zobel Feb 27 '17 at 14:29
  • I reformulated the question here after making progress --> https://stackoverflow.com/questions/42490281/how-do-i-pass-through-arguments-to-other-functions-generally-and-via-scipy –  Feb 27 '17 at 16:03
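Putting the pseudo-code from the comments above into concrete form, a minimal sketch of that setup (with made-up obs_counts and bin_edges standing in for the real binned data, and the expected counts computed from the normal CDF as described) might look like:

import numpy as np
from scipy.stats import norm
from scipy.optimize import minimize

# made-up binned data: observed counts per bin and the bin edges
obs_counts = np.array([2, 8, 25, 40, 25, 8, 2])
bin_edges = np.linspace(29, 71, 8)
N = obs_counts.sum()

def chi2_objective(params):
    mu, sigma = params
    # expected count per bin: (PHI(upper) - PHI(lower)) * N
    cdf = norm.cdf(bin_edges, loc=mu, scale=sigma)
    expvals = (cdf[1:] - cdf[:-1]) * N
    # chi-squared: sum over bins of ((observed - expected)**2 / expected)
    return np.sum((obs_counts - expvals)**2 / expvals)

# keep sigma bounded away from zero so the optimizer respects sigma > 0
result = minimize(chi2_objective, x0=[50.0, 7.0], method='L-BFGS-B',
                  bounds=[(None, None), (1e-6, None)])
mu_fit, sigma_fit = result.x

If sigma is driven toward zero, the expected counts in the tail bins go to zero and the objective blows up, which is one reason for the lower bound on sigma (the boundary-condition point raised in the comments).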

1 Answer


A Gaussian distribution is completely characterized by its mean and variance (or std deviation). Under the hypothesis that your data are normally distributed, the best fit will be obtained by using x-bar as the mean and s-squared as the variance. But before doing so, I'd check whether normality is plausible using, e.g., a q-q plot.
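A minimal sketch of that check and fit (using a stand-in sample where you would use your own raw, unbinned values):

import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt

data = np.random.normal(loc=50.5, scale=7, size=1000)  # stand-in for the raw values

# under normality, the best-fit parameters are the sample mean and std deviation
mu_hat = np.mean(data)
sigma_hat = np.std(data, ddof=1)

# q-q plot against the normal distribution to check whether normality is plausible
stats.probplot(data, dist="norm", plot=plt)
plt.show()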

pjs
  • As I understand it, the OP wants to derive the estimators for `mu` and `sigma` using [minimum chi-squared estimation](http://projecteuclid.org/download/pdf_1/euclid.aos/1176345003). A pretty old technique that was discussed even before Fisher came up with MLE. – Stefan Zobel Feb 26 '17 at 17:07
  • @StefanZobel what's the point, when it's well established that x-bar and s-squared are functions of the joint sufficient statistics of a normal, and are the MLE's? – pjs Feb 26 '17 at 17:14
  • Good question. I also don't think it's appropriate. But there were (are?) statisticians who argue(d) that minimum chi-square estimators can have somehow "better" properties in the finite sample case. See the Berkson paper I've linked to. – Stefan Zobel Feb 26 '17 at 17:22
  • @StefanZobel From the abstract: "Minimum chi-squared-sub-lambda yields the same estimating equations as MLE." They seem to be arguing for chi-square estimators where MLEs don't exist or aren't functions of the sufficient statistics. But to me, this means use MLE for the normal. – pjs Feb 26 '17 at 17:39
  • The "sub-lambda" estimator is not the one the OP want's to use here. But again, I don't disagree with you. Just wanted to give the OP a hint in which direction he'd have to go. His problem may very well be a XY problem. – Stefan Zobel Feb 26 '17 at 17:52
  • @StefanZobel You're right, I should read the whole paper rather than just the abstract. Too much grading to do first, though... Anyhow, thanks for the input. – pjs Feb 26 '17 at 17:55