My question is quick, but I have provided a hefty piece of code to better illustrate my problem since I have not understood the answer from reading related posts.
The code below is to pick optimized parameters that are part of an args list. The args list should be a single entry like x0 on scipy docs. I am hoping to find the right combination of args to fit the data the best. The scipy optimize modules are supposed to fluctuate the values of my args to find the combination that minimizes my error the greatest. But, I am having trouble passing args from one function to another.
Sometimes I put a *
or **
but my success rate is more miss than hit. I want to know how to pass args from one function to another while allowing them to change value so as to find their optimized value. (The optimized value reduces error, explained below). I have a few functions that serve as inputs in other functions and am missing a key concept here. Are kwargs necessary for something like this? If the args are a tuple, can they still change value to find optimized parameters? I'm aware that somewhat similar questions have been asked here on SO but I haven't been able to figure it with these resources yet.
The code is explained below (after imports).
import numpy as np
import random
import matplotlib.pyplot as plt
from math import exp
from math import log
from math import pi
from scipy.integrate import quad ## integrate f(x)dx from x_i to x_i+1
from scipy.stats import norm
from scipy.stats import chisquare
from scipy.optimize import basinhopping
from scipy.stats import binned_statistic as bstat
I generated a random Gaussian distribution sample of 1000 data points, with the average mu = 48 and the standard deviation sigma = 7. I can histogram the data, and my goal is to find the parameters mu, sigma, and normc (scaling factor or normalization constant) that produce the best fit to a histogram of the sample data. There are many error analysis methods but for my purpose, the best fit is determined as the fit that minimizes Chi-Square (described a little further below). I know the code is long (too long even), but my question requires a bit of setup.
## generate data sample
a, b = 48, 7 ## mu, sigma
randg = []
for index in range( 1000 ):
randg.append( random.gauss(a,b) )
data = sorted( randg )
small = min( data )
big = max( data )
domain = np.linspace(small,big,3000) ## for fitted plot overlay on histogram of data
I then organized my bins for the histogram.
numbins = 30 ## number of bins
def binbounder( small , big , numbins ):
## generates list of bound bins for histogram ++ bincount
binwide = ( big - small ) / numbins ## binwidth
binleft = [] ## left edges of bins
for index in range( numbins ):
binleft.append( small + index * binwide )
binbound = [val for val in binleft]
binbound.append( big ) ## all bin edges
return binbound
binborders = binbounder( small , big , numbins )
## useful if one performs plt.hist(data, bins = binborders, ...)
def binmidder( small , big , numbins ):
## all midtpts of bins
## for x-ticks on histogram
## useful to visualize over/under -estimate of error
binbound = binbounder( small , big , numbins )
binwide = ( big - small ) / numbins
binmiddles = []
for index in range( len( binbound ) - 1 ):
binmiddles.append( binwide/2 + index * binwide )
return binmiddles
binmids = binmidder( small , big , numbins )
To perform Chi-Square analysis, one must input the expectation values per bin (E_i) and multiplicities of observed values per bin (O_i) and output the sum over all the bins of the square of their difference over the expectation value per bin.
def countsperbin( xdata , args = [ small , big , numbins ]):
## calculates multiplicity of observed values per bin
binborders = binbounder( small , big , numbins )
binmids = binmidder( small , big , numbins )
values = sorted( xdata ) ## function(xdata) ~ f(x)
bincount = []
for jndex in range( len( binborders ) ):
if jndex != len( binborders ) - 1:
summ = 0
for val in values:
if val > binborders[ jndex ] and val <= binborders[ jndex + 1 ]:
summ += 1
bincount.append( summ )
if jndex == len( binborders ) - 1:
pass
return bincount
obsperbin = countsperbin( binborders , data ) ## multiplicity of observed values per bin
Each expectation value per bin, which is needed to calculate and minimize Chi Squared, is defined as the integral of the distribution function from x_i = left binedge to x_i+1 = right binedge.
I want a reasonable initial guess for my optimized parameters, as these will give me a reasonable guess for a minimized Chi Squared. I choose mu, sigma, and normc to be close to but not equal to their true values so that I can test if the minimization worked.
def maxbin( perbin ):
## perbin is a list of observed data per bin
## returns largest multiplicity of observed values with index
## useful to help guess scaling factor "normc" (outside exponential in GaussDistrib)
for index, maxval in enumerate( perbin ):
if maxval == max( perbin ):
optindex = index
return optindex, perbin[ optindex ]
mu, sigma, normc = np.mean( data ) + 30, np.std( data ) + 20, maxbin( obsperbin )
Since we are integrating f(x)dx, the data points (or xdata) is irrelevant here.
def GaussDistrib( xdata , args = [ mu , sigma , normc ] ): ## G(x)
return normc * exp( (-1) * (xdata - mu)**2 / (2 * sigma**2) )
def expectperbin( args ):
## calculates expectation values per bin
## needed with observation values per bin for ChiSquared
## expectation value of single bin is equal to area under Gaussian curve from left binedge to right binedge
## area under curve for ith bin = integral G(x)dx from x_i (left edge) to x_i+1 (right edge)
ans = []
for index in range(len(binborders)-1): # ith index does not exist for rightmost boundary
ans.append( quad( GaussDistrib , binborders[ index ] , binborders[ index + 1 ], args = [ mu , sigma , normc ])[0])
return ans
My defined function chisq
calls chisquare
from the scipy module to return a result.
def chisq( args ):
## args[0] = mu
## args[1] = sigma
## args[2] = normc
## last subscript [0] gives chi-squared value, [1] gives 0 ≤ p-value ≤ 1
## can also minimize negative p-value to find best fitting chi square
return chisquare( obsperbin , expectperbin( args[0] , args[1] , args[2] ))[0]
I do not know how but I would like to place constraints on my system. Specifically, the max of the list of heights of the binned data must be greater than zero (as must Chi Square due to the exponential term that remains after differentiating).
def miniz( chisq , chisqguess , niter = 200 ):
minimizer = basinhopping( chisq , chisqguess , niter = 200 )
## Minimization methods available via https://docs.scipy.org/doc/scipy-0.18.1/reference/optimize.html
return minimizer
expperbin = expectperbin( args = [mu , sigma , normc] )
# chisqmin = chisquare( obsperbin , expperbin )[0]
# chisqmin = result.fun
""" OPTIMIZATION """
print("")
print("initial guess of optimal parameters")
initial_mu, initial_sigma, initial_normc = np.mean(data)+30 , np.std(data)+20 , maxbin( (obsperbin) )
## check optimized result against: mu = 48, sigma = 7 (via random number generator for Gaussian Distribution)
chisqguess = chisquare( obsperbin , expectperbin( args[0] , args[1] , args[2] ))[0]
## initial guess for optimization
result = miniz( chisqguess, args = [mu, sigma, normc] )
print(result)
print("")
The point of the minimization was to find the optimized parameters that give the best fit.
optmu , optsigma , optnormc = result.x[0], abs(result.x[1]), result.x[2]
chisqcheck = chisquare(obsperbin, expperbin)
chisqmin = result.fun
print("chisqmin -- ",chisqmin," ",chisqcheck," -- check chi sq")
print("""
""")
## CHECK
checkbins = bstat(xdata, xdata, statistic = 'sum', bins = binborders) ## via SCIPY (imports)
binsum = checkbins[0]
binedge = checkbins[1]
binborderindex = checkbins[2]
print("binsum",binsum)
print("")
print("binedge",binedge)
print("")
print("binborderindex",binborderindex)
# Am I doing this part right?
tl;dr: I want result
, which calls the function minimiz
, which calls a scipy module to minimize Chi Squared using a guess value. Chi Squared and the guess value each call other functions, etc. How can I pass my args through the right way?