How can I pass arguments through a chain of nested functions to calculate a result?

Question

My question is quick, but I have provided a hefty piece of code to better illustrate my problem since I have not understood the answer from reading related posts.

The code below is to pick optimized parameters that are part of an args list. The args list should be a single entry like x0 on scipy docs. I am hoping to find the right combination of args to fit the data the best. The scipy optimize modules are supposed to fluctuate the values of my args to find the combination that minimizes my error the greatest. But, I am having trouble passing args from one function to another.

Sometimes I put a * or ** but my success rate is more miss than hit. I want to know how to pass args from one function to another while allowing them to change value so as to find their optimized value. (The optimized value reduces error, explained below). I have a few functions that serve as inputs in other functions and am missing a key concept here. Are kwargs necessary for something like this? If the args are a tuple, can they still change value to find optimized parameters? I'm aware that somewhat similar questions have been asked here on SO but I haven't been able to figure it with these resources yet.

The code is explained below (after imports).

import numpy as np
import random
import matplotlib.pyplot as plt
from math import exp
from math import log
from math import pi
from scipy.integrate import quad ## integrate f(x)dx from x_i to x_i+1
from scipy.stats import norm
from scipy.stats import chisquare
from scipy.optimize import basinhopping
from scipy.stats import binned_statistic as bstat

I generated a random Gaussian distribution sample of 1000 data points, with the average mu = 48 and the standard deviation sigma = 7. I can histogram the data, and my goal is to find the parameters mu, sigma, and normc (scaling factor or normalization constant) that produce the best fit to a histogram of the sample data. There are many error analysis methods but for my purpose, the best fit is determined as the fit that minimizes Chi-Square (described a little further below). I know the code is long (too long even), but my question requires a bit of setup.

## generate data sample
a, b = 48, 7 ## mu, sigma
randg = []
for index in range( 1000 ):
    randg.append( random.gauss(a,b) )
data = sorted( randg )

small = min( data )
big = max( data )
domain = np.linspace(small,big,3000) ## for fitted plot overlay on histogram of data

I then organized my bins for the histogram.

numbins = 30 ## number of bins

def binbounder( small , big , numbins ):
    ## generates list of bound bins for histogram ++ bincount
    binwide = ( big - small ) / numbins ## binwidth
    binleft = [] ## left edges of bins
    for index in range( numbins ):
        binleft.append( small + index * binwide )
    binbound = [val for val in binleft]
    binbound.append( big ) ## all bin edges
    return binbound

binborders = binbounder( small , big , numbins )
## useful if one performs plt.hist(data, bins = binborders, ...)

def binmidder( small , big , numbins ):
    ## all midtpts of bins
    ## for x-ticks on histogram
    ## useful to visualize over/under -estimate of error
    binbound = binbounder( small , big , numbins )
    binwide = ( big - small ) / numbins
    binmiddles = []
    for index in range( len( binbound ) - 1 ):
        binmiddles.append( binwide/2 + index * binwide )
    return binmiddles

binmids = binmidder( small , big , numbins )

To perform Chi-Square analysis, one must input the expectation values per bin (E_i) and multiplicities of observed values per bin (O_i) and output the sum over all the bins of the square of their difference over the expectation value per bin.

def countsperbin( xdata , args = [ small , big , numbins ]):
    ## calculates multiplicity of observed values per bin
    binborders = binbounder( small , big , numbins )
    binmids = binmidder( small , big , numbins )
    values = sorted( xdata ) ## function(xdata) ~ f(x)
    bincount = []
    for jndex in range( len( binborders ) ):
        if jndex != len( binborders ) - 1:
            summ = 0
            for val in values:
                if val > binborders[ jndex ] and val <= binborders[ jndex + 1 ]:
                    summ += 1
            bincount.append( summ )
        if jndex == len( binborders ) - 1:
            pass
    return bincount

obsperbin = countsperbin( binborders , data ) ## multiplicity of observed values per bin

Each expectation value per bin, which is needed to calculate and minimize Chi Squared, is defined as the integral of the distribution function from x_i = left binedge to x_i+1 = right binedge.

I want a reasonable initial guess for my optimized parameters, as these will give me a reasonable guess for a minimized Chi Squared. I choose mu, sigma, and normc to be close to but not equal to their true values so that I can test if the minimization worked.

def maxbin( perbin ):
    ## perbin is a list of observed data per bin
    ## returns largest multiplicity of observed values with index
    ## useful to help guess scaling factor "normc" (outside exponential in GaussDistrib)
    for index, maxval in enumerate( perbin ):
        if maxval == max( perbin ):
            optindex = index
    return optindex, perbin[ optindex ] 

mu, sigma, normc = np.mean( data ) + 30, np.std( data ) + 20, maxbin( obsperbin )

Since we are integrating f(x)dx, the data points (or xdata) is irrelevant here.

def GaussDistrib( xdata , args = [ mu , sigma , normc ] ): ## G(x)
    return normc * exp( (-1) * (xdata - mu)**2 / (2 * sigma**2) )

def expectperbin( args ):
    ## calculates expectation values per bin
    ## needed with observation values per bin for ChiSquared
    ## expectation value of single bin is equal to area under Gaussian curve from left binedge to right binedge
    ## area under curve for ith bin = integral G(x)dx from x_i (left edge) to x_i+1 (right edge)
    ans = []
    for index in range(len(binborders)-1): # ith index does not exist for rightmost boundary
        ans.append( quad( GaussDistrib , binborders[ index ] , binborders[ index + 1 ], args = [ mu , sigma , normc ])[0])
    return ans

My defined function chisq calls chisquarefrom the scipy module to return a result.

def chisq( args ):
    ## args[0] = mu
    ## args[1] = sigma
    ## args[2] = normc
    ## last subscript [0] gives chi-squared value, [1] gives 0 ≤ p-value ≤ 1
    ## can also minimize negative p-value to find best fitting chi square
    return chisquare( obsperbin , expectperbin( args[0] , args[1] , args[2] ))[0]

I do not know how but I would like to place constraints on my system. Specifically, the max of the list of heights of the binned data must be greater than zero (as must Chi Square due to the exponential term that remains after differentiating).

def miniz( chisq , chisqguess , niter = 200 ):
    minimizer = basinhopping( chisq , chisqguess , niter = 200 )
    ## Minimization methods available via https://docs.scipy.org/doc/scipy-0.18.1/reference/optimize.html
    return minimizer

expperbin = expectperbin( args = [mu , sigma , normc] )
# chisqmin = chisquare( obsperbin , expperbin )[0]
# chisqmin = result.fun

""" OPTIMIZATION """

print("")
print("initial guess of optimal parameters")

initial_mu, initial_sigma, initial_normc = np.mean(data)+30 , np.std(data)+20 , maxbin( (obsperbin) )
## check optimized result against:  mu = 48, sigma = 7 (via random number generator for Gaussian Distribution)

chisqguess = chisquare( obsperbin , expectperbin( args[0] , args[1] , args[2] ))[0]
## initial guess for optimization

result = miniz( chisqguess, args = [mu, sigma, normc] )
print(result)
print("")

The point of the minimization was to find the optimized parameters that give the best fit.

optmu , optsigma , optnormc = result.x[0], abs(result.x[1]), result.x[2]

chisqcheck = chisquare(obsperbin, expperbin)
chisqmin = result.fun
print("chisqmin --  ",chisqmin,"        ",chisqcheck," --   check chi sq")

print("""
""")

## CHECK
checkbins = bstat(xdata, xdata, statistic = 'sum', bins = binborders) ## via SCIPY (imports)
binsum = checkbins[0]
binedge = checkbins[1]
binborderindex = checkbins[2]
print("binsum",binsum)
print("")
print("binedge",binedge)
print("")
print("binborderindex",binborderindex)
# Am I doing this part right?

tl;dr: I want result, which calls the function minimiz, which calls a scipy module to minimize Chi Squared using a guess value. Chi Squared and the guess value each call other functions, etc. How can I pass my args through the right way?

Just a quick point, `args = (mu, sigma, normc])`. `args` is usually a tuple, not a list. It does `args1 = (x,)+args`, `your_function(*args1)`. — hpaulj, Mar 03 '17 at 03:56
If I put the parameters to be optimized in a tuple, can they still be varied? I thought entries in tuples could not be changed? Also, does this mean I should put a tuple of sub-tuples, each sub-tuple containing a possible combination of (mu,sigma,normc) to check chi-square values against? — , Mar 03 '17 at 03:58
`args` contains the current constants; optimization is over the 1st argument. `args` values can be varied in some outer loop. — hpaulj, Mar 03 '17 at 04:19
Is there a way to optimize over 3 arguments or is this better thrown in some type of iterator of optimized values? And how can I pass them to an outer loop? — , Mar 03 '17 at 04:25
I'm aware of some by scipy but I have trouble getting them to work properly. — , Mar 03 '17 at 04:52

score 3 · Accepted Answer · edited May 23 '17 at 12:18

3

You can access all information from the OptimizeResult that is returned from optimize.basinhopping.

I've abstracted away the generation of the random sample and reduced the number of your functions to those 5 functions that are really needed to run the optimization.

The only "tricky" part in parameter passing is to pass the parameters mu and sigma to the GaussDistrib function inside the quad call, but that is readily explained in the quad doc. Other than that, I fail to see a real problem with parameter passing here.

Your prolonged use of normc is misguided. You don't get correct values from the Gaussian that way (and there is no need to vary 3 independent parameters when 2 are sufficient). Also, to get the correct values for chi square, you have to multiply the probabilities from the Gaussian with the sample count (you are comparing absolute counts from the obsperbin bins with probabilities under the Gaussian - which is clearly wrong).

from math import exp
from math import pi
from scipy.integrate import quad
from scipy.stats import chisquare
from scipy.optimize import basinhopping


# smallest value in the sample
small = 26.55312337811099
# largest value in the sample
big   = 69.02965763016027

# a random sample from N(48, 7) with 999 sample
# values binned into 30 equidistant bins ranging
# from 'small' (bin[0] lower bound) to 'big'
# (bin[29] upper bound) 
obsperbin = [ 1,  1,  2,  4,  8, 10, 13, 29, 35, 45,
             51, 56, 63, 64, 96, 89, 68, 80, 61, 51,
             49, 30, 34, 19, 22,  3,  7,  5,  1,  2]

numbins = len(obsperbin) #  30
numobs  = sum(obsperbin) # 999

# intentionally wrong guesses of mu and sigma
# to be provided as optimizer's initial values
initial_mu, initial_sigma = 78.5, 27.0


def binbounder( small , big , numbins ):
    ## generates list of bound bins for histogram ++ bincount
    binwide = ( big - small ) / numbins ## binwidth
    binleft = [] ## left edges of bins
    for index in range( numbins ):
        binleft.append( small + index * binwide )
    binbound = [val for val in binleft]
    binbound.append( big ) ## all bin edges
    return binbound

# setup the bin borders
binborders = binbounder( small , big , numbins )


def GaussDistrib( x , mu , sigma ):
    return 1/(sigma * (2*pi)**(1/2)) * exp( (-1) * (x - mu)**2 / ( 2 * (sigma **2) ) )


def expectperbin( musigma ):
    ## musigma[0] = mu
    ## musigma[1] = sigma
    ## calculates expectation values per bin
    ## expectation value of single bin is equal to area under Gaussian
    ## from left binedge to right binedge multiplied by the sample size
    e = []
    for i in range(len(binborders)-1): # ith i does not exist for rightmost boundary
        e.append( quad( GaussDistrib , binborders[ i ] , binborders[ i + 1 ],
                         args = ( musigma[0] , musigma[1] ))[0] * numobs)
    return e


def chisq( musigma ):
    ## first subscript [0] gives chi-squared value, [1] gives 0 = p-value = 1
    return chisquare( obsperbin , expectperbin( musigma ))[0]


def miniz( chisq , musigma ):
    return basinhopping( chisq , musigma , niter = 200 )


## chisquare value for initial parameter guess
chisqguess = chisquare( obsperbin , expectperbin( [initial_mu , initial_sigma] ))[0]

res = miniz( chisq, [initial_mu , initial_sigma] )

print("chisquare from initial guess:" , chisqguess)
print("chisquare after optimization:" , res.fun)
print("mu, sigma after optimization:" , res.x[0], ",", res.x[1])

chisquare from initial guess: 3772.70822797

chisquare after optimization: 26.351284911784447

mu, sigma after optimization: 48.2701027439, 7.046156286

Btw, basinhopping is overkill for this kind of problem. I'd stay with fmin (Nelder-Mead).

edited May 23 '17 at 12:18

Community

1
1

answered Mar 03 '17 at 18:38

Stefan Zobel

3,182
7
28
38

If I understand correctly, GaussDistrib is called by expectperbin. Since expectperbin input is args and GaussDistrib inputs are x and (mu, sigma), this means mu, sigma must be specified as args when passed from expectperbin to GaussDistrib. So any function that uses the same inputs from and calls expectperbin do not need to specify args. Had GaussDistrib input been only x, I would need a separate function that specifies and returns the args to be used by the outer functions. Correct? As for the minimization method, basinhop is specified as one of 3 global min solvers while fmin is not. – Mar 04 '17 at 00:52
Also, I know what the value of normc should be for a normalized distribution. Why is it that the solver shouldn't/can't optimize for normc? Is it strictly an efficiency issue? – Mar 04 '17 at 00:56
`normc`, the integration constant, has to be `1 / sqrt(2 * PI *(sigma ** 2))` (where `sigma` can vary during the optimization). Otherwise the function is no longer a probability density (doesn't integrate to `1`). Probabilities from the area under the curve would no longer be probabilities. There's no way a normal distribution could have 3 independent parameters. It's about mathematical correctness, not efficiency. And there's no reason to believe that, if introduced, `normc` would converge to the correct value under optimization. Instead you'd get a useless / wrong result for all 3 variables. – Stefan Zobel Mar 04 '17 at 11:47
As for your first comment. Sorry, I didn't understand it. – Stefan Zobel Mar 04 '17 at 11:51
That makes sense. I was originally trying to scale my data down to a normalized fit instead of normalizing to get the original unnormalized histogram back. As for args passing, I don't quite understand why the syntax for args is changing from function to function. I think it has to do with which function is passing args to another function but I'm not 100% clear. In your code, chisqguess specifies which param is which arg; chisq takes args as an input; expectperbin separates args that are called by inputs of gaussdistrib. But I don't follow what specifically causes which change in syntax. – Mar 04 '17 at 12:00
I've edited and renamed the parameters. Perhaps, that will make it more clear. – Stefan Zobel Mar 04 '17 at 12:01
You have been extremely helpful. I've corrected old mistakes and improved my understanding greatly because of your help. I appreciate you going above and beyond for me, thanks! – Mar 04 '17 at 12:05
The "extra" parameters (mu, sigma) passed to GaussDistrib from `quad` have to be a named tuple called `args`. Thats's from the definition of `quad`. All other functions can use what they want. we are using lists here, but it could be something else. – Stefan Zobel Mar 04 '17 at 12:07
I get that quad args must be a tuple and it must be called args. But for example, the args are specified in the appending line of expectperbin, before the return statement; is this necessary or a choice of convenience? I ask because specifying args in a similar way for functions that do not return a list doesn't work (unless I'm missing something). Also, are they specified as such because quad calls on the function gaussdistrib which in turn uses args as inputs? – Mar 04 '17 at 12:20
"are they specified as such because quad calls on the function gaussdistrib which in turn uses args as inputs?". Well, quad extracts the parameters from the tuple and passes them as single values to GaussDistrib. – Stefan Zobel Mar 04 '17 at 12:22
This has been extremely useful. Thanks again!! – Mar 04 '17 at 12:25
If I can ask one more thing: in the function expectperbin, theres a line that reads `args = ( musigma[0] , musigma[1] ))[0] * numobs)`. Why the `* numobs` at the end? Is it unpacking the tuple numobs times? – Mar 04 '17 at 13:12
It's multiplying the probability for bin[i] by the sample count to get the expected number of observations for that bin and a sample of the size you happen to have. You'd get wrong chi square values if you forget that step. – Stefan Zobel Mar 04 '17 at 13:49

How can I pass arguments through a chain of nested functions to calculate a result?

1 Answers1