1

How would one allow for one parameter of a function to vary between groups while the other parameters are fit across all of the groups?

I am using lmfit to fit a model for disease spread. I would like the exponents of the function to be fit across all data points, but the scaling factor needs to vary between different groups (to act as a proxy for different strains of the disease with varying reproduction rates in different years).

See my code below:

#### create parameters ####
params = Parameters()
params.add('tau_1', value=1.,min=0.01)
params.add('tau_2', value=1.,min=0.01)
params.add('rho_1', value=1.,min=0.01)
for i in range(0, len(sorted(variable_data.flu_season.unique()))):
    season = str(sorted(variable_data.flu_season.unique())[i])
    params.add('theta_' + season, value=1., min = 0.01)

#### model ####
def pop_dist_inverse_grav(params, dist, host_pop, targ_pop, flu_sea, data):

    parvals = params.valuesdict()
    tau_1 = parvals['tau_1']
    tau_2 = parvals['tau_2']
    rho_1 = parvals['rho_1']

    theta_1 = parvals['theta_' + str(season)]

    grav_model = theta_1 * (np.power(dist, rho_1)) / ((np.power(host_pop, tau_1)) * (np.power(targ_pop, tau_2)))

    return grav_model - data

Following M Newville's advice I got the fitting to work. Code below:

import pandas as pd
from lmfit import minimize, Parameters, report_fit
import numpy as np

variable_data = pd.read_csv("../Data/variables_for_model_fit.csv", sep=",", header='infer')

season_list = sorted(variable_data.flu_season.unique())


#### create parameters ####
params1 = Parameters()
params1.add('tau_host', value=0.24,min=0, max = 3)
params1.add('tau_targ', value=0.14,min=0, max = 3)
params1.add('rho', value=0.29,min=0, max = 3)
# "global" parameters

for i, season in enumerate(season_list):
    params1.add('theta_%d' % season_list[i], value=1000., min=1, max=1e6)
    # creates theta parameters for each season

#### define model ####
def grav_dist_over_pop(dist, hist_pop, targ_pop, theta, rho, tau_host, tau_targ):
    return theta**(-1) * dist**rho * host_pop**(-tau_host) * targ_pop**(-tau_targ)

#### objective function ####
def objective_1(params, dist, host_pop, targ_pop, flu_sea, data):
    parvals1 = params1.valuesdict()
    resid = np.zeros((len(season_data),len(season_data)))

    for i, data in enumerate(data):

        theta = parvals1['theta_%d' % flu_sea[i]]
        rho = parvals1['rho']#_%d' % flu_sea[i]]
        tau_host = parvals1['tau_host']#_%d' % flu_sea[i]]
        tau_targ = parvals1['tau_targ']#_%d' % flu_sea[i]]
        model = grav_dist_over_pop(dist, host_pop, targ_pop, theta, rho, tau_host, tau_targ)
        resid[i, :] = model - data
    return resid.flatten()

#### fit global variables ####
season_data = variable_data.sample(500)
# my dataset is huge so lmfit takes an age when fitting all of the
# theta values
host_pop = np.asarray(season_data.host_city_pop.values.tolist())
targ_pop = np.asarray(season_data.target_city_pop.values.tolist())
dist = np.asarray(season_data.distance.values.tolist())
data = np.asarray(season_data.time_to_spread.values.tolist())
flu_sea = np.asarray(season_data.flu_season.values.tolist())

result = minimize(objective_1, params1, args=(dist, host_pop, targ_pop, flu_sea, data))

report_fit(result.params)
aL_eX
  • 1,453
  • 2
  • 15
  • 30
Lorcán
  • 555
  • 3
  • 15

1 Answers1

0

A fit with lmfit (or scipy.optimize and all other tools I am aware of) is always a "global" fit, finding optimal values for a single set of parameters that are used to minimize a 1D residual array. To be clear, you can optimize parameters to fit multiple data sets (or groups or seasons), but you have to reduce the problem to a single set of parameters calculating a 1D residual array.

For your problem, I would recommend starting with defining the "grav model" that models the data for a single season or dataset. I know nothing about this sort of field (but glad that lmfit might be useful!), but from your example this might look like this (if not, please correct)

def grav_season(dist, host_pop, targ_pop, theta, rho, tau_host, tau_targ):
    return theta * dist**rho * host_pop**(-tau_host) * targ_pop**(-tau_targ)

which appears to me to have 3 independent variables (dist, host_pop, targ_pop) and 4 parameters that might be variables: theta, rho, tau_host, tau_targ. Again, please correct if needed, but for the purposes here these details are not so important.

To fit a single season / dataset, you could do

from lmfit import Parameters, minimize, report_fit

def objective(params, data, dist, host_pop, targ_pop):
     pars = params.valuesdict()
     model = grav_season(dist, host_pop, targ_pop, pars['theta'],
                         pars['rho'], pars['tau_host'], pars['tau_targ'])
     return (model - data)

params = Parameters()
param.add('theta', value=1., min = 0.01)
param.add('rho', value=1., min = 0.01)
param.add('tau_host', value=1., min = 0.01)
param.add('tau_targ, value=1., min = 0.01)

result = minimize(objective, params, args=(data, dist, host_pop, targ_pop))
report_fit(result.params)

Now, for multiple seasons, you could simply make a theta, rho, tau_host, tau_targ parameter for each season. But,then there isn't much point in using all that data together in a single fit. If I understand correctly, you would like tau_host, tau_targ, and rho to have the same value for all seasons.

To do this, create Parameter for the globally-applied exponents:

params = Parameters()
param.add('rho', value=1., min = 0.01)
param.add('tau_host', value=1., min = 0.01)
param.add('tau_targ', value=1., min = 0.01)

and per-season parameters:

for i, season in enumerate(sorted(variable_data.flu_season.unique())):
     param.add('theta_%d' % i, value=1., min = 0.01)
     param.add('rho_%d' % i,  expr='rho')
     param.add('tau_host_%d' % i, expr='tau_host')
     param.add('tau_targ_%d' % i, expr='tau_targ')

Note that theta_i will vary independently, but rho_i, etc will be constrained to take the value of the variable rho, etc. This gives a full set of parameters per season, but ones meeting your constraints. This approach allows you to easily change one or more of these to vary separately if that looks like it might require more testing. To do that, you could just say:

    params['rho_7'].set(value=2.0, vary=True, expr='')  # vary independently

To use this set of multi-season parameters, you will also need multi-season data. I'm not sure if dist, host_pop and targ_pop are meant to vary with season, or only data. I'll assume that only data changes with season (but it should be easy enough to change if not). Construct a list with data for each season, and then modify your objective function would look something like:

def objective(params, season_data, dist, host_pop, targ_pop):
    pars = params.valuesdict()
    resid = np.zeros((len(season_data), len(season_data[0]))
    for i, data in enumerate(season_data):
        theta = pars['theta_%d' % i]
        rho = pars['rho_%d' % i]
        tau_host = pars['tau_host_%d' % i]
        tau_targ = pars['tau_targ_%d' % i]
        model = grav_season(dist, host_pop, targ_pop,
                            theta, rho, tau_host, tau_targ)
        resid[i, :] = model - data
    return resid.flatten()

result = minimize(objective, params, args=(season_data, dist, host_pop, targ_pop))
report_fit(result.params)

Hope that helps get you started. Again, the main recommendations are:

  1. create a per-season / per-dataset model funciton for the phenomenon you are modeling.
  2. use 'expr' to constrain per-season parameters to take "global values". Though you want all seasons to have the same value, it might be that some parameters might vary linearly with season, or are required to add to some value or something else that means they aren't really independent.
M Newville
  • 7,486
  • 2
  • 16
  • 29
  • Thank you for your detailed answer. I have adjusted my code accordingly. I do however still encounter an error (see edited original post) and I am unsure as how to proceed. – Lorcán Feb 14 '18 at 16:54
  • I don't see any error message. In general, make the simplest possible script that works then build up from there. For example, do you have a script that works for one season? – M Newville Feb 14 '18 at 18:12
  • Sorry the link should been for [this](https://stackoverflow.com/questions/39027346/valueerror-the-input-contains-nan-values-from-lmfit-model-despite-the-input-n) – Lorcán Feb 14 '18 at 19:46
  • When I run the script for just one season it returns `ValueError: The input contains nan values`. I have checked the dataframe for nan values with `variable_data.isnull().values.any()` but it came back `False` – Lorcán Feb 14 '18 at 20:02
  • No, that link is not from your code, it is from a separate issue. Show the code you used and the messages you got. Not the errors someone else got. Of course, you might get nans from exponentiation -- you might want to check for and guard against that. Again: does this work for a single data set? Why did you ignore this suggestion? – M Newville Feb 14 '18 at 20:45
  • Sorry, I did not ignore this suggestion. I had been running it for one season but I had neglected to update the code above. Please see above for the code I am running for one season (the data for every season is in one dataframe - I have been extracting data for single seasons using pandas conditions) and the complete terminal output for the error. Again, my apologies. – Lorcán Feb 14 '18 at 21:19
  • 1. print out the values of the parameters in the objective function -- are those values allowed for exponents? 2. Run your `grav_dist_over_pop` with test input and reasonably expected values. This will also test the conversion of your Pandas data to numpy arrays, and so forth. 3. if all else fails, set `nan_policy=omit` in the call to `minimize()` and see if that helps. Again, start out simple and work your way to more complex: is your data valid? Can you compute your model once? what values for the parameters cause NaNs? – M Newville Feb 14 '18 at 22:50
  • I have managed to successfully implement your advice. I had to constrain my rho and tau values more tightly and it didn't produce any more NaN errors. I have posted the completed code above. Thank you for all of your help! – Lorcán Feb 20 '18 at 16:53