0

I have a script in which I take a dataframe, which looks something like this:

enter image description here

and convert some columns to numpy arrays for processing. I then use a small function that I wrote which uses statsmodels.api to compute a linear regression based on two arrays that I pass into the function. The function then returns the statistics and the linear fit equation:

def computeLinearStats(x, y, yName, calc_tau = False):
    '''
    Takes as an argument two numpy arrays, one for x and one y, and a string for the
    name of the y-variable, and a boolean for whether to calculate tau.
    Uses Ordinary Least Squares to compute the statistical parameters for the
    array against log(z), and determines the equation for the line of best fit.
    Returns the results summary, residuals, statistical parameters in a list,
    the best fit equation, and Kendall's tau.
    '''

    #   Mask NaN values in both axes
    mask = ~np.isnan(y) & ~np.isnan(x)
    #   Compute model parameters
    model = sm.OLS(y[mask], sm.add_constant(x[mask]), missing= 'drop')
    results = model.fit()
    residuals = results.resid
    if calc_tau:
        tau = stats.kendalltau(x, y, nan_policy= 'omit')
    else:
        tau = [1, 1]    #   Use this to exclude computation of tau
#    

    #   Compute fit parameters
    params = stats.linregress(x[mask], y[mask])
    fit = params[0]*x + params[1]
    fitEquation = '$(%s)=(%.4g \pm %.4g) \\times log_{10}(redshift)+%.4g$'%(yName,
                    params[0],  #   slope
                    params[4],  #   stderr in slope
                    params[1])  #   y-intercept
    return results, residuals, params, fit, fitEquation, tau

For example, say I'm looking for a linear fit between loz(z) and 'B-I' from the dataframe. After calculating these variables, I would call

results, residuals, params, fit, equation, tau = qf.computeLinearStats(log_z, (B-I), 'B-I', calc_tau = False)

to get the linear fit.

Everything works fine, but now I need to fit a polynomial rather than a linear fit.

I've tried

sources['log_z'] = np.log10(sources.z)
mask = ~np.isnan((B-I)) & ~np.isnan(log_z)
model = ols(formula='(B-I) ~ log_z', data = [log_z[mask], (B-I) 
[mask]]).fit()

and

model = ols(formula='(B-I) + np.power((U-R),2) ~ log_z', data = [log_z[mask], (B-I)[mask]]).fit()

But I get

PatsyError: Error evaluating factor: TypeError: list indices must be integers or slices, not str
    (B-I) ~ log_z
            ^^^^^

even though both x and y are arrays, not strings.

What's the easiest way to find a polynomial fit in this situation -- say, something like (B-I) + (U-R)**2 against log(z)? Slide 41 and onwards on this site seems to be a starting point, but I'm confused about how to apply it.

feedMe
  • 3,431
  • 2
  • 36
  • 61
Jim421616
  • 1,434
  • 3
  • 22
  • 47
  • In the Patsy notation, I think it's looking for a column `'log_z'` which doesn't exist. I believe you'd need something like `'(B-I) ~ np.log(z)'`. I had issues in the past with squaring things, but your other formula could be `'(B-I) + np.power((U-R),2) ~ np.log(z)'` – ALollz Dec 13 '18 at 22:11
  • Sorry, I left out my declaration of log(z) because that's not what's causing the problem. I've added a definition for it in the code snippet now. – Jim421616 Dec 13 '18 at 22:38
  • Check [this post](https://stackoverflow.com/questions/11479064/multiple-linear-regression-in-python). You can either do a multiple linear regression where your independant variables are `x, x**2, x**3 ...`or use `scipy.optimize.curve_fit` – Tarifazo Dec 17 '18 at 15:49

0 Answers0