
I have some data that I want to fit so I can make some estimations for the value of a physical parameter given a certain temperature.

I used numpy.polyfit for a quadratic model, but the fit isn't quite as nice as I'd like it to be and I don't have much experience with regression.

I have included the scatter plot and the model produced by numpy (S vs. temperature): the blue dots are experimental data and the black line is the model.

The x axis is temperature (in °C) and the y axis is the parameter, which we'll call S. This is experimental data, but in theory S should tend towards 0 as temperature increases and reach 1 as temperature decreases.

My question is: How can I fit this data better? What libraries should I use, what kind of function might approximate this data better than a polynomial, etc?

I can provide code, coefficients of the polynomial, etc., if it's helpful.

Here is a Dropbox link to my data. (A somewhat important note to avoid confusion: although it won't change the actual regression, the temperature column in this data set is Tc - T, where Tc is the transition temperature (40 °C). I converted it to T using pandas by calculating 40 - x.)

Jinx

7 Answers


This example code uses an equation with two shape parameters, a and b, and an offset term (which does not affect curvature). The equation is "y = 1.0 / (1.0 + exp(-a(x-b))) + Offset", with fitted parameter values a = 2.1540318329369712E-01, b = -6.6744890642157646E+00, and Offset = -3.5241299859669645E-01, giving an R-squared of 0.988 and an RMSE of 0.0085.

The example contains your posted data with Python code for fitting and graphing, with automatic initial parameter estimation using the scipy.optimize.differential_evolution genetic algorithm. The scipy implementation of differential evolution uses Latin hypercube sampling to ensure a thorough search of parameter space, which requires bounds within which to search; in this example code, those bounds are taken from the maximum and minimum data values.

[Plot: sigmoidal model fitted to the data]

import numpy, scipy, matplotlib
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit
from scipy.optimize import differential_evolution
import warnings

xData = numpy.array([19.1647, 18.0189, 16.9550, 15.7683, 14.7044, 13.6269, 12.6040, 11.4309, 10.2987, 9.23465, 8.18440, 7.89789, 7.62498, 7.36571, 7.01106, 6.71094, 6.46548, 6.27436, 6.16543, 6.05569, 5.91904, 5.78247, 5.53661, 4.85425, 4.29468, 3.74888, 3.16206, 2.58882, 1.93371, 1.52426, 1.14211, 0.719035, 0.377708, 0.0226971, -0.223181, -0.537231, -0.878491, -1.27484, -1.45266, -1.57583, -1.61717])
yData = numpy.array([0.644557, 0.641059, 0.637555, 0.634059, 0.634135, 0.631825, 0.631899, 0.627209, 0.622516, 0.617818, 0.616103, 0.613736, 0.610175, 0.606613, 0.605445, 0.603676, 0.604887, 0.600127, 0.604909, 0.588207, 0.581056, 0.576292, 0.566761, 0.555472, 0.545367, 0.538842, 0.529336, 0.518635, 0.506747, 0.499018, 0.491885, 0.484754, 0.475230, 0.464514, 0.454387, 0.444861, 0.437128, 0.415076, 0.401363, 0.390034, 0.378698])


def func(x, a, b, Offset): # Sigmoid A With Offset from zunzun.com
    return  1.0 / (1.0 + numpy.exp(-a * (x-b))) + Offset


# function for genetic algorithm to minimize (sum of squared error)
def sumOfSquaredError(parameterTuple):
    warnings.filterwarnings("ignore") # do not print warnings by genetic algorithm
    val = func(xData, *parameterTuple)
    return numpy.sum((yData - val) ** 2.0)


def generate_Initial_Parameters():
    # min and max used for bounds
    maxX = max(xData)
    minX = min(xData)
    maxY = max(yData)
    minY = min(yData)

    parameterBounds = []
    parameterBounds.append([minX, maxX]) # search bounds for a
    parameterBounds.append([minX, maxX]) # search bounds for b
    parameterBounds.append([0.0, maxY]) # search bounds for Offset

    # "seed" the numpy random number generator for repeatable results
    result = differential_evolution(sumOfSquaredError, parameterBounds, seed=3)
    return result.x

# generate initial parameter values
geneticParameters = generate_Initial_Parameters()

# curve fit the test data
fittedParameters, pcov = curve_fit(func, xData, yData, geneticParameters)

print('Parameters', fittedParameters)

modelPredictions = func(xData, *fittedParameters) 

absError = modelPredictions - yData

SE = numpy.square(absError) # squared errors
MSE = numpy.mean(SE) # mean squared errors
RMSE = numpy.sqrt(MSE) # Root Mean Squared Error, RMSE
Rsquared = 1.0 - (numpy.var(absError) / numpy.var(yData))
print('RMSE:', RMSE)
print('R-squared:', Rsquared)



##########################################################
# graphics output section
def ModelAndScatterPlot(graphWidth, graphHeight):
    f = plt.figure(figsize=(graphWidth/100.0, graphHeight/100.0), dpi=100)
    axes = f.add_subplot(111)

    # first the raw data as a scatter plot
    axes.plot(xData, yData,  'D')

    # create data for the fitted equation plot
    xModel = numpy.linspace(min(xData), max(xData))
    yModel = func(xModel, *fittedParameters)

    # now the model as a line plot 
    axes.plot(xModel, yModel)

    axes.set_xlabel('X Data') # X axis data label
    axes.set_ylabel('Y Data') # Y axis data label

    plt.show()
    plt.close('all') # clean up after using pyplot

graphWidth = 800
graphHeight = 600
ModelAndScatterPlot(graphWidth, graphHeight)
James Phillips
  • It seems that scipy.optimize.curve_fit struggles when values are incredibly small or incredibly large. I scaled some of my data to get a better fit (I was getting a straight line at one point). Do you know why this is the case? – Jinx Aug 24 '18 at 21:27
  • I know that in such cases the starting parameters can make a big difference - in my understanding this does not happen when fitting data to a simple straight line, for example. When I fit this specific equation using my open source zunzun.com web site, the genetic algorithm's random population is drawn from a set of large, medium and small numbers partly for this reason. You can directly try the web site version of this equation here http://zunzun.com/Equation/2/Sigmoidal/Sigmoid%20A%20With%20Offset/ to see if scaling the data is needed when using that custom version of the genetic algorithm. – James Phillips Aug 25 '18 at 01:04
  • How to predict the values using this model? – Mahamutha M Jun 24 '19 at 05:56
  • @MahamuthaM - Use the equation and parameters that I provided. – James Phillips Jun 24 '19 at 12:26

I would suggest checking out scipy. It has a non-linear optimizer for fitting data to arbitrary functions; see the documentation for scipy.optimize.curve_fit here. Be aware that the more complex the function, the longer it will take to fit.
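For example, here is a minimal sketch of curve_fit with the sigmoid-plus-offset model used elsewhere on this page. The short data arrays are placeholders; substitute your full temperature and S columns.

import numpy as np
from scipy.optimize import curve_fit

# placeholder data; substitute the full temperature (x) and S (y) columns
xData = np.array([19.2, 15.8, 10.3, 6.2, 2.6, -0.2, -1.6])
yData = np.array([0.64, 0.63, 0.62, 0.60, 0.52, 0.45, 0.38])

# sigmoid with an offset, so S levels off at both temperature extremes
def sigmoid(x, a, b, offset):
    return 1.0 / (1.0 + np.exp(-a * (x - b))) + offset

# p0 is a rough initial guess that curve_fit refines by nonlinear least squares
params, pcov = curve_fit(sigmoid, xData, yData, p0=[1.0, 0.0, 0.0])
print(params)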

PMende

For a non-linear regression problem, you could try SVR(), KNeighborsRegressor(), or DecisionTreeRegressor() from sklearn, and compare the model performance on a test set, as in the sketch below.
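A minimal sketch, assuming synthetic placeholder data in place of the real measurements (swap in your temperature and S arrays); score() reports R-squared on the held-out test set:

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

# placeholder data; sklearn expects a 2-D feature matrix X
X = np.linspace(-2, 20, 40).reshape(-1, 1)
y = 1.0 / (1.0 + np.exp(-0.2 * (X.ravel() + 6.7))) - 0.35

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

for model in (SVR(kernel='rbf'), KNeighborsRegressor(n_neighbors=3), DecisionTreeRegressor(max_depth=4)):
    model.fit(X_train, y_train)
    # score() returns R^2 on the held-out test set
    print(type(model).__name__, model.score(X_test, y_test))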

Sunny Liu

In scikit-learn, you can use PolynomialFeatures to first transform your training data to have more degrees of freedom. After that, you can use Ridge regression to fit your training data. A short sketch is below.
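A minimal sketch of that pipeline, again with placeholder data standing in for the real measurements; the degree and alpha values are illustrative, not tuned:

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import Ridge

# placeholder data; replace with the temperature (X) and S (y) values
X = np.linspace(-2, 20, 40).reshape(-1, 1)
y = 1.0 / (1.0 + np.exp(-0.2 * (X.ravel() + 6.7))) - 0.35

# expand the single feature into polynomial terms, then fit a regularized linear model
model = make_pipeline(PolynomialFeatures(degree=4), Ridge(alpha=1.0))
model.fit(X, y)
print(model.score(X, y))  # R^2 on the training data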


Here are a few options for creating a mathematical expression from your data:

  • nonlinear regression with a sigmoidal equation (as in the other answer)
  • cubic-spline interpolation
  • a small neural network (deep learning)

I created a script with Python Gekko to demonstrate each of these.

from gekko import GEKKO
import numpy as np
import matplotlib.pyplot as plt

xData = np.array([19.1647,18.0189,16.955,15.7683,14.7044,13.6269,12.604,\
                  11.4309,10.2987,9.23465,8.1844,7.89789,7.62498,7.36571,\
                  7.01106,6.71094,6.46548,6.27436,6.16543,6.05569,5.91904,\
                  5.78247,5.53661,4.85425,4.29468,3.74888,3.16206,2.58882,\
                  1.93371,1.52426,1.14211,0.719035,0.377708,0.0226971,\
                  -0.223181,-0.537231,-0.878491,-1.27484,-1.45266,-1.57583,\
                  -1.61717])
yData = np.array([0.644557,0.641059,0.637555,0.634059,0.634135,0.631825,\
                  0.631899,0.627209,0.622516,0.617818,0.616103,0.613736,\
                  0.610175,0.606613,0.605445,0.603676,0.604887,0.600127,\
                  0.604909,0.588207,0.581056,0.576292,0.566761,0.555472,\
                  0.545367,0.538842,0.529336,0.518635,0.506747,0.499018,\
                  0.491885,0.484754,0.47523,0.464514,0.454387,0.444861,\
                  0.437128,0.415076,0.401363,0.390034,0.378698])


m = GEKKO(remote=False)

# nonlinear regression
a,b,c = m.Array(m.FV,3,value=0,lb=-10,ub=10)
x = m.MV(xData); y = m.CV(yData)
a.STATUS=1; b.STATUS=1; c.STATUS=1; y.FSTATUS=1
m.Equation(y==1.0/(1.0+m.exp(-a*(x-b)))+c)

# cubic spline
z = m.Var()
m.cspline(x,z,xData,yData,True)

m.options.IMODE = 2; m.options.EV_TYPE = 2
m.solve()

# stats (from other answer)
absError = y.value - yData
SE = np.square(absError) # squared errors
MSE = np.mean(SE) # mean squared errors
RMSE = np.sqrt(MSE) # Root Mean Squared Error, RMSE
Rsquared = 1.0 - (np.var(absError) / np.var(yData))
print('RMSE:', RMSE)
print('R-squared:', Rsquared)
print('Parameters', a.value[0], b.value[0], c.value[0])

# deep learning
from gekko import brain
b = brain.Brain()
b.input_layer(1)
b.layer(linear=1)
b.layer(tanh=2)
b.layer(linear=1)
b.output_layer(1)
b.learn(xData,yData,obj=1,disp=False) # train
xp = np.linspace(min(xData),max(xData),100)
w = b.think(xp) # predict

plt.plot(xData,yData,'k.',label='data')
plt.plot(x.value,y.value,'r:',lw=3,label=r'$1/(1+\exp(-a(x-b)))+c$')
plt.plot(x.value,z.value,'g--',label='c-spline')
plt.plot(xp,w[0],'b-.',label='deep learning')
plt.legend(); plt.show()

The regression results are:

RMSE: 0.008428738368115708
R-squared: 0.988622263162808
Parameters 0.2154031832 -6.6744890468 -0.3524129987

The deep learning result is similar to the single regression equation, but the layers and activation functions are more easily adjusted than creating an equation form yourself. The advantage of the single equation is that it may extrapolate better than a machine-learned model. Interpolation such as a piecewise-linear or cubic-spline function may be good if you don't need to extrapolate and there is little variability in individual data points. Here is more information that I created on regression and interpolation, with more examples in the Jupyter notebooks.

John Hedengren

Try a support vector machine with a polynomial kernel.

With scikit-learn, fitting a model can be as simple as:

from sklearn.svm import SVC
#... load the data into X,y
model = SVC(kernel='poly')
model.fit(X,y)
#plot the model...
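Note that SVC is a classifier; as the comments below point out, a continuous target like S calls for the regression counterpart SVR. A minimal sketch with placeholder data (the kernel degree and C value are illustrative, not tuned):

import numpy as np
from sklearn.svm import SVR

# placeholder data; X must be 2-D (n_samples, n_features) for scikit-learn
X = np.linspace(-2, 20, 40).reshape(-1, 1)
y = 1.0 / (1.0 + np.exp(-0.2 * (X.ravel() + 6.7))) - 0.35

# SVR is the regression counterpart of SVC; degree controls the polynomial kernel
model = SVR(kernel='poly', degree=3, C=10.0)
model.fit(X, y)
print(model.predict(X[:5]))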
  • So, I gave this a shot, and I first had to reshape my x values into a 2D array using np.reshape, but now I'm getting the error "Unknown label type: 'continuous'" because my data are floats. It seems like there are a few posts on StackOverflow on how to get around this. – Jinx Aug 22 '18 at 19:13
  • Oops! Actually what you should be using is an SVM with a polynomial kernel; I'll update my answer. The problem is that the LogisticRegression model I linked is a classifier and not a regressor! – Kevin Koehler Aug 22 '18 at 19:46
  • @Jinx that's because this SVC is a classifier. You should use sklearn's SVR instead, which is for regression. – Osama Dar Sep 30 '19 at 09:27

A more robust nonlinear optimization can be obtained by mitigating the leverage of outliers with a sublinear loss function such as soft L1.

https://scipy-cookbook.readthedocs.io/items/robust_regression.html

Also, any domain knowledge of the solution bounds can remove the need for genetic generation of initial parameters. This is especially relevant if the features have not been scaled. A minimal sketch follows.
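A minimal sketch using scipy.optimize.least_squares with loss='soft_l1', reusing the sigmoid-with-offset model from the other answers; the data arrays, starting values, and bounds are placeholders to adapt to your own problem:

import numpy as np
from scipy.optimize import least_squares

# placeholder data; replace with the actual measurements
xData = np.linspace(-2, 20, 40)
yData = 1.0 / (1.0 + np.exp(-0.2 * (xData + 6.7))) - 0.35

# residuals of the sigmoid-with-offset model used elsewhere on this page
def residuals(p, x, y):
    a, b, offset = p
    return 1.0 / (1.0 + np.exp(-a * (x - b))) + offset - y

# loss='soft_l1' down-weights outliers; bounds encode domain knowledge and
# can replace a genetic search for starting values
result = least_squares(residuals, x0=[1.0, 0.0, 0.0], args=(xData, yData),
                       loss='soft_l1', f_scale=0.1,
                       bounds=([-10, -10, -1], [10, 10, 1]))
print(result.x)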

DanGitR