3

I'm modeling measurement errors in a certain measuring device. This is how the data looks: high frequency sine ripples on a low frequency polynomial. My model should capture the ripples too.

The curve that fits the error should be of the form: error(x) = a0 + a1*x + a2*x^2 + ... an*x^n + Asin(x/lambda). The order n of the polynomial is not known. My plan is to iterate n from 1-9 and select the one that has the highest F-value.

I've played with numpy.polyfit and scipy.optimize.curve_fit so far. numpy.polyfit is only for polynomials, so while I can generate the "best fit" polynomial, there's no way to determine the parameters A and lambda for the sine term. scipy.optimize.curve_fit would have worked great if I already knew the order of the polynomial for the polynomial part of error(x).

Is there a clever way to use both numpy.polyfit and scipy.optimize.curve_fit to get this done? Or another library-function perhaps?

Here's the code for how I'm using numpy.polyfit to select the best polynomial:

def GetErrorPolynomial(X, Y):

    maxFval = 0.0
    for i in range(1, 10):   # i is the order of the polynomial (max order = 9)
        error_func = np.polyfit(X, Y, i)
        error_func = np.poly1d(error_func)

        # F-test (looking for the largest F value)
        numerator = np.sum(np.square(error_func(X) - np.mean(Y))) / i
        denominator = np.sum(np.square(Y - error_func(X))) / (Y.size - i - 1)
        Fval = numerator / denominator

        if Fval > maxFval:
            maxFval = Fval
            maxFvalPolynomial = error_func

    return maxFvalPolynomial

And here's the code for how I'm using curve_fit:

def poly_sine_fit(x, a, b, c, d, l):
     return a*np.square(x) + b*x + c + d*np.sin(x/l)

param, _ = curve_fit(poly_sine_fit, x_data, y_data)

It's "hardcoded" to a quadratic function, but I want to select the "best" order as I'm doing above with np.polyfit

MorganStark47
  • 154
  • 2
  • 15
  • this answer may help: https://stackoverflow.com/a/50732144/5629339 – filippo Sep 09 '19 at 11:39
  • also lmfit has some nice models which you can easily combine: https://lmfit.github.io/lmfit-py/builtin_models.html#polynomialmodel – filippo Sep 09 '19 at 11:40
  • Would you please post, or link to, the data? – James Phillips Sep 09 '19 at 15:07
  • @JamesPhillips added a plot of the data in the post – MorganStark47 Sep 12 '19 at 10:26
  • You can create a function that includes the exponent as a fitting parameter. You'd have to round it to an integer inside the function (so `curve_fit` will not find any improvements between `n = 1.2` or `n = 1.3`), and you'll have to bound this parameter between 1 and 9 (inclusive); I think `curve_fit` includes that option. – 9769953 Sep 12 '19 at 12:06
  • 1
    Looking at your graph, it looks as if another type of function is more appropriate. A chi-square function or one of its relatives may work better, perhaps with a small Gaussian added to account for the little bump (though that may just be noise). But this depends on the underlying model, which is domain knowledge we don't have. Otherwise, a spline would also fit your overall function, if you just want something to smoothly fit the overall shape, without meaning anything in particular. – 9769953 Sep 12 '19 at 12:08
  • @00 Please see my answer to this question where an Extreme Value peak equation appears to be a good candidate function to model the underlying data. – James Phillips Sep 12 '19 at 14:02

3 Answers3

3

I finally found a way to model the ripples and can answer my own question. This 2006 paper does curve-fitting on ripples that resemble my dataset.

First off, I did a least squares polynomial fit and then subtracted this polynomial curve from the original data. This left me with only the ripples. Applying the Fourier transform, I picked out the dominant frequencies which let me reconstruct the sine ripples. Then I simply added these ripples to the polynomial curve I had obtained in the beginning. That did it.

MorganStark47
  • 154
  • 2
  • 15
0

Use Scikit-learn Linear Regression

Here is a code sample I used to perform a linear regression with a polynom of degree 3 that pass by the point 0 with value 1 and null derivative. You just have to adapt the function create_vector with the function you want.

from sklearn import linear_model
import numpy as np

def create_vector(x):
    # currently representing a polynom Y = a*X^3 + b*X^2
    x3 = np.power(x, 3)
    x2 = np.power(x, 2)
    X = np.append(x3, x2, axis=1)
    return X 

data_x = [some_data_input]
data_y = [some_data_output]

x = np.array(data_x).reshape(-1, 1)
y_data = np.array(data_y).reshape(-1, 1)-1 # -1 to pass by the point (0,1)

X = create_vector(x)    

regr = linear_model.LinearRegression(fit_intercept=False)
regr.fit(X, y_data)
Coding thermodynamist
  • 1,340
  • 1
  • 10
  • 18
  • "just have to adapt the function create_vector with the function you want" -- that's the thing, I don't know what should the function be. That is the point of my post, how do I model sine ripples on a polynomial function when I don't know the order of the polynomial beforehand. – MorganStark47 Sep 17 '19 at 01:56
  • You implement your cross-validation to try different polynomial degree. – Coding thermodynamist Sep 17 '19 at 15:44
0

I extracted data from the scatterplot for analysis and found that a polynomial + sine did not seem to be an optimal model, because lower order polynomials were not following the shape of the data very well and higher order polynomials were exhibiting Runge's phenomenon of high curvature at the data extremes. I performed an equation search to find what the high-frequency sine wave might be imposed upon, and a good candidate seemed to be the Extreme Value peak equation "a * exp(-1.0 * exp(-1.0 * ((x-b)/c))-((x-b)/c) + 1.0) + offset" as shown below.

Here is a graphical Python curve fitter for this equation, at the top of the file I load the data I had extracted so you would need to replace this with the actual data. This fitter uses scipy's differential_evolution genetic algorithm module to estimate initial parameter values for the non-linear fitter, which uses the Latin Hypercube algorithm to ensure a thorough search of parameter space and requires bounds within which to search. Here those bounds are taken from the data maximum and minimum values.

Subtracting the model predictions from this fitted curve should leave you with only the sine component to be modeled. I noted that there seems to be an additional narrow, low-amplitude peak at approximately x = 275.

plot

import numpy, scipy, matplotlib
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit
from scipy.optimize import differential_evolution
import warnings

##########################################################
# load data section
f = open('/home/zunzun/temp/temp.dat')
textData = f.read()
f.close()

xData = []
yData = []
for line in textData.split('\n'):
    if line: # ignore blank lines
        spl = line.split()
        xData.append(float(spl[0]))
        yData.append(float(spl[1]))
xData = numpy.array(xData)
yData = numpy.array(yData)


##########################################################
# model to be fitted
def func(x, a, b, c, offset): # Extreme Valye Peak equation from zunzun.com
    return a * numpy.exp(-1.0 * numpy.exp(-1.0 * ((x-b)/c))-((x-b)/c) + 1.0) + offset


##########################################################
# fitting section

# function for genetic algorithm to minimize (sum of squared error)
def sumOfSquaredError(parameterTuple):
    warnings.filterwarnings("ignore") # do not print warnings by genetic algorithm
    val = func(xData, *parameterTuple)
    return numpy.sum((yData - val) ** 2.0)


def generate_Initial_Parameters():
    # min and max used for bounds
    maxX = max(xData)
    minX = min(xData)
    maxY = max(yData)
    minY = min(yData)

    minData = min(minX, minY)
    maxData = max(maxX, maxY)

    parameterBounds = []
    parameterBounds.append([minData, maxData]) # search bounds for a
    parameterBounds.append([minData, maxData]) # search bounds for b
    parameterBounds.append([minData, maxData]) # search bounds for c
    parameterBounds.append([minY, maxY]) # search bounds for offset

    # "seed" the numpy random number generator for repeatable results
    result = differential_evolution(sumOfSquaredError, parameterBounds, seed=3)
    return result.x

# by default, differential_evolution completes by calling curve_fit() using parameter bounds
geneticParameters = generate_Initial_Parameters()

# now call curve_fit without passing bounds from the genetic algorithm,
# just in case the best fit parameters are aoutside those bounds
fittedParameters, pcov = curve_fit(func, xData, yData, geneticParameters)
print('Fitted parameters:', fittedParameters)
print()

modelPredictions = func(xData, *fittedParameters) 

absError = modelPredictions - yData

SE = numpy.square(absError) # squared errors
MSE = numpy.mean(SE) # mean squared errors
RMSE = numpy.sqrt(MSE) # Root Mean Squared Error, RMSE
Rsquared = 1.0 - (numpy.var(absError) / numpy.var(yData))

print()
print('RMSE:', RMSE)
print('R-squared:', Rsquared)

print()


##########################################################
# graphics output section
def ModelAndScatterPlot(graphWidth, graphHeight):
    f = plt.figure(figsize=(graphWidth/100.0, graphHeight/100.0), dpi=100)
    axes = f.add_subplot(111)

    # first the raw data as a scatter plot
    axes.plot(xData, yData,  'D')

    # create data for the fitted equation plot
    xModel = numpy.linspace(min(xData), max(xData))
    yModel = func(xModel, *fittedParameters)

    # now the model as a line plot
    axes.plot(xModel, yModel)

    axes.set_xlabel('X Data') # X axis data label
    axes.set_ylabel('Y Data') # Y axis data label

    plt.show()
    plt.close('all') # clean up after using pyplot

graphWidth = 800
graphHeight = 600
ModelAndScatterPlot(graphWidth, graphHeight)

UPDATE ------- If the high-frequency sine component is constant (which I do not know) then modeling a small portion of the data with only a few cycles will be sufficient to determine the equation and initial parameter estimates for fitting the sine wave portion of the model. Here I have done this with the following result:

sine

from the following equation:

amplitude = -1.0362957093184177E+00
center = 3.6632754608370377E+01
width = 5.0813421718648293E+00
Offset = 5.1940843481496088E+00

pi = 3.14159265358979323846 # constant not fitted

y = amplitude * sin(pi * (x - center) / width) + Offset

Combining these two models using the actual data, rather than my scatterplot-extracted data, should be close to what you need.

James Phillips
  • 4,526
  • 3
  • 13
  • 11
  • "a polynomial + sine did not seem to be an optimal model" - yep, you're right about that. Mathematically, the "best" curve should just ignore the ripples. However, for the project I'm doing this error modeling for, the sine ripples need to be captured by the curve. Thanks anyway though. – MorganStark47 Sep 17 '19 at 02:26
  • Please see my edit for modeling the sine wave portion of the data, it is marked as "UPDATE" in my answer. – James Phillips Sep 17 '19 at 03:25
  • Thanks for all the suggestions, I found out a solution and posted it as an answer – MorganStark47 Oct 27 '19 at 13:01