
Is there a function or library in Python to automatically compute the best polynomial fit for a set of data points? I am not really interested in the ML use case of generalizing to new data; I am just focusing on the data I have. I realize that the higher the degree, the better the fit, so I want something that penalizes complexity, or looks at where the error curve elbows. By elbowing, I mean something like this (although usually it is not so drastic or obvious): [image: plot of error versus polynomial degree, dropping steeply and then flattening out]

One idea I had was to use NumPy's polyfit (https://docs.scipy.org/doc/numpy-1.15.0/reference/generated/numpy.polyfit.html) to compute polynomial regressions over a range of degrees. Polyfit requires the user to specify the degree of the polynomial, which poses a challenge because I don't have any assumptions or preconceived notions. The higher the degree of the fit, the lower the error will be, but eventually it plateaus like the image above. Therefore, if I want to automatically compute the degree at which the error curve elbows: if my error is E and d is my degree, I want to maximize the second difference (E[d-1] - E[d]) - (E[d] - E[d+1]).
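A minimal sketch of this idea, using synthetic sine data as a stand-in for the real points (the data, the degree range, and the MSE cost are all assumptions for illustration): sweep `np.polyfit` over degrees, record the error, and pick the degree where the second difference of the error curve peaks.

```python
import numpy as np

# Hypothetical data; replace with your own points.
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = np.sin(x) + rng.normal(scale=0.1, size=x.size)

degrees = range(1, 10)
errors = []
for d in degrees:
    coeffs = np.polyfit(x, y, d)
    residuals = y - np.polyval(coeffs, x)
    errors.append(np.mean(residuals ** 2))  # MSE for degree d

# Second difference of the error curve: it is largest at the elbow,
# where the improvement from adding one more degree suddenly drops off.
e = np.array(errors)
second_diff = (e[:-2] - e[1:-1]) - (e[1:-1] - e[2:])
best_degree = degrees[int(np.argmax(second_diff)) + 1]
print(best_degree)
```

Note this only compares adjacent degrees, so a gradual plateau (as in most real data) can make the argmax unstable; that is exactly the weakness the information-criterion approaches below the fold try to address.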

Is this even a valid approach? Are there other tools and approaches in well-established Python libraries like NumPy or SciPy that can help with finding the appropriate polynomial fit (without me having to specify the order/degree)? I would appreciate any thoughts or suggestions! Thanks!

Jane Sully
  • You want to somewhat minimize RSS but also penalize overfitting? I would try a handful of different splines and go by inspection. Can you analytically define your cost function for 'elbowing'? That'd help. – charley Jan 06 '19 at 23:33
  • I guess so. I was given the task of finding the "best" or "most appropriate" polynomial fit for an arbitrary set of data points. Since the problem was vague, I felt the need to impose conditions and go from there (although please provide input if you think of something else). I was leaning towards MSE because that is what I am most familiar with, but I again the cost function can be changed. Could you also elaborate on the splines approach you suggested? Thank you! – Jane Sully Jan 06 '19 at 23:41
  • I was thinking about how within the same degree you can have different kinds of splines, like natural and clamped. With more fits per degree, I was hoping that elbow would be more gradual, an even better model lurking somewhere in the gap. The tradeoff would be finding a great fit versus more complicated implementation. – charley Jan 06 '19 at 23:56
  • Do you have some idea of the maximum curvature to be expected from such an automated fit of the data - that is, do you have some idea from the expected data what might be an overfitted polynomial based on the derivative? – James Phillips Jan 07 '19 at 01:39
  • If you have elbows I suggest using a spline instead of a polynomial (a spline is essentially a piecewise polynomial). See [spline definition](https://en.wikipedia.org/wiki/Spline_(mathematics)) and [scipy splines](https://docs.scipy.org/doc/scipy/reference/tutorial/interpolate.html#id5). You can also try different polyfits and compare error vs. degree. Another approach is to use multivariate regression against different powers of `x` – Tarifazo Jan 07 '19 at 12:40

1 Answer


To select the "right" fit and prevent over-fitting, you can use the Akaike Information Criterion or the Bayesian Information Criterion. Note that your fitting procedure can be non-Bayesian and you can still use these to compare fits. Here is a quick comparison between the two methods.
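A sketch of how AIC could drive the degree choice (the data here is made up for illustration, and this uses the common Gaussian-likelihood form of AIC, n·ln(RSS/n) + 2k, with k = d + 1 fitted coefficients; BIC would just swap the 2k penalty for k·ln(n)):

```python
import numpy as np

# Hypothetical data whose true generating polynomial is degree 2.
rng = np.random.default_rng(1)
x = np.linspace(0, 10, 60)
y = 0.5 * x**2 - x + rng.normal(scale=2.0, size=x.size)

def aic(n, rss, k):
    # Gaussian-likelihood AIC: n * ln(RSS/n) + 2k,
    # where k is the number of fitted parameters.
    return n * np.log(rss / n) + 2 * k

scores = {}
for d in range(1, 8):
    coeffs = np.polyfit(x, y, d)
    rss = np.sum((y - np.polyval(coeffs, x)) ** 2)
    scores[d] = aic(len(x), rss, d + 1)

best = min(scores, key=scores.get)
print(best)  # the degree with the lowest AIC
```

Unlike the raw elbow heuristic, the 2k penalty grows with every added coefficient, so the score has a genuine minimum rather than just flattening out.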

VBB