How do I do multivariate non-linear regression in Python?

Question

Let's say my actual equation is y = a * b + c

So my data set looks like

And so forth. What module do I use in order to have an output that tells me "y = a * b + c"? Is this even possible?

How about y = a * a + b? Any pointers to documentation or explanation of what I should try would be great.

Edit:

The duplicate is clearly a different scenario. In that example there is a single formula that describes a line; in my example it is many variables that mostly fit a result. That other one does not talk about squared terms.

Possible duplicate of [How to solve a pair of nonlinear equations using Python?](https://stackoverflow.com/questions/8739227/how-to-solve-a-pair-of-nonlinear-equations-using-python) — G. Anderson, Apr 29 '19 at 20:48
Did you try hitting the exact same question into the google search bar? — Adarsh Chavakula, Apr 29 '19 at 21:00
I did and I thought I didn't find anything. Did I miss it? Please just point me to it, I'm not good on the google. Cheers. — Sebastian, Apr 29 '19 at 21:18
@G.Anderson: That link doesn't fit this question, as OP has already pointed out. — Prune, Apr 29 '19 at 23:40
This is a topic I see 2-3 times a year on the Python group. It's hard to search, hard to answer, and I haven't been able to find the previous references to close this as a duplicate. — Prune, Apr 30 '19 at 18:12

Prune · Accepted Answer · 2019-05-01T21:03:02.497

There is no module. Your general problem is "what simple function best fits this data?" There is no general solution, as "simple" requires proper definition and restriction to yield a meaningful answer.

A basic theorem of algebra shows that a data set of N points in one variable, with distinct x values, can be fitted exactly by a polynomial of degree no more than N-1. Restricting the answer further than this requires that you define a search space and explore within that definition.
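To illustrate the one-variable case of that theorem, here is a small sketch using NumPy. The five data points are made up for the example; the only thing that matters is that the x values are distinct.

```python
import numpy as np

# Five made-up points with distinct x values (illustrative data only).
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.0, 3.0, 2.0, 5.0, 4.0])

# A polynomial of degree N-1 has N coefficients, enough to pass through all N points.
coeffs = np.polyfit(x, y, deg=len(x) - 1)
fitted = np.polyval(coeffs, x)

# Largest gap between the fitted curve and the data; expected to be floating-point noise.
max_error = np.max(np.abs(fitted - y))
print(max_error)
```

The fit is exact but tells you nothing about the underlying relationship, which is why "simple" has to be defined before the question has a useful answer.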

Yes, there exist methods to set a maximum degree and work within that; you can write a loop to increase that degree until you find an exact solution.

I suggest that you look at the regression and curve-fitting methods of scikit-learn and employ those in a solution of your own devising. You may need to work through all combinations of terms up to your chosen degree, adding new terms each time you increase the degree. You may also need to write the exploration so that it considers those terms in order of complexity, however you define it.

Response to OP comment:

I see; you're somewhat following in the footsteps of FiveThirtyEight.com, best known for accuracy with baseball and elections in the USA. Depending on the accuracy you want, this problem gets nasty very quickly. You get terms such as ((MY_OFF-OPP_DEF) ^ 1.28 + 2.1 - sqrt(OPP_GK)) / BLAH.

In any case, you're likely into a deep learning regression application, somewhat more complex than a "simple" sum-of-products scenario. You might get acceptable results with "mere" machine learning, but be prepared for disappointment in even the simpler task of predicting the winner.

Thanks for the reply (and for the edits on my OP), Prune. I have played a bit with scikit-learn machine learning after finding this page (https://towardsdatascience.com/machine-learning-with-python-easy-and-robust-method-to-fit-nonlinear-data-19e8a1ddbd49). For transparency's sake, I'm playing with hockey statistics and trying to take various inputs and model how closely they correlate with goals, and which are most predictive of goals. It isn't a simple linear relationship, there are various degrees of impact these metrics have on goals, so a machine learning pipeline seemed to do the trick. — Sebastian, May 01 '19 at 19:10
Yes, it is fairly complex and basically impossible to predict, and hockey is even worse because goals are a very rare event and as such a team who 'plays better' may not always win, because the other team just gets a couple of lucky bounces. In the long run these scenarios average out and the better teams usually win. So I'm just hoping to get python/machine learning to do the legwork for me :) I certainly know it's not possible to predict winners with even 60% certainty, or else whoever figured that out would be rich from betting. Thanks again! I appreciate the helpful insight :) — Sebastian, May 02 '19 at 15:39

score 1 · Answer 2 · edited May 06 '20 at 14:39

Have you thought about giving the scikit-learn Gradient Boosting Regressor a try? Please refer to the user guide for code examples of how this method can be used on regression problems.
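As a rough sketch of what that looks like (not taken from the user guide), here is `GradientBoostingRegressor` with default settings on synthetic data generated from the question's y = a * b + c. The data, the split, and the random seeds are assumptions made for the example.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

# Synthetic data from the question's formula y = a * b + c (illustrative only).
rng = np.random.default_rng(0)
X = rng.uniform(-5.0, 5.0, size=(2000, 3))  # columns: a, b, c
y = X[:, 0] * X[:, 1] + X[:, 2]

# Hold out part of the data to check how well the model generalizes.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = GradientBoostingRegressor(random_state=0)
model.fit(X_train, y_train)

# score() returns the coefficient of determination (R^2) on the held-out set.
r2 = model.score(X_test, y_test)
print(f"held-out R^2: {r2:.3f}")
```

Note that a tree ensemble gives you predictions and feature importances (`model.feature_importances_`), not a closed-form expression such as "y = a * b + c".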

Please also note that the documentation states:

scikit-learn 0.21 introduces two new experimental implementations of gradient boosting trees, namely HistGradientBoostingClassifier and HistGradientBoostingRegressor, inspired by LightGBM. These histogram-based estimators can be orders of magnitude faster than GradientBoostingClassifier and GradientBoostingRegressor when the number of samples is larger than tens of thousands of samples.
