23

I have a bunch of data, generally in the form a, b, c, ..., y

where y = f(a, b, c...)

Most of them are three and four variables, and have 10k - 10M records. My general assumption is that they are algebraic in nature, something like:

y = P1 a^E1 + P2 b^E2 + P3 c^E3

Unfortunately, my last statistical analysis class was 20 years ago. What is the easiest way to get a good approximation of f? Open source tools with a very minimal learning curve (i.e. something where I could get a decent approximation in an hour or so) would be ideal. Thanks!

user64258
  • 231
  • 1
  • 2
  • 4

6 Answers6

14

In case it's useful, here's a Numpy/Scipy (Python) template to do what you want:

from numpy import array
from scipy.optimize import leastsq

def __residual(params, y, a, b, c):
    p0, e0, p1, e1, p2, e2 = params
    return p0 * a ** e0 + p1 * b ** e1 + p2 * c ** e2 - y

# load a, b, c
# guess initial values for p0, e0, p1, e1, p2, e2
p_opt = leastsq(__residual,  array([p0, e0, p1, e1, p2, e2]), args=(y, a, b, c))
print 'y = %f a^%f + %f b^%f %f c^%f' % map(float, p_opt)

If you really want to understand what's going on, though, you're going to have to invest the time to scale the learning curve for some tool or programming environment - I really don't think there's any way around that. People don't generally write specialized tools for doing things like 3-term power regressions exclusively.

Brendan
  • 18,771
  • 17
  • 83
  • 114
David Z
  • 128,184
  • 27
  • 255
  • 279
  • 1
    scipy.odr (orthogonal distance regression) could be useful if a, b, c don't have infinite precision (least square assumes infinite precision for coordinates). – jfs Feb 10 '09 at 17:19
  • Surely the function requires some sample output to minimise towards i.e. some sample y values given a set of a, b, c values? – Brendan Jan 23 '10 at 19:34
5

I spent over a week trying to do essentially the same thing. I tried a whole bunch of optimization stuff to fine tune the coefficients with basically no success, then I found out that there is a closed form solution and it works really well.

Disclaimer: I was trying to fit data with a fixed maximum order of magnitude. If there is no limit to your E1, E2, etc values, then this won't work for you.

Now that I've taken the time to learn this stuff, I actually see that some of the answers would have given good hints if I understood them. It had also been a while since my last statistics and linear algebra class.

So if there are other people out there who are lacking the linear algebra knowledge, here's what I did.

Even though this is not a linear function you are trying to fit, it turns out that this is still a linear regression problem. Wikipedia has a really good article on linear regression. I recommend reading it slowly: https://en.wikipedia.org/wiki/Linear_regression#:~:text=In%20statistics%2C%20linear%20regression%20is,as%20dependent%20and%20independent%20variables). It also links a lot of other good related articles.

If you don't know how to do a simple (single variable) linear regression problem using matrices, take some time to learn how to do that.

Once you learn how to do simple linear regression, then try some multivariable linear regression. Basically, to do multi variable linear regression, you create an X matrix where there is a row for each of your input data items and each row contains all of the variable values for that data entry (plus a 1 in the last column which is used for the constant value at the end of your polynomial (called an intercept)). Then you create a Y matrix that is a single column with a row for each data item. Then you solve B = (XTX)-1XTY. B then becomes all of the coefficients for your polynomial.

For multi-variable polynomial regression, its the same idea, just now you have a huge multi-variable linear regression where each regressor (variable you're doing regression on) is a coefficient for your giant polynomial expression.

So if your input data looks like this:

Inputs Output
a1, b1, y1
a2, b2, y2
... ...
aN, bN, yN

And you want to fit a 2nd order polynomial of the form y = c1a^2b^2 + c2a^2b + c3a^2 + c4ab^2 + c5ab + c6a + c7b^2 + c8b + c9, then your X matrix will look like:

a1^2*b1^2 a1^2*b1 a1^2 a1*b1^2 a1*b1 a1 b1^2 b1 1
a2^2*b2^2 a2^2*b2 a2^2 a2*b1^2 a2*b2 a2 b2^2 b2 1
... ... ... ... ... ... ... ... ...
aN^2*bN^2 aN^2*bN aN^2 aN*bN^2 aN*bN aN bN^2 bN 1

Your Y matrix is simply:

y1
y2
...
yN

Then you do B = (XTX)-1XTY and then B will equal

c1
c2
c3
c4
c5
c6
c7
c8
c9

Note that the total number of coefficients will be (o + 1)V where o is the order of the polynomial and V is the number of variables, so it grows pretty quickly.

If you are using good matrix code, then I believe the runtime complexity will be O(((o+1)V)3 + ((o + 1)V)2N), where V is the number of variables, o is the order of the polynomial, and N is the number of data inputs you have. Initially this sounds pretty terrible, but in most cases, o and V are probably not going to be high, so then this just becomes linear with respect to N.

Note that if you are writing your own matrix code, then it is important to make sure that your inversion code uses something like an LU decomposition. If you use a naïve inversion approach (like I did at first) then that ((o+1)V)3 becomes ((o+1)V)!, which was way worse. Before I made that change, I predict that my 5th order 3 variable polynomial would take roughly 400 google millennia to complete. After using LU decomposition, it takes about 7 seconds.

Another disclaimer

This approach requires that (XTX) not be a singular matrix (in other words, it can be inverted). My linear algebra is a little rough so I don't know all of the cases where that would occur, but I know that it occurs when there is perfect multi-collinearity between input variables. That means one variable is just another factor multiplied by a constant (for example, one input is number of hours to complete a project and another is dollars to complete a project, but the dollars are just based on an hourly rate times the number of hours).

The good news is that when there is perfect multi-collinearity, you'll know. You'll end up with a divide by zero or something when you are inverting the matrix.

The bigger problem is when you have imperfect multi-collinearity. That's when you have two closely related but not perfectly related variables (such as temperature and altitude, or speed and mach number). In those cases, this approach still works in theory, but it becomes so sensitive that small floating point errors can cause the result to be WAY off.

In my observations, however, the result is either really good or really bad, so you could just set some threshold on your mean squared error and if its over that, then say "couldn't compute a polynomial".

NateW
  • 908
  • 1
  • 8
  • 28
3

The basics of data fitting involve assuming a general form of a solution, guessing some initial values for constants, and then iterating to minimize the error of the guessed solution to find a specific solution, usually in the least-squares sense.

Look into R or Octave for open source tools. They are both capable of least-squares analysis, with several tutorials just a Google search away.

Edit: Octave code for estimating the coefficients for a 2nd order polynomial

x = 0:0.1:10;
y = 5.*x.^2 + 4.*x + 3;

% Add noise to y data
y = y + randn(size(y))*0.1;

% Estimate coefficients of polynomial
p = polyfit(x,y,2)

On my machine, I get:

ans =

   5.0886   3.9050   2.9577
Scottie T
  • 11,729
  • 10
  • 45
  • 59
  • Thanks, I have...that's why I said "very minimal learning curve"! Those are both excellent general purpose statistical languages, but have a pretty hefty learning curve (IMHO). – user64258 Feb 09 '09 at 18:06
  • I see. I'd think that, with simple functions, it shouldn't take too long to get up to speed with either tool, or even to do this in Python or Perl. – Scottie T Feb 09 '09 at 18:09
  • I'd think that they are relatively simple (I added detail to the question), and I've already spent an hour or so on Google, which is why I've turned here ;-) – user64258 Feb 09 '09 at 18:19
  • Unfortunately, polyfit will only work for single-valued functions f(x). The OP specifically mentioned non-linear multi-dimensional fitting, which I doubt Octave supports out of the box. – Kena Feb 09 '09 at 19:02
  • I don't think you're going to get much simpler than that Octave code (or Numpy/Scipy in Python, which has nearly the same syntax - see http://www.scipy.org ). – David Z Feb 09 '09 at 19:05
  • But it's not multidimensional right? Only fits to one dimensional case. – tkarahan Jun 08 '21 at 07:42
1

Do you know to what power you want to limit your polynomial?

If there is no limit, then you can always get an exact match for N points by matching it to a polynomial that has N coefficients. To do this, you plug N different points into your equation, yielding N equations and N unknowns (the coefficients), which you can then use either simple high school algebra or a matrix to solve for the unknowns.

mbeckish
  • 10,485
  • 5
  • 30
  • 55
  • +1, I have read somewhere that sparse grid data can be used to achieve the same polynomial accuracy with less number of nodes than is required at regular grid data. Do you know how is that possible? – owari Apr 12 '13 at 06:56
0

If you have a guess at the form of f,[*] you need a minimizer to find the optimal parameters. The tools Scottie T suggests would work, as would ROOT, and many others.

If you don't have a clue what form f might take you are in deep trouble indeed.


[*] That is, you know that

f = f(x,y,z,w,...;p1,p2,p3...)

where the ps are parameters and the coordinates are x, y...

Community
  • 1
  • 1
dmckee --- ex-moderator kitten
  • 98,632
  • 24
  • 142
  • 234
0

Short Answer: it isn't so simple. Consider a non-parametric approach on data sub-sets.

There are 2 main issues you need to decide about (1) Do you actually care about the parameters of the function, i.e. your P1, E1, ..., or would you be okay with just estimating the mean function (2) do you really need to estimate the function on all of the data?

The first thing I'll mention is that your specified function is non-linear (in the parameters to be estimated), so ordinary least squares won't work. Let's pretend that you specified a linear function. You'd still have a problem with the 10M values. Linear regression can be performed in an efficient way using QR factorization, but you are still left with an O(p * n^2) algorithm, where p is the number of parameters you are trying to estimate. If you want to estimate the non-linear mean function it gets much worse.

The only way you are going to be able to estimate anything in such a large data set is by using a subset to perform the estimation. Basically, you randomly select a subset and use that to estimate the function.

If you don't care about your parameter values, and just want to estimate the mean function you will probably be better off using a non-parametric estimation technique.

Hopefully this helps.

leif

leif
  • 3,003
  • 1
  • 19
  • 9