
There are some related questions that I've come across (like this, this, this, and this), but they all deal with fitting data to a known curve. Is there a way to fit given data to an unknown curve? By which I mean: given some data, the algorithm will give me a fit which is one function or a sum of functions. I'm programming in C, but I'm at a complete loss on how to use the GSL package to do this. I'm open to using anything that can (ideally) be piped through C. Any help on what direction I should look in will be greatly appreciated.

EDIT: This is basically experimental (physics) data that I've collected, so the data will have some trend modified by additive, Gaussian-distributed noise. In general the trend will be non-linear, so I guess that a linear regression fitting method will be unsuitable. As for the ordering, the data is time-ordered, so the curve necessarily has to be fit in that order.

Kitchi
  • You might like to read the [article about curve fitting on Wikipedia](http://en.wikipedia.org/wiki/Curve_fitting). – pmg Jan 01 '13 at 13:25
  • *"By which I mean, given some data the algorithm will give me a fit which is one function or a sum of functions"*: Then your curve is known to be a sum of certain functions. I guess the [Taylor series](http://en.wikipedia.org/wiki/Taylor_series) and the [Fourier series](http://en.wikipedia.org/wiki/Fourier_series) may be what you're looking for, because they allow the representation of a function as an infinite sum of terms. – Thomas C. G. de Vilhena Jan 01 '13 at 13:46
  • This is a bit underspecified. There are lots of curve-fitting methods (regression, Taylor series, Fourier, spline, etc.), but they're each tailored to particular applications. Without knowing something about your data (i.e. what generated the underlying trend in the first place), it's a little hard to give a specific answer. – Oliver Charlesworth Jan 01 '13 at 15:45
  • Do you know the ordering (a.k.a. parametrization) of the data, or is it also subject to change based on the curve fit? I.e. if your points are ordered as {Pt1, Pt2, Pt3}, do you want them to end up in the same order when projected onto the curve, or might they end up being something like {Pt2, Pt1, Pt3}? If the ordering is fixed, @amit's answer below seems reasonable. Otherwise, there are more complicated algorithms that both order the data set based on some error and provide a fit in the meantime. Please elaborate on this in your question. – meyumer Jan 01 '13 at 21:10

3 Answers


You might be looking for polynomial interpolation, in the field of numerical analysis.

In polynomial interpolation, given a set of points (x,y), you are trying to find the polynomial that best fits these points. One way to do it is Newton interpolation, which is fairly easy to program.

The field of numerical analysis, and interpolation in particular, is widely studied, and you might be able to get a nice upper bound on the error of the polynomial.

Note, however, that because you are looking for a polynomial that best fits your data, and the underlying function is not really a polynomial - the error blows up quickly once you get far from your initial training set.


Also note that your data set is finite, and there is an infinite number (in fact, an uncountable infinity) of functions that can fit the data (exactly or approximately) - so which one of these is best might be specific to what you are actually trying to achieve.

If you are looking for a model to fit your data, note that linear regression and polynomial interpolation are at opposite ends of the scale: polynomial interpolation is likely to overfit the data, while linear regression is likely to underfit it. What exactly should be used is case specific and varies from one application to the other.


Simple polynomial interpolation example:

Let's say we have (0,1),(1,2),(3,10) as our data.

The table1 we get using Newton's method is:

0 | 1  |                |
1 | 2  | (2-1)/(1-0)=1  |
3 | 10 | (10-2)/(3-1)=4 | (4-1)/(3-0)=1

Now, the polynomial we get is the "diagonal" that ends with the last element:

1 + 1*(x-0) + 1*(x-0)(x-1) = 1 + x + x^2 - x = x^2 + 1

(and that is a perfect fit indeed to the data we used)


(1) The table is created recursively: the first two columns are the x,y values, and each next column is based on the prior one. It is really easy to implement once you get it; the full explanation is in the Wikipedia page for Newton interpolation.
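
For reference, here is a minimal C sketch of Newton's divided-difference interpolation, using the data from the worked example above. The function names (`newton_coeffs`, `newton_eval`) are my own for this illustration; the code needs nothing beyond the standard library.

```c
#include <stdio.h>

#define N 3  /* number of data points */

/* Fill coef[] with the Newton divided differences: coef[i] = f[x_0,...,x_i].
 * Each pass computes one column of the table, overwriting the previous one. */
static void newton_coeffs(const double x[], const double y[], double coef[], int n)
{
    for (int i = 0; i < n; i++)
        coef[i] = y[i];
    for (int j = 1; j < n; j++)
        for (int i = n - 1; i >= j; i--)
            coef[i] = (coef[i] - coef[i - 1]) / (x[i] - x[i - j]);
}

/* Evaluate the Newton-form polynomial at t with Horner-style nesting. */
static double newton_eval(const double x[], const double coef[], int n, double t)
{
    double p = coef[n - 1];
    for (int i = n - 2; i >= 0; i--)
        p = p * (t - x[i]) + coef[i];
    return p;
}

int main(void)
{
    /* The data from the example above: (0,1), (1,2), (3,10). */
    double x[N] = {0.0, 1.0, 3.0};
    double y[N] = {1.0, 2.0, 10.0};
    double coef[N];

    newton_coeffs(x, y, coef, N);
    printf("p(2) = %g\n", newton_eval(x, coef, N, 2.0));  /* prints 5, i.e. 2^2 + 1 */
    return 0;
}
```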

amit
  • Downvoter: Please comment. It might not be a perfect fit for anything - but there is no silver bullet, and I tried to explain the drawbacks of this approach as well in the answer. – amit Jan 01 '13 at 14:26

You might want to use (Fast) Fourier Transforms to convert the data to the frequency domain.

With the result of the transform (a set of amplitudes, phases and frequencies), even the most twisted set of data can be represented by several functions (harmonics) of the form:

r * cos(f * t - p)

where r is the harmonic amplitude, f is the frequency and p the phase.

Finally, the unknown data curve is the sum of all harmonics.

I have done this in R (you have some examples of it), but I believe C has enough tools to manage it. It is also possible to pipe C and R, but I don't know much about it. This might be of help.

This method is really good for large chunks of data because of its complexity:

1) decomposing the data with a Fast Fourier Transform (FFT) = O(n log n)

2) building the function from the resulting components = O(n)
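
To make the decomposition step concrete, here is a sketch in C. For readability it uses a naive O(n^2) DFT rather than a real FFT - in practice you would call an FFT library (e.g. FFTW, or GSL's `gsl_fft_real` routines) to get the O(n log n) cost above. The toy signal is made up for the example.

```c
#include <stdio.h>
#include <math.h>

#ifndef M_PI
#define M_PI 3.14159265358979323846
#endif

#define N 8  /* number of evenly spaced samples */

int main(void)
{
    /* Toy data: 3 + 2*cos(2*pi*t/8 - 0.5), sampled at t = 0..7. */
    double data[N];
    for (int t = 0; t < N; t++)
        data[t] = 3.0 + 2.0 * cos(2.0 * M_PI * t / N - 0.5);

    /* For each harmonic k, recover r and p in r*cos(2*pi*k*t/N - p). */
    for (int k = 0; k <= N / 2; k++) {
        double re = 0.0, im = 0.0;
        for (int t = 0; t < N; t++) {
            re += data[t] * cos(2.0 * M_PI * k * t / N);
            im -= data[t] * sin(2.0 * M_PI * k * t / N);
        }
        /* Scale so r matches the amplitude of the cosine term. */
        double scale = (k == 0 || k == N / 2) ? N : N / 2.0;
        double r = sqrt(re * re + im * im) / scale;
        double p = atan2(-im, re);
        printf("harmonic %d: r = %.3f, p = %.3f\n", k, r, p);
    }
    return 0;
}
```

Summing the recovered harmonics reproduces the original curve; here the output shows r = 3 at k = 0 (the mean) and r = 2, p = 0.5 at k = 1, as expected.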

Helio Santos
  • But wouldn't this be more computationally expensive, since I have to Fourier transform _then_ figure out the components? Although to me it does sound quite foolproof for an arbitrary curve... – Kitchi Jan 01 '13 at 17:57
  • @Kitchi Fast Fourier Transforms (FFT) are much faster than the classical discrete Fourier transform (http://blogs.mathworks.com/steve/2012/05/01/the-dft-matrix-and-computation-time/). Figuring out the components is not a problem, since it has a complexity, I think, of O(n). – Helio Santos Jan 01 '13 at 19:11
  • Some FFT implementations have a complexity of O(n log n), meaning they have a small computational cost. This means that they can easily manage large chunks of data, while methods using polynomial interpolation, at O(n^2), can't. – Helio Santos Jan 01 '13 at 19:41

Another alternative is using linear regression, but in multiple dimensions.

The trick here is to artificially generate extra dimensions. You can do so by simply applying some functions to the original data set. A common usage is generating polynomials to match the data, so here the functions you apply are f_i(x) = x^i for i = 2,...,k (where k is the degree of the polynomial you want to get).

For example, for the data set (0,2),(2,3) with k = 3 you will get 2 extra dimensions, and your data set will be: (0,2,0,0),(2,3,4,8) - the extra coordinates being x^2 and x^3.

The linear-regression algorithm will find the values a_0,a_1,...,a_k for the polynomial p(x) = a_0 + a_1*x + ... + a_k*x^k that minimize the error at each point in the data compared to the model's prediction (the value of p(x)).

Now, the problem is that as you increase the dimension, you move from underfitting (1-dimensional linear regression) to overfitting (when k == n, you effectively get polynomial interpolation).

To "chose" what is the best k value - you can use cross-validation, and chose the k that minimized the error according to your cross-validation.

Note that this process can be fully automated: all you need is to iteratively check all k values in the desired range1 and choose the model whose k minimizes the cross-validation error.


(1) The range could be [1,n] - though that will probably be way too time consuming. I'd go for [1,sqrt(n)] or even [1,log(n)] - but it is just a hunch.
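
Since the question mentions GSL: below is a minimal sketch of this approach using GSL's `gsl_multifit_linear`, with a fixed k for brevity. The sample data are made up for illustration; a full solution would wrap this fit in the cross-validation loop over k described above.

```c
#include <stdio.h>
#include <gsl/gsl_multifit.h>

/* Degree-k polynomial least squares via multi-dimensional linear regression:
 * the "extra dimensions" are the columns 1, x, x^2, ..., x^k of the design
 * matrix. Build with: gcc fit.c -lgsl -lgslcblas -lm */
int main(void)
{
    const double xs[] = {0.0, 1.0, 2.0, 3.0, 4.0};   /* toy samples */
    const double ys[] = {1.1, 1.9, 5.2, 9.8, 17.1};
    const size_t n = 5, k = 2;                        /* n points, degree k */

    gsl_matrix *X = gsl_matrix_alloc(n, k + 1);
    gsl_vector *y = gsl_vector_alloc(n);
    gsl_vector *c = gsl_vector_alloc(k + 1);
    gsl_matrix *cov = gsl_matrix_alloc(k + 1, k + 1);
    gsl_multifit_linear_workspace *w = gsl_multifit_linear_alloc(n, k + 1);
    double chisq;

    /* Row i of the design matrix is (1, x_i, x_i^2, ..., x_i^k). */
    for (size_t i = 0; i < n; i++) {
        double p = 1.0;
        for (size_t j = 0; j <= k; j++) {
            gsl_matrix_set(X, i, j, p);
            p *= xs[i];
        }
        gsl_vector_set(y, i, ys[i]);
    }

    gsl_multifit_linear(X, y, c, cov, &chisq, w);    /* solves for a_0..a_k */

    for (size_t j = 0; j <= k; j++)
        printf("a_%zu = %g\n", j, gsl_vector_get(c, j));
    printf("chi^2 = %g\n", chisq);

    gsl_multifit_linear_free(w);
    gsl_matrix_free(X);
    gsl_matrix_free(cov);
    gsl_vector_free(y);
    gsl_vector_free(c);
    return 0;
}
```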

amit