5

I'm not sure whether R can do this (I assume it can, but maybe that's just because I tend to assume that R can do anything :-)). What I need is to find the best-fitting equation to describe a dataset.

For example, if you have these points:

df = data.frame(x = c(1, 5, 10, 25, 50, 100), y = c(100, 75, 50, 40, 30, 25))

How do you get the best fitting equation? I know that you can get the best fitting curve with:

plot(loess(df$y ~ df$x))

But as I understand it, you can't extract the equation; see Loess Fit and Resulting Equation.

When I try to build it myself (note, I'm not a mathematician, so this is probably not the ideal approach :-)), I end up with something like:

y.predicted = 12.71 + ( 95 / (( (1 + df$x) ^ .5 ) / 1.3))

Which kind of seems to approximate it - but I can't help thinking that something more elegant probably exists :-)

I have the feeling that fitting a linear or polynomial model wouldn't work either, because the formula seems different from what those models generally use (i.e. this one seems to need divisions, powers, etc.). For example, the approach in Fitting polynomial model to data in R gives pretty bad approximations.

I remember from a long time ago that there exist languages (Matlab may be one of them?) that do this kind of stuff. Can R do this as well, or am I just at the wrong place?

(Background info: basically, what we need to do is find an equation for determining the numbers in the second column based on the numbers in the first column; but we decide the numbers ourselves. We have an idea of what we want the curve to look like, but we can adjust these numbers to an equation if we get a better fit. It's about the pricing for a product (a cheaper alternative to current expensive software for qualitative data analysis); the more 'project credits' you buy, the cheaper it should become. Rather than forcing people to buy a given number (i.e. 5 or 10 or 25), it would be nicer to have a formula so people can buy exactly what they need - but of course this requires a formula. We have an idea for some prices we think are ok, but now we need to translate this into an equation.)

Matherion
  • I believe you are trying to do it the wrong way around. Normally you look for a model from the science (chemistry, physics, ...) and then you try to fit it. You have to choose a subset of models you want to try, as there is an infinite number of possible models. – Roland Oct 11 '12 at 09:00
  • Thank you for your reaction @Roland! I'm not doing science (that is, not this moment :-)) - I just need an equation to describe a dataset more 'parsimoniously' than by listing all the datapoints. I'll explain a bit more in the question, maybe that helps! – Matherion Oct 11 '12 at 09:32

2 Answers

4

Multiple Linear Regression Example

fit <- lm(y ~ x1 + x2 + x3, data=mydata)

summary(fit) # show results

The code above should give you the linear model that best fits your data using OLS (ordinary least squares).
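
In case it's useful, here is a minimal sketch of how that could look for the data in the question, assuming the transformed predictors (1/x and x^2) that the asker mentions trying in the comments below; the column names div_x and x_sq are just illustrative:

df <- data.frame(x = c(1, 5, 10, 25, 50, 100),
                 y = c(100, 75, 50, 40, 30, 25))
df$div_x <- 1 / df$x  # reciprocal term to capture the steep initial drop
df$x_sq  <- df$x^2    # quadratic term for the gentler curvature at large x

fit <- lm(y ~ x + div_x + x_sq, data = df)  # ordinary least squares fit
summary(fit)                # show the estimated coefficients
predict(fit, newdata = df)  # fitted values to compare against the observed y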

philq
  • Thank you @Philq02! That would be very helpful if I wanted to find the best fit of a linear model. Sadly, I want the best fit in general; and it looks like the best fit contains a division (e.g. a/X + b*X, where a and b would need to be estimated). Hey - this gives me an idea - maybe I can use OLS and provide 1/X as one of the predictors. I'll go try this out immediately and report back :-) Thank you again!!! – Matherion Oct 11 '12 at 09:37
  • I've played around with this (I added `df$div_x = 1/df$x` and `df$x_sq = df$x^2` and then ran `fit <- lm(y ~ x + div_x + x_sq, data=df)`), which gives an ok approximation, so this is definitely an improvement, thank you! I'll keep this open a bit more in case there exist other (better) ways, but again, thank you! – Matherion Oct 11 '12 at 09:46
4

My usual plug: http://creativemachines.cornell.edu/eureqa

But as Roland said, the "best fit in general" has little meaning, since any function can be expressed as a Taylor series. Since a set of data is expected to have noise (i.e. errors) in its values, a big part of curve-fitting is determining what is noise and what isn't.
If you pick some fit function arbitrarily, one thing I can pretty much guarantee is that extrapolated points will diverge in a hurry.
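
If you do want a compact closed-form fit, one possible sketch (not something Eureqa requires; it simply assumes a power law y = a / x^b is an acceptable shape, like the 100/x^.3 equation mentioned in the comments below) is nonlinear least squares with nls(); the starting values are rough guesses read off the data:

df <- data.frame(x = c(1, 5, 10, 25, 50, 100),
                 y = c(100, 75, 50, 40, 30, 25))

fit <- nls(y ~ a / x^b, data = df,
           start = list(a = 100, b = 0.3))  # a is roughly y at x = 1; b guessed from the endpoints
summary(fit)  # estimated a and b with standard errors
predict(fit, newdata = data.frame(x = c(2, 7, 60)))  # interpolate at new x values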

Carl Witthoft
  • Wow @Carl, this is great! This is exactly what I need! Thank you also for the advice. You're right of course. However, as may (or may not :-)) become clear in the extra background information, my goal is interpolation rather than extrapolation. Also, Eureqa allows me to play with different equations so that I can explore extrapolation. So again, thank you! I'm sorry, but because this was my first post, I can't vote your answer up . . . – Matherion Oct 11 '12 at 11:42
  • (in case anybody cares/is interested: when only looking at the pairs for which `x <- c(1, 5, 10, 100);` and `y <- c(100, 75, 50, 25);` and feeding those into Eureqa, one of the equations it generates is `y.predicted <- 100/x^.3;`, which gives quite a decent fit - good enough for my purposes at least. Thanks again everybody! I hope others with similar problems stumble upon this page :-)) – Matherion Oct 11 '12 at 12:16