If I have some (x,y) data, I can easily draw a straight line through it, e.g.

f=glm(y~x)
plot(x,y)
lines(x,f$fitted.values)

But for curvy data I want a curvy line. It seems loess() can be used:

f=loess(y~x)
plot(x,y)
lines(x,f$fitted)

This question has evolved as I've typed and researched it. I started off wanting a simple function to fit curvy data (where I know nothing about the data), and wanting to understand how to use nls() or optim() to do that; that was what everyone seemed to be suggesting in similar questions I found. But now that I've stumbled upon loess(), I'm happy. So, now my question is: why would someone choose to use nls or optim instead of loess (or smooth.spline)? Using the toolbox analogy, is nls a screwdriver and loess a power screwdriver (meaning I'd almost always choose the latter, as it does the same thing with less effort on my part)? Or is nls a flat-head screwdriver and loess a cross-head screwdriver (meaning loess is a better fit for some problems, but for others it simply won't do the job)?

For reference, here is the play data I was using that loess gives satisfactory results for:

x=1:40
y=(sin(x/5)*3)+runif(x)

And:

x=1:40
y=exp(jitter(x,factor=30)^0.5)

Sadly, it does less well on this:

x=1:400
y=(sin(x/20)*3)+runif(x)

Can nls(), or any other function or library, cope with both this and the previous exp example, without being given a hint (i.e. without being told it is a sine wave)?
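
For comparison, here is a sketch of what nls() looks like when it is given such a hint; the a*sin(x/b)+c form and the starting values below are exactly the kind of hint I mean, and are my guesses rather than anything nls() could discover on its own:

f=nls(y ~ a*sin(x/b)+c, start=list(a=3, b=20, c=0.5)) #the form and the starting values must be supplied
plot(x,y)
lines(x,fitted(f))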

UPDATE: Some useful pages on the same theme on stackoverflow:

Goodness of fit functions in R

How to fit a smooth curve to my data in R?

smooth.spline "out of the box" gives good results on my 1st and 3rd examples, but terrible (it just joins the dots) on the 2nd example. However f=smooth.spline(x,y,spar=0.5) is good on all three.

UPDATE #2: gam() (from the mgcv package) is great so far: it gives a similar result to loess() when that was better, and a similar result to smooth.spline() when that was better. And all without hints or extra parameters. The docs were so far over my head I felt like I was squinting at a plane flying overhead, but a bit of trial and error found:

library(mgcv)
#f=gam(y~x)   #Works just like glm(), i.e. pointless here
f=gam(y~s(x)) #s() marks x as a smooth term; this is what you want
plot(x,y)
lines(x,f$fitted)
Darren Cook
  • A long answer could be written for this. But I can clear one thing up perhaps. You are aware that `loess` has a `span` and `degree` parameter, right? And that these influence the fitted model? Try using `span = 0.1` for your last example data. – joran Sep 26 '11 at 04:25
  • Thanks @joran, that is useful to know. Though having to specify a different span for different equations counts as a hint. – Darren Cook Sep 26 '11 at 07:01

2 Answers

Nonlinear least squares (NLS) is a means of fitting a model that is non-linear in the parameters. By fitting a model, I mean there is some a priori specified form for the relationship between the response and the covariates, with some unknown parameters that are to be estimated. Because the model is non-linear in those parameters, NLS estimates the coefficients by minimising a least-squares criterion in an iterative fashion.
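
As a minimal sketch, suppose we assume an exponential form y = a*exp(b*x); a and b are the unknown parameters, and start supplies the initial guesses for the iteration (the form and the data below are illustrative assumptions):

x=1:40
y=2*exp(0.08*x)+rnorm(x)                      #toy data generated from a known form
f=nls(y ~ a*exp(b*x), start=list(a=1, b=0.1)) #the form is specified a priori
coef(f)                                       #the iteratively estimated parameters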

LOESS was developed as a means of smoothing scatterplots. It has a much less well-defined concept of a "model" that is fitted (IIRC there is no "model"). LOESS works by trying to identify the pattern in the relationship between response and covariates without the user having to specify what form that relationship takes; it works out the relationship from the data themselves.
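
In code terms, you never supply a formula for the shape, only smoothing controls such as span (the value below is illustrative, per the comment on the question; a smaller span gives a more local, wigglier fit):

f=loess(y ~ x, span=0.1)
plot(x,y)
lines(x,fitted(f))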

These are two fundamentally different ideas. If you know the data should follow a particular model, then you should fit that model using NLS. You could always compare the two fits (NLS vs LOESS) to see if there is systematic variation from the presumed model, though that would also show up in the NLS residuals.

Instead of LOESS, you might consider Generalized Additive Models (GAMs), fitted via gam() in the recommended package mgcv. These models can be viewed as a penalised regression problem, but they allow the fitted smooth functions to be estimated from the data as they are in LOESS. GAM extends the GLM to allow smooth, arbitrary functions of covariates.
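
A sketch, reusing the questioner's x and y; s(x) requests a smooth function of x whose wiggliness is estimated from the data rather than specified up front:

library(mgcv)
f=gam(y ~ s(x))
summary(f)         #reports the estimated degrees of freedom of the smooth
plot(x,y)
lines(x,fitted(f))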

Gavin Simpson
  • +1 nice answer as these are really incomparable. The MASS book has a section on smoothers comparing loess to alternatives. – Dirk Eddelbuettel Sep 26 '11 at 12:28
  • @Ben I thought Simon Wood used that description in describing a general form for the GAM - I can't imagine I made that up myself? I guess "arbitrary" in the sense that they could be any sort of smoother. – Gavin Simpson Sep 26 '11 at 14:13
  • Thanks @Gavin for the good explanation of the differences between these functions. If I want to evaluate a model I choose nls or similar; if I want to discover a model I choose gam or similar. – Darren Cook Oct 19 '11 at 03:13
  • @DarrenCook No, for the last bit I would say "...choose LOESS or similar". GAMs really are a formal statistical model that, with modern theory etc., are a useful part of the data analyst's toolbox. GLMs only allow certain forms of relationships. GAMs allow the real relationship *in the data* to be identified whilst remaining in a proper statistical framework. Everything becomes more approximate (i.e. inference) with GAMs than with GLMs, but inference is more approximate with GLMs than in linear models - just the price we have to pay. – Gavin Simpson Oct 19 '11 at 08:05

loess() is non-parametric, meaning you don't get a set of coefficients you can use later; it's not a model, just a fitted line. nls() will give you coefficients you can use to build an equation and predict values with a different but similar data set; you can create a model with nls().
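
A sketch of that difference, assuming a sine form for the question's third example (the form and starting values are the part nls() cannot discover for you):

f=nls(y ~ a*sin(x/b)+c, start=list(a=3, b=20, c=0.5))
coef(f)                                    #reusable parameter estimates
predict(f, newdata=data.frame(x=401:440)) #apply the fitted model to new x values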

Josh