I'm trying to fit a model using loess, and I'm getting errors such as "pseudoinverse used at 3", "neighborhood radius 1", and "reciprocal condition number 0". Here's a MWE:

x = 1:19
y = c(NA,71.5,53.1,53.9,55.9,54.9,60.5,NA,NA,NA
      ,NA,NA,178.0,180.9,180.9,NA,NA,192.5,194.7)
fit = loess(formula = y ~ x,
        control = loess.control(surface = "direct"),
        span = 0.3, degree = 1)
x2 = seq(0,20,.1)
library(ggplot2)
qplot(x=x2
    ,y=predict(fit, newdata=data.frame(x=x2))
    ,geom="line")

I realize I can fix these errors by choosing a larger span value. However, I'm trying to automate this fit, as I have about 100,000 time series (each of length about 20) similar to this. Is there a way that I can automatically choose a span value that will prevent these errors while still providing a fairly flexible fit to the data? Or, can anyone explain what these errors mean? I did a bit of poking around in the loess() and simpleLoess() functions, but I gave up at the point when C code was called.

random_forest_fanatic
  • You may find this post useful: https://stat.ethz.ch/pipermail/r-help/2005-November/082853.html. You can compute AIC of loess fits with several different spans, and choose the span with minimum AIC. – bdemarest Dec 17 '14 at 18:14
  • @bdemarest Thanks for that link! However, I'm trying to figure out a way to "mathematically" choose span instead of via AIC/cross-validation/etc. It's too computationally expensive for my scenario to run each fit multiple times. – random_forest_fanatic Dec 17 '14 at 18:17
  • Please let me know what solution you end up using. My own efforts have led me to believe that closed-form solutions to loess optimization problems just aren't possible, but I would love to learn a better/faster way of choosing span. – bdemarest Dec 17 '14 at 18:31

1 Answer

Compare fit$fitted to y. You'll notice that something is wrong with your regression: the fitted values reproduce the observed data exactly. Choose an adequate bandwidth; otherwise the fit will just interpolate the data. With too few data points in a small neighborhood, a local linear function behaves like a constant and triggers collinearity. That is why you see warnings about pseudoinverses and singularities. You won't see such warnings if you use degree=0 or ksmooth. One sensible, data-driven way to choose span is cross-validation, which you can ask about at Cross Validated.

> fit$fitted
 [1]  71.5  53.1  53.9  55.9  54.9  60.5 178.0 180.9 180.9 192.5 194.7
> y
 [1]    NA  71.5  53.1  53.9  55.9  54.9  60.5    NA    NA    NA    NA    NA 178.0
[14] 180.9 180.9    NA    NA 192.5 194.7
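
One way to see where the singular local fits come from is to count, for each target x, how many observations get a strictly positive weight. The sketch below is not loess's internal code. It assumes the neighbourhood holds floor(n * span) of the non-missing points and that the weights are tricube, as described in ?loess, so a point sitting exactly on the neighbourhood radius gets weight zero. The names xo, q, and count_positive are illustrative only.

```r
x <- 1:19
y <- c(NA, 71.5, 53.1, 53.9, 55.9, 54.9, 60.5, NA, NA, NA,
       NA, NA, 178.0, 180.9, 180.9, NA, NA, 192.5, 194.7)

xo   <- x[!is.na(y)]            # x values that actually enter the fit (11 points)
span <- 0.3
q    <- floor(length(xo) * span)  # assumed neighbourhood size: 3 points

# Number of observations with strictly positive tricube weight at target x0.
# The tricube weight (1 - (d/h)^3)^3 is positive only when d < h,
# where h is the distance to the q-th nearest observation.
count_positive <- function(x0, xs, q) {
  d <- abs(xs - x0)
  h <- sort(d)[q]
  sum(d < h)
}

n_pos   <- sapply(xo, count_positive, xs = xo, q = q)
n_at_16 <- count_positive(16, xo, q)

data.frame(x = xo, positive_weights = n_pos)
```

Under these assumptions, x = 3 has radius 1 and only itself inside it, which lines up with "pseudoinverse used at 3" and "neighborhood radius 1". At x = 16 the three nearest points are 15, 14, and 18, but 14 and 18 both sit on the radius of 2, so only x = 15 carries weight, and a line cannot be fitted through one point.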

You see an over-fit (a perfect fit) because the number of parameters in your model is as large as the effective sample size.

fit
#Call:
#loess(formula = y ~ x, span = 0.3, degree = 1, control = loess.control(surface = "direct"))

#Number of Observations: 11 
#Equivalent Number of Parameters: 11 
#Residual Standard Error: Inf 
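
The degree = 0 alternative mentioned above can be tried with the same data and span. This is only a minimal sketch of the call: fit0 and pred0 are illustrative names, and a local constant fit at this span may still follow the data very closely, so it addresses the singularity rather than the choice of span.

```r
x <- 1:19
y <- c(NA, 71.5, 53.1, 53.9, 55.9, 54.9, 60.5, NA, NA, NA,
       NA, NA, 178.0, 180.9, 180.9, NA, NA, 192.5, 194.7)

# Local constant fit (degree = 0): needs only one positively weighted
# point per neighbourhood, unlike the local linear fit (degree = 1).
fit0 <- loess(y ~ x, span = 0.3, degree = 0,
              control = loess.control(surface = "direct"))

x2    <- seq(0, 20, 0.1)
pred0 <- predict(fit0, newdata = data.frame(x = x2))
```

Comparing pred0 with the degree = 1 predictions on the same grid shows how much flexibility is given up.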

Or you might as well just use geom_smooth with its automatic defaults. (Again, setting geom_smooth(span=0.3) throws warnings.)

ggplot(data=data.frame(x, y), aes(x, y)) + 
  geom_point() + geom_smooth()

Khashaa
  • I'm looking at fit$fitted vs y, and I don't see the problem you're referring to (unless you mean that I don't have any y-values in [80,160]). I thought that a span of 0.3 would mean max distance=(19-1)*.3=5.4. So, when estimating y at x=16, for example, wouldn't the function use the observations where x=14, 15, and 18? How does that model have collinearity (since (1,1,1) is independent of (14,15,18))? – random_forest_fanatic Dec 17 '14 at 17:55
  • Oh, and cross-validation is a great idea! But, it's too computationally expensive for my purposes because I have to fit this model to many time series. – random_forest_fanatic Dec 17 '14 at 17:57
  • Sorry, I should have posted this as a comment, as it is not very constructive about how to tune the `span` parameter. I think my remark about collinearity is not completely misplaced, because any interval where only one observation is available necessarily creates collinearity. The 5 consecutive NAs in `y` suggest at least 2 such points. – Khashaa Dec 17 '14 at 19:41
  • I agree that collinearity becomes a problem when you only have one observation (or, only 1 unique x value). But, I'm confused about what's happening in the loess model at x=16. There are three points close enough to be used in the fit, yet it seems to lead to a numerical issue. Any idea what's happening at that point? – random_forest_fanatic Dec 17 '14 at 19:50