0

I've created a function which fits polynomial regression models with increasing degree upto the input degree. I also collect all such models in a list.

After executing this function for a given set of inputs, I want to inspect the model list to calculate the MSE. However I see that the individual models refer to parameter names within the function.

Question: How do I make the glm objects refer to actual variables

Function definition:

poly.iter = function(dep,indep,dat,deg){ #Function to iterate through polynomial fits upto input degree
  set.seed(1)
  par(mfrow=c(ceiling(sqrt(deg)),ceiling(sqrt(deg)))) #partitioning the plotting window
  MSE.CV = rep(0,deg)
  modlist = list()
  xvar = seq(from=min(indep),to=max(indep),length.out = nrow(dat))
  for (i in 1:deg){
    mod = glm(dep~poly(indep,i),data=dat)
    #MSE.CV[i] = cv.glm(dat,mod,K=10)$delta[2] #Inside of this function, cv.glm is generating warnings. Googling has not helped as it can typically happen with missing obs but we don't have any in Auto data
    modlist = c(modlist,list(mod))
    MSE.CV[i] = mean(mod$residuals^2) #GLM part is giving 5x the error i.e. delta is 5x of MSE. Not sure why

    plot(jitter(indep),jitter(dep),cex=0.5,col="darkgrey")
    preds = predict(mod,newdata=list(indep=xvar),se=T)
    lines(xvar,preds$fit,col="blue",lwd=2)
    matlines(xvar,cbind(preds$fit+2*preds$se.fit,preds$fit-2*preds$se.fit),lty=3,col="blue")
  }
  return(list("models"=modlist,"errors"=MSE.CV))
}

Function Call:

output.mpg.disp = poly.iter(mpg,displacement,Auto,9)

Inspecting 3rd degree model:

> output.mpg.disp[[1]][[3]]

Call:  glm(formula = dep ~ poly(indep, i), data = dat)

Coefficients:
    (Intercept)  poly(indep, i)1  poly(indep, i)2  poly(indep, i)3  
         23.446         -124.258           31.090           -4.466  

Degrees of Freedom: 391 Total (i.e. Null);  388 Residual
Null Deviance:      23820 
Residual Deviance: 7392     AIC: 2274

Now I can't use this object inside cv.glm with 'Auto' dataset as it will not recognize indep, dep and i

Sammy25
  • 45
  • 1
  • 7
  • Well, your function doesn't work as written. If you define your function in in a fresh R session, `poly.iter(mpg,displacement,Auto,9)` will get you `Error in seq(from = min(indep), to = max(indep), length.out = nrow(dat)) : object 'displacement' not found`, unless you `attach` the data (which is a Bad Idea). You have to do a little work to accept unquoted column names as arguments. – Gregor Thomas Jan 16 '18 at 22:05
  • 2
    The easiest way would probably be to have the user input a `formula` like any other modeling function: `poly.iter(mpg ~ displacement, data = Auto, deg =9)`. Then you can edit the formula as needed and pass a proper formula to your internal `glm` call. – Gregor Thomas Jan 16 '18 at 22:07
  • I don't know if you're looking for input on the idea behind this, but it seems highly questionable. I'm reluctant to use quadratic terms, and for higher order polynomials I wouldn't touch them unless I had a nice theoretical equation justifying the choice. Just use a cubic spline or a GAM - it won't be any less interpretable, and it will be more flexible with good theoretical backing. – Gregor Thomas Jan 16 '18 at 22:10
  • Not sure if it's good SO etiquette to have a conversation in comments, so let me know. Otherwise: What's the problem with attaching a dataset I'm just getting started with ISLR, so doing this as part of exercise. But I'm intrigued by the idea that polynomial has less backing than GAM. Any references? – Sammy25 Jan 17 '18 at 02:14
  • It's not great to have a conversation in comments, but whatever. As far as `attach` being bad, see [this SO question](https://stackoverflow.com/q/10067680/903061), this [blog entry](http://sas-and-r.blogspot.com/2011/05/to-attach-or-not-attach-that-is.html?utm_source=feedburner&utm_medium=feed&utm_campaign=Feed%3A+SASandR+%28SAS+and+R%29), `fortunes::fortune("attach")`, the [Google Style Guide for R](https://google.github.io/styleguide/Rguide.xml#attach), etc.... – Gregor Thomas Jan 17 '18 at 04:11
  • For poly vs splines, I think [this stats.stackexchange question is good](https://stats.stackexchange.com/q/49052/7515), read both answers and comments. I think Frank Harrell's books deal with this, see *Regression Modeling Strategies*. ISLR has it in Section 7.4.5. If you graduate to ESL, they hardly mention polynomial fits (except for the local polynomial regression). Overall, I think high-degree polynomials have the similar advantages and disadvantages to splines, but the advantages aren't quite as good and the disadvantages (extrapolation, bad boundary behavior) are worse. – Gregor Thomas Jan 17 '18 at 04:24

1 Answers1

1

You can use the as.formula() function to transform a string with your formula before calling glm(). This will solve your question (How do I make the glm objects refer to actual variables), but I'm not sure if it is enough for the calling cv.glm later (I couldn't reproduce your code here, without errors). To be clear, you replace the line

mod = glm(dep~poly(indep,i),data=dat)

with something like:

myexp = paste0(dep, "~ poly(", indep, ",", i, ")")

mod = glm(as.formula(myexp), data=dat)

it's required then to make the variables dep and indep to be characters with names of the variables that you want to refer to (e.g. indep="displ").

Community
  • 1
  • 1
IBrum
  • 345
  • 1
  • 9
  • Thank you. While your answer does solve the problem at hand, I was looking for something more elegant that R may have to offer as in making some basic change to how the function is called. – Sammy25 Jan 17 '18 at 00:35
  • you can also try this approach: [fix model call in model using as.formula](https://stackoverflow.com/a/35804514/7774591) – IBrum Jan 17 '18 at 14:47