
I'm trying to write a function that fits a regression on multiple predictors, then predicts new data from the fitted model:

"tnt" <- function(train_dep, train_indep, test_dep, test_indep) 
{
    y <- train_dep
    x <- train_indep
    mod <- lm (y ~ x)
    estimate <- predict(mod, data.frame(x=test_indep))
    rmse <- sqrt(sum((test_dep-estimate)^2)/length(test_dep)) 
    print(summary(mod))
    print(paste("RMSE: ", rmse))        
}

If I call it like this, it fails:

train_dep = vector1
train_indep <- cbind(vector2, vector3)
test_dep = vector4
test_indep <- cbind(vector5, vector6)
tnt(train_dep, train_indep, test_dep, test_indep)

Changing the above to something like the following works, but I want this done dynamically so I can pass it a matrix of any number of columns:

x1 = x[,1]
x2 = x[,2]
mod <- lm(y ~ x1+x2)
estimate <- predict(mod, data.frame(x1=test_indep[,1], x2=test_indep[,2]))

Looks like this could help, but I'm still confused on the rest of the process: http://finzi.psych.upenn.edu/R/Rhelp02a/archive/70843.html

Dolan Antenucci
  • Have you tried `as.formula()`? You can then manipulate the formula using text manipulation until you get it how you want (e.g. have the function you wrote create the formula based on the inputs), and then use as.formula to make it something that `lm` will accept. – Ari B. Friedman Aug 06 '11 at 16:32
  • What you're looking for is `as.formula` in combination with `paste`. – Roman Luštrik Aug 06 '11 at 16:33
  • If you pass arguments into your function as a data.frame (or two data.frames in your case), you could regress using the formula notation. Assuming you have a data.frame with columns y, x1 and x2, you would write `lm(y ~ ., data = your.df)`. See also http://stackoverflow.com/questions/6951090/what-does-the-period-mean-in-the-following-r-excerpt for what the period stands for. – Roman Luštrik Aug 06 '11 at 17:15

2 Answers


Modified using the as.formula suggestion in the comments. Roman's comment above about passing all as one data.frame and using the . notation in formulas is probably the best solution, but I implemented it in paste because you should know how to use paste and as.formula :-).

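tnt <- function(train_dep, train_indep, test_dep, test_indep) {
    # Coerce to data frames so unnamed matrix columns get default names (V1, V2, ...)
    train <- as.data.frame(train_indep)
    test <- as.data.frame(test_indep)
    names(test) <- names(train)
    # Build "train_dep ~ V1 + V2 + ..." as text, then convert it to a formula
    form <- as.formula(paste("train_dep ~", paste(names(train), collapse = " + ")))
    mod <- lm(form, data = train)
    estimate <- predict(mod, newdata = test)
    rmse <- sqrt(sum((test_dep - estimate)^2) / length(test_dep))
    print(summary(mod))
    print(paste("RMSE: ", rmse))
}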
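The formula-building step can be tried on its own, outside the function. This is only a minimal sketch: the predictor names `x1`, `x2`, `x3` and the response name `y` are hypothetical placeholders, not names from the question.

```r
# Hypothetical predictor names; any character vector of column names works
predictors <- c("x1", "x2", "x3")

# Collapse the names into the right-hand side: "x1 + x2 + x3"
rhs <- paste(predictors, collapse = " + ")

# Prepend the response and convert the string into a formula object
form <- as.formula(paste("y ~", rhs))

print(form)
```

Base R's `reformulate(predictors, response = "y")` should build an equivalent formula without the manual `paste` calls.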
Ari B. Friedman
  • I get an error when I use this function: "Error in parse(text = x) : :2:0: unexpected end of input\n1: train_dep ~ \n^" (I'm using R 64-bit v1.40-devel on OS X) – Dolan Antenucci Aug 07 '11 at 00:30
  • @dolan That's because I left out the as.formula. Should work now. – Ari B. Friedman Aug 07 '11 at 07:07
  • I'm getting the same error, but maybe it's an issue with the data I'm using. Anyway, DWin's function is working for me, so no need to debug this any further, but if you're curious as to what I'm using, here is my test call: tnt(c(1, 2, 3), cbind(c(1, 2, 3), c(1, 2, 3)), c(4, 5, 6), cbind(c(4, 5, 6), c(4, 5, 6))). Thanks for your help! – Dolan Antenucci Aug 07 '11 at 18:10
  • Forgot to add in the data frame reference before each variable in the formula. Another reason to go with Roman/Dwin's approach. – Ari B. Friedman Aug 07 '11 at 18:20

Try this instead:

tnt <- function(train_dep, train_indep, test_dep, test_indep) 
{
    # Bind the response and predictors into one data frame; unnamed columns become V2, V3, ...
    dat <- as.data.frame(cbind(y = train_dep, train_indep))
    mod <- lm(y ~ ., data = dat)
    # Give the test predictors the same names the model was fitted with
    newdat <- as.data.frame(test_indep)
    names(newdat) <- names(dat)[2:length(dat)]

    estimate <- predict(mod, newdata = newdat)
    rmse <- sqrt(sum((test_dep - estimate)^2) / length(test_dep))
    print(summary(mod))
    print(paste("RMSE: ", rmse))
}


Call:
lm(formula = y ~ ., data = dat)

Residuals:
1 2 3 
0 0 0 

Coefficients: (1 not defined because of singularities)
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)        0          0      NA       NA    
V2                 1          0     Inf   <2e-16 ***
V3                NA         NA      NA       NA    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Residual standard error: 0 on 1 degrees of freedom
Multiple R-squared:     1,  Adjusted R-squared:     1 
F-statistic:   Inf on 1 and 1 DF,  p-value: < 2.2e-16 

[1] "RMSE:  0"
Warning message:
In predict.lm(mod, newdata = newdat) :
  prediction from a rank-deficient fit may be misleading

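The warning appears because the two predictor columns in your test call are identical, so the fit is rank-deficient (hence the NA coefficient for V3). The zero residuals come from the exact fit you are offering.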
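To see how duplicated predictor columns affect the fit, here is a rough sketch comparing the rank of two models. The simulated data and the column names `a` and `b` are assumptions made up for illustration; only the small `dup` data frame mirrors the test call from the comments.

```r
# Simulated data (an assumption for illustration): two distinct predictors
set.seed(1)
train_indep <- cbind(a = rnorm(20), b = rnorm(20))
train_dep <- 1 + 2 * train_indep[, "a"] - train_indep[, "b"] + rnorm(20, sd = 0.1)

dat <- data.frame(y = train_dep, train_indep)
mod <- lm(y ~ ., data = dat)

# Both slopes are estimable, so the fit has full rank
print(coef(mod))
print(mod$rank)      # 3: intercept plus two slopes

# Identical columns, as in the test call from the comments, reduce the rank
dup <- data.frame(y = c(1, 2, 3), V2 = c(1, 2, 3), V3 = c(1, 2, 3))
dup_mod <- lm(y ~ ., data = dup)
print(dup_mod$rank)  # 2: V3 adds nothing beyond V2
```

With distinct predictors, `predict` should not raise the rank-deficiency warning.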

IRTFM
  • I get an error when I use this function: "Error in eval(expr, envir, enclos) : object 'V2' not found" (I'm using R 64-bit v1.40-devel on OS X) – Dolan Antenucci Aug 07 '11 at 00:31
  • Same OS as I am using. I'm guessing that you are giving a dataframe to the argument of test_indep that has a different number of columns than the independent variables on the RHS of the model. If 'train_indep' has 2 columns, then 'test_indep' needs to have 2 columns as well. Why not post the results of str() on both of the indep arguments? – IRTFM Aug 07 '11 at 01:51
  • This is my test code: tnt(c(1, 2, 3), cbind(c(1, 2, 3), c(1, 2, 3)), c(4, 5, 6), cbind(c(4, 5, 6), c(4, 5, 6))) – Dolan Antenucci Aug 07 '11 at 03:41
  • OK. I modified the code to deal with matrices without any column names. – IRTFM Aug 07 '11 at 03:49