variable lengths differ in R

Question

I am getting the error above when trying to use the cv.lm function. Please see my code:

sample<-read.csv("UU2_1_lung_cancer.csv",header=TRUE,sep=",",na.string="NA")
  sample1<-sample[2:2000,3:131]
  samplex<-sample[2:50,3:131]
  y<-as.numeric(sample1[1,]) 
  y<-as.numeric(sample1[2:50,2]) 
  x1<-as.numeric(sample1[2:50,3])
  x2<-as.numeric(sample1[2:50,4])
  x11<-x1[!is.na(y)]
  x12<-x2[!is.na(y)]
  y<-y[!is.na(y)]
  fit1 <- lm(y ~ x11 + x12, data=sample)
  fit1
  x3<-as.numeric(sample1[2:50,5])
  x4<-as.numeric(sample1[2:50,6])
  x13<-x3[!is.na(y)]
  x14<-x4[!is.na(y)]
  fit2 <- lm(y ~ x11 + x12 + x13 + x14, data=sample)
  anova(fit1,fit2)
  install.packages("DAAG")
  library("DAAG")
  cv.lm(df=samplex, fit1, m=10) # 10-fold cross-validation

Any insight will be appreciated.

Example of data
ID       peak height     LCA001 LCA002  LCA003
N001786 32391.111   0.397   0.229   -0.281
N005356 32341.473   0.397   -0.655  -1.301
N002416 32215.474   -0.703  -0.214  -0.901
GS239   31949.777   0.354   0.118   0.272
N016343 31698.853   0.226   0.04    -0.006
N003255 31604.978   0.024    NA -0.534
N004358 31356.597   -0.252  -0.022  -0.407
N000122 31168.09    -0.487  -0.533  -0.134
GS10564 31106.103   -0.156  -0.141  -1.17
GS17987 31043.876    NA     0.253   0.553
N003674 30876.207   0.109   0.093   0.07

Please see the example of the data above

post a sample of your data needed to run this code, and at what point do you get the error — rawr, Apr 20 '14 at 14:42
ID peak height LCA001 LCA002 LCA003 N001786 32391.111 0.397 0.229 -0.281 N005356 32341.473 0.397 -0.655 -1.301 N002416 32215.474 -0.703 -0.214 -0.901 GS239 31949.777 0.354 0.118 0.272 N016343 31698.853 0.226 0.04 -0.006 N003255 31604.978 0.024 -0.17 -0.534 N004358 31356.597 -0.252 -0.022 -0.407 N000122 31168.09 -0.487 -0.533 -0.134 GS10564 31106.103 -0.156 -0.141 -1.17 GS17987 31043.876 0.253 0.553 N003674 30876.207 0.109 0.093 0.07 — user3424320, Apr 20 '14 at 18:37

jlhoward · Accepted Answer · 2014-04-20T19:17:57.637

First, you are using lm(...) incorrectly, or at least in a very unconventional way. The purpose of the data=sample argument is to let the formula refer to columns of sample by name. Using free-standing vectors in the formula instead is generally very bad practice.

So try this:

## not tested...
sample <- read.csv(...)
colnames(sample)[2:6] <- c("y","x1","x2","x3","x4")
fit1 <- lm(y~x1+x2, data=sample[2:50,],na.action=na.omit)
library(DAAG)
cv.lm(df=na.omit(sample[2:50,]),fit1,m=10)

This gives columns 2:6 the appropriate names and then uses those names in the formula. The argument na.action=na.omit tells lm(...) to exclude every row that has an NA in any of the relevant columns. This is actually the default, so it is not needed here, but it is included for clarity.
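As a small check of the NA handling described above, here is a minimal sketch on made-up data. The data frame `toy` and its values are hypothetical and not taken from the original file, and it assumes the session's `na.action` option is still at its default of `na.omit`.

```r
# Hypothetical toy data; names and values are illustrative only.
toy <- data.frame(
  y  = c(1.2, 2.3, NA, 4.1, 5.0, 6.2),
  x1 = c(0.5, NA, 1.5, 2.0, 2.5, 3.1),
  x2 = c(1.0, 0.8, 0.6, 0.9, 0.2, 0.4)
)

# Fit once relying on the default NA handling, once stating it explicitly.
fit_default  <- lm(y ~ x1 + x2, data = toy)
fit_explicit <- lm(y ~ x1 + x2, data = toy, na.action = na.omit)

# Both fits use only the rows that are complete in y, x1 and x2.
nobs(fit_default)
nrow(na.omit(toy))
all.equal(coef(fit_default), coef(fit_explicit))
```

Note that na.omit() applied to a whole data frame drops a row if any column contains an NA, not just the columns in the formula. On a wide data frame it may therefore be worth selecting the model columns first, so that rows are not discarded unnecessarily.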

Finally, cv.lm(...) uses its second argument to find the formula definition, so in your code:

cv.lm(df=samplex, fit1, m=10)

is equivalent to:

cv.lm(df=samplex,y~x11+x12,m=10)

Since there are (presumably) no columns named x11 and x12 in samplex, and since you define these vectors externally, cv.lm(...) throws the error you are getting.
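The same message can be reproduced in base R, without DAAG, whenever a formula mixes a data frame column with a free-standing vector of a different length. The sketch below uses made-up data (`dat` and `x_ext` are hypothetical names). It only illustrates the mismatch and is not a trace through cv.lm(...) itself.

```r
# Hypothetical toy data; names and values are illustrative only.
dat <- data.frame(
  y  = c(1.1, 2.0, 2.9, 4.2, 5.1, 5.8),
  x1 = c(0.3, 1.1, 1.9, 3.2, 3.8, 5.1)
)
x_ext <- c(0.5, 1.0, 1.5, 2.0)  # free-standing vector: 4 values, not 6

# 'y' is found in 'dat' (6 values); 'x_ext' is not a column of 'dat',
# so it is looked up in the calling environment (4 values).
msg <- tryCatch(
  {
    lm(y ~ x_ext, data = dat)
    "no error"
  },
  error = function(e) conditionMessage(e)
)
msg  # the captured error message

# Referring only to columns of the data frame avoids the mismatch.
fit_ok <- lm(y ~ x1, data = dat)
```

Keeping every variable of the formula inside the data frame passed as `data` (or `df`) means any row subsetting is applied to all variables together.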

edited Apr 20 '14 at 19:17

answered Apr 20 '14 at 15:26

jlhoward

In this equation, how does one handle the NAs? – user3424320 Apr 20 '14 at 18:39
Good point. Evidently, you have to omit NAs explicitly from the `df` argument. See my edits. This seems to work on the sample you provided. BTW: it is *much* better to post the data as an edit to the question, rather than in a comment. Type `dput(mydata)` and paste the output into your question. – jlhoward Apr 20 '14 at 19:17
@jlhowards: I cannot see your edits. Can you please resend what you tested? – user3424320 Apr 20 '14 at 19:26
Thank you, my PC was acting up. I got it. Thanks. – user3424320 Apr 20 '14 at 19:30
