I am fitting a lars model to a large data set (75 features) that contains both numeric variables and factors.
I train the model with
mm <- model.matrix(target~0+.,data=data)
larsMod <- lars(mm,data$target,intercept=FALSE)
which gives a good in-sample fit. If I apply it to the test data with
mm.test <- model.matrix(target~0+.,data=test.data)
predict(larsMod,mm.test,type="fit",s=length(larsMod$arc.length))
then I get the error message:
Error in scale.default(newx, object$meanx, FALSE) :
length of 'center' must equal the number of columns of 'x'
I assume that it has to do with the fact that the factor levels differ between the two data sets. However,
which(! colnames(mm.test) %in% colnames(mm) )
gives an empty result while
which(! colnames(mm) %in% colnames(mm.test) )
gives 3 indices. So 3 factor levels appear in the training set but not in the test set. Why does this cause a problem, and how can I solve it?
The code below illustrates this with a toy example. In the test data set, the factor does not have the level "l3".
require(lars)
data.train = data.frame( target = c(0,1,0,1,1,1,1,0,0,0), f1 = rep(c("l1","l2","l1","l2","l3"),2), n1 = rep(c(1,2,3,4,5),2))
test.data = data.frame(f1 = rep(c("l1","l2","l1","l2","l2"),2),n1 = rep(c(7,4,3,4,5),2) )
mm <- model.matrix(target~0+f1+n1,data = data.train)
colnames(mm)
length(colnames(mm))
larsMod <- lars(mm,data.train$target,intercept=FALSE)
mm.test <- model.matrix(~0+f1+n1,data=test.data)
colnames(mm.test)
length( colnames(mm.test) )
which(! colnames(mm.test) %in% colnames(mm) )
which(! colnames(mm) %in% colnames(mm.test) )
predict(larsMod,mm.test,type="fit",s=length(larsMod$arc.length))
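The error message suggests that predict expects as many columns in the new matrix as the model saw during training. A possible workaround, sketched here for the toy data only, is to give the factor in the test data the same level set as in the training data before calling model.matrix, so that the missing level still gets a column of zeros. This is a minimal sketch using base R only; it assumes f1 is the only factor and that the test data contains no levels that are absent from the training data.

```r
# Toy data from the example above.
data.train <- data.frame(
  target = c(0, 1, 0, 1, 1, 1, 1, 0, 0, 0),
  f1 = rep(c("l1", "l2", "l1", "l2", "l3"), 2),
  n1 = rep(c(1, 2, 3, 4, 5), 2)
)
test.data <- data.frame(
  f1 = rep(c("l1", "l2", "l1", "l2", "l2"), 2),
  n1 = rep(c(7, 4, 3, 4, 5), 2)
)

# Fix the level set on the training data, then reuse it for the test data,
# so that levels unused in the test data (here "l3") are still known.
data.train$f1 <- factor(data.train$f1)
test.data$f1 <- factor(test.data$f1, levels = levels(data.train$f1))

mm <- model.matrix(target ~ 0 + f1 + n1, data = data.train)
mm.test <- model.matrix(~ 0 + f1 + n1, data = test.data)

# Check that both design matrices have the same columns in the same order.
identical(colnames(mm), colnames(mm.test))
```

With the columns aligned, mm.test can be passed to predict as before. If the test data did contain a level that is unknown in the training data, factor() would turn those values into NA, so such rows would need separate handling.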