
I was trying out linear regression and get the error "contrasts can be applied only to factors with 2 or more levels", even though all my factor columns have at least two levels.

I tracked the error down to the column that triggers it. This is the summary of that column:

> summary(df[,30])
    0     1  <NA>
31543    14     0

> unique(df[,30])
[1] 0 1
Levels: 0 1 <NA>

I have also removed all rows that contain an NA value, as follows:

df = na.omit(df)

Please note that the <NA> shown above is an additional factor level that I added using the `addNA` function.

How do I resolve this?

EDIT : I have placed a reproducible example at my public share on http://aftabubuntu.cloudapp.net/ . Please download the reproduce.RDS file from here.

This is the code I'm using

df = readRDS('reproduce.RDS')
model = lm(formula = COL_101~.,data=df)
predict.lm(model, df[1:5,])

This is my output

> model = lm(formula = COL_101~.,data=df)
Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) : 
  contrasts can be applied only to factors with 2 or more levels
tubby
  • You'll have to give us something reproducible; there's something else going on beyond what you've described here. I'll put a counterexample as an answer. – Aaron left Stack Overflow May 12 '15 at 17:54
  • @Aaron I understand, let me try to create something you can reproduce – tubby May 12 '15 at 18:25
  • See this question on [how to make a great reproducible example](http://stackoverflow.com/q/5963269/210673) for suggestions. – Aaron left Stack Overflow May 12 '15 at 18:28
  • OK, I peeked even though it was on another site. First, you've got 100 observations and 104 predictors, so this isn't even a sensible thing to do. Secondly, your summary lines aren't on the data set after running `na.omit`; that data set has only 13 observations. Please see the link above for suggestions on how to make a great reproducible example; this is something you would have noticed had you followed that advice. – Aaron left Stack Overflow May 13 '15 at 14:40
  • @PepperBoy , do you remember how you solved this? – Bas Oct 13 '15 at 12:02
  • @Heuer, I think the cause was that one of the factors in my training data had only a few instances of one level, so during 10-fold cross-validation some training sets were selected without any instances of that level, which leaves the factor with just one observed level. You'll have to increase the number of records for the minority factor level, or not do 10-fold CV. Basically, ensure that both factor levels are adequately represented. – tubby Oct 13 '15 at 15:23
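
The situation described in the last comment can be sketched in a few lines. The data frame `d`, the subset `train`, and the 98/2 split are made up for illustration and are not the asker's data. The sketch assumes only that the training subset happens to contain no rows of the rare level:

```r
# Hypothetical data: a two-level factor whose level "1" is rare.
set.seed(1)
d <- data.frame(y = rnorm(100),
                x = factor(rep(c(0, 1), c(98, 2))))

# A training subset that happens to contain no rows with x == "1"
# (the two rare rows are at positions 99 and 100).
train <- d[1:90, ]

# The factor still declares two levels, but only one is observed.
declared <- nlevels(train$x)
observed <- nlevels(droplevels(train$x))

# lm() drops unused levels when it builds the model frame, so the fit
# fails on this subset; the error message is captured instead of stopping.
fit <- tryCatch(lm(y ~ x, data = train),
                error = function(e) conditionMessage(e))
```

Comparing `declared` with `observed` before fitting shows whether a fold has lost a level, which `summary()` on the full data would not reveal.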

1 Answer


This isn't quite an answer, though it might become one if it turns out to demonstrate the issue. I can create data that looks like yours but that works, as follows.

set.seed(5)
df <- data.frame(y=rnorm(100), x=addNA(rep(c(0,1), c(80,20))))
table(df$x)
##   0    1 <NA> 
##  80   20    0 
lm(y~x, data=df)
## Call:
## lm(formula = y ~ x, data = df)
##
## Coefficients:
## (Intercept)           x1  
##    0.007601     0.120172  
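
When data that looks fine still triggers the error, one way to locate the offending column is to count the levels actually observed in each factor, after any row filtering such as `na.omit`. The helper name `single_level_factors` and the toy data frame `d2` below are hypothetical, introduced only for illustration. This is a sketch, not a definitive diagnostic:

```r
# Return the names of factor columns with fewer than two observed levels.
single_level_factors <- function(data) {
  is_bad <- vapply(data, function(col) {
    is.factor(col) && nlevels(droplevels(col)) < 2
  }, logical(1))
  names(data)[is_bad]
}

# Hypothetical data: column `a` declares levels "0" and "1" but only
# "0" occurs; column `b` has two observed levels.
d2 <- data.frame(y = c(1.2, 0.7, 3.1, 2.4),
                 a = factor(c("0", "0", "0", "0"), levels = c("0", "1")),
                 b = factor(c("u", "v", "u", "v")))

bad <- single_level_factors(d2)
```

Running the helper on the exact data frame passed to `lm` (for example, the result of `na.omit`) matters, because unused levels still appear in `summary()` and `levels()` output.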
Aaron left Stack Overflow
  • I have placed a reproducible example in the public share mentioned in my edit. Please see my EDIT above. – tubby May 13 '15 at 04:18
  • For posterity, it's greatly preferred to not rely on data or information posted on other sites. Can you reduce it to the simplest possible case and post it here? – Aaron left Stack Overflow May 13 '15 at 14:33
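
One common way to post a small data set inline, rather than linking to a file, is `dput`, which prints R code that recreates an object. The three-row data frame `small` below is hypothetical and only stands in for a reduced subset of the real data:

```r
# Hypothetical reduced data set standing in for a subset of the real one.
small <- data.frame(y = c(0.5, 1.5, 2.5),
                    x = factor(c("0", "0", "1")))

# dput() writes a text representation that can be pasted into a question.
txt <- paste(capture.output(dput(small)), collapse = "\n")

# Anyone can rebuild the object by evaluating that text.
restored <- eval(parse(text = txt))
```

Because the text carries the factor levels along with the values, readers see the same level structure that the model-fitting code sees.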