60

When I try to define my linear model in R as follows:

lm1 <- lm(predictorvariable ~ x1+x2+x3, data=dataframe.df)

I get the following error message:

Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) : 
contrasts can be applied only to factors with 2 or more levels 

Is there any way to ignore this or fix it? Some of the variables are factors and some are not.

Jilber Urbina
REnthusiast
  • I got this error when attempting to build a linear model for (price ~ year) when year was categorical rather than numeric. – duhaime Mar 14 '19 at 15:58

9 Answers

79

This error occurs when an independent (right-hand-side) variable is a factor or a character vector that takes only one value.

Example: iris data in R

(model1 <- lm(Sepal.Length ~ Sepal.Width + Species, data=iris))

# Call:
# lm(formula = Sepal.Length ~ Sepal.Width + Species, data = iris)

# Coefficients:
#       (Intercept)        Sepal.Width  Speciesversicolor   Speciesvirginica  
#            2.2514             0.8036             1.4587             1.9468  

Now, if your data consists of only one species:

(model1 <- lm(Sepal.Length ~ Sepal.Width + Species,
              data=iris[iris$Species == "setosa", ]))
# Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) : 
#   contrasts can be applied only to factors with 2 or more levels

If the variable is numeric (Sepal.Width) but takes only a single value, say 3, then the model runs, but the coefficient of that variable is NA:

(model2 <-lm(Sepal.Length ~ Sepal.Width + Species,
             data=iris[iris$Sepal.Width == 3, ]))

# Call:
# lm(formula = Sepal.Length ~ Sepal.Width + Species, 
#    data = iris[iris$Sepal.Width == 3, ])

# Coefficients:
#       (Intercept)        Sepal.Width  Speciesversicolor   Speciesvirginica  
#             4.700                 NA              1.250              2.017

Solution: An independent variable that takes only one value has no variation, so its effect cannot be estimated. You need to drop that variable, whether it is numeric, character or factor.

Updated as per comments: Since you know that the error will only occur with factor/character, you can focus only on those and see whether the length of levels of those factor variables is 1 (DROP) or greater than 1 (NODROP).

To see whether each variable is a factor, use the following code:

(l <- sapply(iris, function(x) is.factor(x)))
# Sepal.Length  Sepal.Width Petal.Length  Petal.Width      Species 
#        FALSE        FALSE        FALSE        FALSE         TRUE 

Then you can get the data frame of factor variables only (drop = FALSE keeps the result a data frame even when only one column is selected):

m <- iris[, l, drop = FALSE]

Now find the number of levels of each factor variable. If it is one, you need to drop that variable:

n <- sapply(m, function(x) length(levels(x)))
ifelse(n == 1, "DROP", "NODROP")

Note: A factor variable with only one level is the variable you have to drop.
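The check above counts declared levels, so it can miss a factor whose extra levels are empty (as a comment on this answer points out). Here is a rough sketch of a variant that counts the distinct non-missing values actually present instead. The helper name `find_constant_factors` and the `setosa` subset are made up for illustration. Note that it looks at each column on its own, so it will not catch levels that only disappear once rows with NAs in other columns are dropped.

```r
# Hypothetical helper: names of factor/character columns that have fewer
# than two distinct non-missing values in the data as given.
find_constant_factors <- function(df) {
  is_cat <- vapply(df, function(x) is.factor(x) || is.character(x), logical(1))
  n_used <- vapply(df[is_cat],
                   function(x) length(unique(x[!is.na(x)])),
                   integer(1))
  names(n_used)[n_used < 2]
}

# Species still has three declared levels here, but only one is used
setosa <- iris[iris$Species == "setosa", ]

find_constant_factors(setosa)  # flags Species
find_constant_factors(iris)    # nothing to flag
```

Any column this returns would have to be removed from the formula (or the data fixed) before calling lm().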

Max Ghenis
Metrics
  • OK thanks. Is there any way that I can fix this in R, or is it the original data that needs to be edited? Also, having looked through the data, all of the variables take more than one value. Is there any way to see which specific variables the error is referring to? – REnthusiast Aug 11 '13 at 11:14
  • 2
    Also, if your variable contains "exotic" characters, the same error will show up, which I guess is a bug. My variable CustomerType had one value which contained an "ö"; when I changed that, the error disappeared. – ErrantBard Oct 18 '16 at 12:18
  • 5
    Your last `ifelse` does not work. A variable can have 2 levels, but if one of them is empty, you will get an error, but your code will not detect it. With a data frame `df`, a better formula would be: `which(sapply(df, function(x) length(unique(x))<2))` which lists out the variables that are problematic. – Roobie Nuby May 14 '17 at 23:10
20

It appears that at least one of your predictors, x1, x2, or x3, has only one factor level and hence is a constant.

Have a look at

lapply(dataframe.df[c("x1", "x2", "x3")], unique)

to find the different values.

Sven Hohenstein
9

This error message may also happen when the data contains NAs.

In this case, the behaviour depends on the defaults (see documentation), and all cases with NAs in the columns used by the model may be silently dropped. So a factor may indeed have several levels in the full data, but only one level once the cases with NAs are removed.

In this case, to fix the error, either change the model (remove the problematic factor from the formula), or change the data (i.e. complete the cases).
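To make this concrete, here is a small sketch with made-up data (the data frame `df` and its columns are hypothetical). The factor `g` has two levels, but every row with `g == "b"` has a missing `x`, so only one level is left after the incomplete rows are dropped. It assumes the usual `na.omit` handling, which is passed explicitly to `model.frame()` here.

```r
# Made-up data: g has two levels, but all "b" rows have a missing x
df <- data.frame(
  y = c(1.0, 2.1, 2.9, 4.2, 5.1, 6.0),
  x = c(1, 2, 3, NA, NA, NA),
  g = factor(c("a", "a", "a", "b", "b", "b"))
)

nlevels(df$g)  # two levels in the full data

# Build the model frame the way lm() would under na.omit,
# then count the levels that are still in use
mf <- model.frame(y ~ x + g, data = df, na.action = na.omit)
used_levels <- nlevels(droplevels(mf$g))
used_levels    # only "a" survives

# lm() stops with an error on this data; capture the message instead of failing
fit <- tryCatch(lm(y ~ x + g, data = df),
                error = function(e) conditionMessage(e))
fit
```

Checking `droplevels()` on the model frame rather than on the raw column is what reveals the problem, since the raw column looks fine.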

jarauh
6

The answers by the other authors have already addressed the problem of factors with only one level or NAs.

Today, I stumbled upon the same error when using the rstatix::anova_test() function but my factors were okay (more than one level, no NAs, no character vectors, ...). Instead, I could fix the error by dropping all variables in the dataframe that are not included in the model. I don't know what's the reason for this behavior but just knowing about this might also be helpful when encountering this error.
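In case it helps, one way to keep only the model's variables in base R is `all.vars()` on the formula. This is just a sketch using iris as a stand-in for the real data, and it assumes the formula names its variables explicitly (it will not work with `.` on the right-hand side).

```r
# The formula of the model you want to fit (iris used as stand-in data)
f <- Sepal.Length ~ Sepal.Width + Species

# Keep only the columns that appear in the formula
model_data <- iris[, all.vars(f), drop = FALSE]
names(model_data)
```

`model_data` can then be passed to the modelling function in place of the full data frame.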

Tee
  • 2
    You just solved my problem. This must be a bug of sorts, you'd think the function should be able to ignore the other columns – MonikaP Oct 12 '20 at 13:31
  • Thank you! Had exactly this problem with `rstatix::anova_test()` and despite the error message pointing elsewhere, this was indeed the cause. – Michael MacAskill Dec 22 '20 at 03:03
  • This issue seems to be fixed in the latest rstatix package (0.7.0) – ekatko1 May 08 '21 at 13:03
3

Metrics' and Sven's answers deal with the usual situation, but for those of us who work in non-English environments: if you have exotic characters (å, ä, ö) in your character variable, you will get the same error, even if you have multiple factor levels.

Levels <- c("Pri", "För") gives the contrast error, while Levels <- c("Pri", "For") doesn't

This is probably a bug.

ErrantBard
  • 1
    Thanks for this suggestion. I'm getting this error with two of my factor variables despite having thoroughly checked that more than one level is being passed to the model, and I wondered if it's due to my data originating from a non-English environment. However, the levels contain no exotic characters, and recoding them doesn't solve the problem. – Joe Aug 20 '18 at 13:57
1

This is a variation to the answer provided by @Metrics and edited by @Max Ghenis...

l <- sapply(iris, function(x) is.factor(x))
m <- iris[,l]

n <- sapply(m, function(x) {
  y <- summary(x) / length(x)
  len <- length(y[y < 0.005 | y > 0.995])
  cbind(len, t(y))
})

drop_cols_df <- data.frame(var = names(l[l]), 
                           status = ifelse(as.vector(t(n[1,]))==0,"NODROP","DROP" ),
                           level1 = as.vector(t(n[2,])),
                           level2 = as.vector(t(n[3,])))

Here, after identifying factor variables, the second sapply computes what percent of records belong to each level / category of the variable. Then it identifies number of levels over 99.5% or below 0.5% incidence rate (my arbitrary thresholds).

It then goes on to return the number of valid levels and the incidence rate of each level in each categorical variable.

Variables with zero levels crossing the thresholds should not be dropped, while the others should be dropped from the linear model.

The last data frame makes viewing the results easy. It's hard coded for this data set since all factor variables are binomial. This data frame can be made generic easily enough.
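For what it's worth, a generic version might look something like the sketch below. It is not the code above, just one possible way to do it: the function name `rare_level_report` and its `lower`/`upper` arguments are invented here, and it returns one row per factor level instead of one row per variable, so it does not depend on the number of levels.

```r
# Hypothetical generic helper: share of records in each level of every
# factor column, flagging levels outside the (arbitrary) thresholds.
rare_level_report <- function(df, lower = 0.005, upper = 0.995) {
  is_fac <- vapply(df, is.factor, logical(1))
  rows <- lapply(names(df)[is_fac], function(v) {
    share <- prop.table(table(df[[v]]))  # NAs are excluded by table()
    data.frame(var     = v,
               level   = names(share),
               share   = as.numeric(share),
               flagged = as.numeric(share) < lower | as.numeric(share) > upper)
  })
  do.call(rbind, rows)
}

rare_level_report(iris)                              # balanced levels
rare_level_report(iris[iris$Species == "setosa", ])  # one level holds all rows
```

A variable with any flagged level would be a candidate for dropping under the same rule as above.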

dk_b
1

If the error happens because your data has NAs, then you need to tell glm() how you would like it to treat the NA cases (for example through its na.action argument). More information on this can be found in a relevant post here: https://stats.stackexchange.com/questions/46692/how-the-na-values-are-treated-in-glm-in-r

Sandy
1

From my experience ten minutes ago, this can also happen when a factor has more than one category but a lot of NAs. Taking the Kaggle Houseprice dataset as an example, if you load the data and run a simple regression,

train.df = read.csv('train.csv')
lm1 = lm(SalePrice ~ ., data = train.df)

you will get the same error. I also tried checking the number of levels of each factor, but none of them reports fewer than 2 levels.

cols = colnames(train.df)
for (col in cols){
  if(is.factor(train.df[[col]])){
    cat(col, ' has ', length(levels(train.df[[col]])), '\n')
  }
}

So, after a long time, I used summary(train.df) to look at the details of each column, removed some of them, and it finally worked:

train.df = subset(train.df, select=-c(Id, PoolQC,Fence, MiscFeature, Alley, Utilities))
lm1 = lm(SalePrice ~ ., data = train.df)

If I keep any one of these columns in the data, the regression fails again with the same error (I tested this myself).

The attributes above generally have 1400+ NAs and only about 10 useful values, so you might want to remove these garbage attributes even if they have 3 or 4 levels. I guess a function counting the NAs in each column would help.
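Something along these lines should do for counting the NAs per column. It is only a sketch: the `toy` data frame stands in for `train.df`, and the 50% cutoff is an arbitrary choice.

```r
# Toy data standing in for train.df (hypothetical columns)
toy <- data.frame(
  price = c(10, 12, 15, 11, 14),
  pool  = factor(c(NA, NA, "Gd", NA, NA)),
  zone  = factor(c("RL", "RM", "RL", "RL", "RM"))
)

na_count <- colSums(is.na(toy))   # number of NAs in each column
na_share <- na_count / nrow(toy)  # fraction of NAs in each column

# Columns where more than half of the values are missing (arbitrary cutoff)
mostly_missing <- names(na_share)[na_share > 0.5]
mostly_missing
```

Sorting `na_count` in decreasing order is a quick way to see the worst columns first.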

1

I had the same problem when some of the value columns were integer and others numeric. Changing all the numeric columns to integer solved the issue (I don't know whether it affects the analysis, though).

tvs290