I am trying to build a logistic regression model with diagnosis as the response (a two-level factor: B, M), but I get the following error when fitting the model:

Error in model.matrix.default(mt, mf, contrasts) : 
  variable 1 has no levels

I am not able to figure out how to solve this issue.

R Code:

Cancer <- read.csv("Breast_Cancer.csv")


## Logistic Regression Model

lm.fit <- glm(diagnosis~.-id-X, data = Cancer, family = binomial)
summary(lm.fit)

Dataset Reference: https://www.kaggle.com/uciml/breast-cancer-wisconsin-data

alex_jwb90
Priyanshu M
  • Hi Priyanshu. Could you provide some more information? For example, what is the exact wording of the error you are getting? Also, could you add a [minimal reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example/5963610#5963610)? (Requiring others to download an external file is not **minimal** and hardly needed to reproduce the error.) Providing an MRE makes it easier for others to help you. – dario Jul 26 '20 at 07:48

1 Answer

Your problem is similar to the one reported here on the randomForest classifier.
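Apparently glm checks every variable referenced in your formula, including the ones you subtract, and throws an error because X contains only NA values: with the default NA handling, every row counts as incomplete and is dropped, so nothing is left to fit.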
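
To check whether a column like X is the culprit, a quick diagnostic along these lines may help. This is only a sketch: the small data frame is a hypothetical stand-in for Breast_Cancer.csv (the values are invented, and only the all-NA column X mirrors the real file).

```r
# Hypothetical stand-in for the CSV; only the all-NA column X mirrors the real data.
Cancer <- data.frame(
  id = 1:4,
  diagnosis = c("B", "M", "B", "M"),
  radius_mean = c(11.2, 17.9, 12.4, 20.1),
  X = NA
)

# Flag columns in which every value is missing.
all_na <- vapply(Cancer, function(col) all(is.na(col)), logical(1))
all_na_cols <- names(Cancer)[all_na]
print(all_na_cols)

# Dropping incomplete rows (what na.omit does) leaves nothing behind.
rows_left <- nrow(na.omit(Cancer))
print(rows_left)
```

If the second value is 0, no complete rows survive the default NA handling, which is consistent with the "has no levels" error.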

You can fix that error in either of two ways:

  1. Drop X from your dataset completely by setting Cancer$X <- NULL before handing it to glm, and leave X out of your formula (glm(diagnosis~.-id, data = Cancer, family = binomial)).
  2. Add na.action = na.pass to the glm call, which tells glm to keep rows that contain NA values instead of dropping them, while still excluding X in the formula itself (glm(diagnosis~.-id-X, data = Cancer, family = binomial, na.action = na.pass)).
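
Both options can be tried side by side on a small example. The sketch below uses simulated data as a hypothetical stand-in for Breast_Cancer.csv (the column names radius_mean and texture_mean and all values are assumptions for illustration), so the fitted numbers mean nothing; it only shows that the failing call errors while both workarounds fit the same model.

```r
# Simulated stand-in for the CSV: two features, an id, and an all-NA column X.
set.seed(1)
n <- 40
Cancer <- data.frame(
  id = seq_len(n),
  diagnosis = factor(rep(c("B", "M"), each = n / 2)),
  radius_mean = c(rnorm(n / 2, mean = 12, sd = 2), rnorm(n / 2, mean = 14, sd = 2)),
  texture_mean = rnorm(n, mean = 19, sd = 4),
  X = NA
)

# The original call: capture the error instead of stopping the script.
err <- tryCatch(
  glm(diagnosis ~ . - id - X, data = Cancer, family = binomial),
  error = function(e) e
)
print(conditionMessage(err))

# Option 1: remove X from the data and from the formula.
Cancer_noX <- Cancer
Cancer_noX$X <- NULL
fit_drop <- glm(diagnosis ~ . - id, data = Cancer_noX, family = binomial)

# Option 2: keep X in the data, but do not drop rows with missing values.
fit_pass <- glm(diagnosis ~ . - id - X, data = Cancer, family = binomial,
                na.action = na.pass)

# Both fits use the same predictors, so the coefficients should agree.
print(all.equal(coef(fit_drop), coef(fit_pass)))
```

Option 1 is probably the safer habit, since na.pass also lets through missing values in columns that do enter the model.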

However, please note that you still have to provide the diagnosis variable in a form glm can digest, meaning a numeric vector with values 0 and 1, a logical vector, or a factor:

"For binomial and quasibinomial families the response can also be specified as a factor (when the first level denotes failure and all others success)" (from the glm documentation)

Just define Cancer$diagnosis <- as.factor(Cancer$diagnosis).
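
Because the first factor level is treated as failure, it is worth checking the level order after converting. A minimal sketch, using a short hypothetical vector in place of Cancer$diagnosis:

```r
# Hypothetical values standing in for Cancer$diagnosis as read from the CSV.
diagnosis_raw <- c("M", "B", "B", "M")

# Setting the levels explicitly makes B the failure level and M the success level,
# so the model describes the probability of M.
diagnosis <- factor(diagnosis_raw, levels = c("B", "M"))
print(levels(diagnosis))

# Equivalent 0/1 coding, if a numeric response is preferred.
diagnosis01 <- as.integer(diagnosis_raw == "M")
print(diagnosis01)
```

as.factor orders levels alphabetically, so it would also put B first here; spelling out the levels just makes the choice visible.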

On my end, this still leaves some warnings, but I think those come from the data or your feature selection. It does clear the blocking errors :)

alex_jwb90