
I am training logistic regression in R, using a train set and a test set. I have some data with a binary output. In the data file the output is the integer 1 or 0, with no missing values. There are more 1s than 0s (the proportion is 70/30).

The result of the logistic regression is very different depending on whether I convert the output to a factor or not. If I keep the output variable as numeric 0/1 and write

m1 <- glm(output~.,data=dt_tr,family=binomial())

then I get something without warnings or errors, but if I write

dt$output<-as.factor(ifelse(dt$output == 1, "Good", "Bad"))
m1 <- glm(output~.,data=dt_tr,family=binomial())

I get completely different performance! What could be the cause?

To be more precise, after training LR I do the following:

library(ROCR)  # provides prediction() and performance()
m1_score <- predict(m1, dt_test, type = 'response')
m1_pred <- prediction(m1_score, dt_test$output)
m1_perf <- performance(m1_pred, "tpr", "fpr")
# ROC
plot(m1_perf, lwd = 2, main = "ROC")

I get very different ROCs and AUCs.
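For reference, the AUC itself can be pulled from the same ROCR objects used above for the ROC curve (a minimal sketch; the toy scores and labels are made up just to keep it self-contained):

```r
library(ROCR)  # prediction() and performance() come from this package

# toy labels and scores, loosely correlated, standing in for dt_test$output and m1_score
set.seed(1)
labels <- rbinom(50, 1, 0.5)
scores <- labels * 0.4 + runif(50)

pred <- prediction(scores, labels)
auc  <- performance(pred, "auc")@y.values[[1]]   # scalar AUC
```

Comparing this scalar between the numeric-output and factor-output runs makes the "very different AUCs" claim concrete.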


2 Answers


Without seeing your data, I would guess that converting your response variable to a factor is causing the problem.

Your original data are binary 1/0, so when they are processed as numbers during the regression they are treated literally as 1 and 0. But when you turn them into factors, R internally codes the levels as 1 and 2:

x <- c(0, 1, 1, 0, 0, 1, 1)
y <- as.factor(ifelse(x == 1, "Good", "Bad"))
as.numeric(y)
# [1] 1 2 2 1 1 2 2
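As a side note, you can check how R will order the levels, and therefore which outcome a binomial `glm` models, with base R's `levels()` and `relevel()` (a small sketch, reusing the toy vector from the answer above):

```r
x <- c(0, 1, 1, 0, 0, 1, 1)
y <- as.factor(ifelse(x == 1, "Good", "Bad"))
levels(y)                  # "Bad" "Good": levels are sorted alphabetically
# For family = binomial(), glm() treats the FIRST level as failure,
# so the model estimates P(output == "Good"), the same event as output == 1.
# To flip which level is the reference:
y2 <- relevel(y, ref = "Good")
levels(y2)                 # "Good" "Bad"
```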

It was my silly mistake: I just forgot to set the seed. The only thing I would like to add is that if you use random forest, you must convert the output to a factor, otherwise R will treat it as numeric data and run regression instead of classification.
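For reference, a minimal sketch of a reproducible 70/30 train/test split with `set.seed` (the data frame and its columns here are made up, standing in for the original `dt`):

```r
set.seed(42)  # fix the RNG so the split (and anything downstream) is repeatable

# toy data in place of the original dt
dt <- data.frame(x = rnorm(100), output = rbinom(100, 1, 0.7))

idx     <- sample(seq_len(nrow(dt)), size = 0.7 * nrow(dt))
dt_tr   <- dt[idx, ]
dt_test <- dt[-idx, ]
```

Without the `set.seed` call, `sample()` produces a different split on every run, so the two models were likely trained and scored on different partitions.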
