This is my code for a logistic regression:

data_raw = data.frame(
  Var1= c(11, 5, 1, 0, 5, 1, 0, 0, 1, 0),
  Var2= c(11, 5, 0, 0, 2, 1, 0, 2, 0, 2),
  Var3= c(10, 7, 15, 9, 16, 9, 13, 15, 11, 17),
  Var4= c(6, 10, 36, 10, 9, 12, 17, 5, 12, 14),
  Var5= c(7, 26, 24, 16, 23, 25, 15, 10, 15, 22),
  Var6= c(0, 0, 1, 0, 0, 2, 1, 0, 0, 0),
  Var7= c(17, 21, 23, 16, 26, 22, 11, 9, 9, 9),
  Var8= c(1, 0, 1, 0, 3, 5, 2, 0, 0, 0),
  Var9= c(3, 0, 3, 3, 2, 0, 1, 3, 3, 2),
  Var10= c(3, 0, 3, 3, 2, 0, 1, 3, 3, 2),
  Var11= c(7, 2, 6, 7, 7, 5, 3, 5, 5, 4),
  Var12= c(4, 3, 3, 4, 2, 3, 4, 8, 7, 5),
  Var13= c("Summer", "Summer", "Summer", "Summer", "Autumn", "Autumn", "Summer", "Summer", "Summer", "Summer"),
  Var14= c("Both host", "Both host", "Both host", "Host Visitor", "Both host", "Both host", "Both host", "Both host", "Both host", "Both host"),
  Var15= c("Home", "Similar", "Similar", "Similar", "Similar", "Similar", "Home", "Home", "Home", "Similar"),
  Winner = c("Win", "Loss", "Win", "Loss", "Win", "Win", "Loss", "Loss", "Win", "Loss"),
  stringsAsFactors = TRUE
)
set.seed(123) # <-- edit
data_shuffled = sample(1:nrow(data_raw))

data_new = data_raw[data_shuffled, ]

create_train_test <- function(data_new, size = 0.8, train = TRUE) {
  n_row = nrow(data_new)
  total_row = size * n_row
  train_sample = 1: total_row
  if (train == TRUE) {
    return (data_new[train_sample, ])
  } else {
    return (data_new[-train_sample, ])
  }
}

data_train <- create_train_test(data_new, size= 0.8, train = TRUE)
data_test <- create_train_test(data_new, size= 0.8, train = FALSE)

mymodel = glm(Winner~., data= data_train, family= binomial)

res2 = predict(mymodel, data= data_test, type="response")
pred2= ifelse(res2>0.5, 1, 0)
tab2= table(data_test$Winner, pred2)

The final line throws an error saying that all arguments must have the same length. On inspection, I found that the two arguments do indeed have different lengths. Why does this happen? The same code works fine with a different data set. Edit: I have included an example data set.

tpetzoldt
Renin RK
  • Does your data have missing values? It's easier to help you if you include a simple [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) with sample input that can be used to test and verify possible solutions. We can't really copy/paste a `str()` into R to test it. – MrFlick Jun 08 '21 at 05:07
  • @MrFlick I have included an example data. There are no NA values in the original dataset. – Renin RK Jun 08 '21 at 05:56
  • Unable to reproduce. `mymodel = glm(Winner~., data= data_train, family= binomial)` produces `Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) : contrasts can be applied only to factors with 2 or more levels`. Please test your code in a *clean* R session. – Limey Jun 08 '21 at 06:06
  • You have to use `newdata` and not `data` in `predict`: `res2 <- predict(mymodel, newdata= data_test, type="response")` – GKi Jun 08 '21 at 06:10
  • Added `set.seed(123)` to make it reproducible, otherwise it can happen that factor levels go extinct. – tpetzoldt Jun 08 '21 at 06:12
  • @GKi Changed to newdata, but getting error In predict.lm(object, newdata, se.fit, scale = 1, type = if (type == : prediction from a rank-deficient fit may be misleading – Renin RK Jun 08 '21 at 06:17
  • This is not an error, it is a warning, because the sample size is too small. The test set has only `N=2` rows and cannot contain all factor combinations. I would consider the question to be technically solved, just increase the sample size. @GKi: would you like to post it as an answer? – tpetzoldt Jun 08 '21 at 06:20
  • @ReninRK I get "only" a warning: *prediction from a rank-deficient fit may be misleading*. – GKi Jun 08 '21 at 06:20
  • The main problem with the posted data is that there is only one row with `which(data_raw$Var14 == "Host Visitor")`, row 4. When creating the train and test data sets `Var14` can become a factor with only one level. – Rui Barradas Jun 08 '21 at 06:41

1 Answer

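When you call predict on a new data set, you have to pass it as newdata, not data. predict has no data argument, so it is silently ignored and the fitted values for the training data are returned, which is why the lengths differ.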

res2 <- predict(mymodel, newdata=data_test, type="response")
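
The difference can be checked with a minimal sketch. It assumes the built-in mtcars data instead of the data from the question, and the names train, test, fit, p_wrong and p_right are only illustrative:

```r
# Illustrative split of the built-in mtcars data (not the asker's data set)
train <- mtcars[1:24, ]
test  <- mtcars[25:32, ]

# Logistic regression of transmission type (am, coded 0/1) on weight
fit <- glm(am ~ wt, data = train, family = binomial)

# `data` is not an argument of predict.glm; it is absorbed by `...`,
# so the fitted values for the training rows come back.
p_wrong <- predict(fit, data = test, type = "response")
length(p_wrong) == nrow(train)

# `newdata` is the argument that supplies the rows to predict for.
p_right <- predict(fit, newdata = test, type = "response")
length(p_right) == nrow(test)

# With matching lengths, table() no longer complains.
pred <- ifelse(p_right > 0.5, 1, 0)
table(test$am, pred)
```

With data = the prediction vector has one entry per training row, so table() sees two arguments of different lengths; with newdata = it has one entry per test row.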
GKi