
I am trying logistic regression on a dataset. I have successfully split my dataset into train and test sets. The regression model also works fine; however, when I apply it to my test set I only get an outcome for 393 observations, even though my test dataset has 480 rows. How can I compare the two and find the mismatch, or find out what went wrong?

My data has no NAs.

I am trying to create a confusion matrix.

This is my code:

n=nrow(wine_log)
shuffled=wine_log[sample(n),]

train_indices=1:round(0.7*n)
test_indices=(round(0.7*n)+1):n

#Making a new dataset
train=shuffled[train_indices,]
test=shuffled[test_indices,]

wmodel = glm(final_take~., family = binomial, data=train)
summary(wmodel)

result1 = predict(wmodel, newdata = test, type = 'response')
result1 = ifelse(result > 0.5, 1, 0)  # Can someone also explain how removing this will affect the outcome?
result1

> table(result1)
result1
  0   1 
255 138 
> table(test$final_take)

 Bad Good 
 418   62 

structure(list(fixed_acid = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 
7.9, 7.3, 7.8, 7.5), vol_acid = c(0.7, 0.88, 0.76, 0.28, 0.7, 
0.66, 0.6, 0.65, 0.58, 0.5), c_acid = c(0, 0, 0.04, 0.56, 0, 
0, 0.06, 0, 0.02, 0.36), res_sugar = c(1.9, 2.6, 2.3, 1.9, 1.9, 
1.8, 1.6, 1.2, 2, 6.1), chlorides = c(0.076, 0.098, 0.092, 0.075, 
0.076, 0.075, 0.069, 0.065, 0.073, 0.071), free_siox = c(11, 
25, 15, 17, 11, 13, 15, 15, 9, 17), total_diox = c(34, 67, 54, 
60, 34, 40, 59, 21, 18, 102), density = c(0.9978, 0.9968, 0.997, 
0.998, 0.9978, 0.9978, 0.9964, 0.9946, 0.9968, 0.9978), pH = c(3.51, 
3.2, 3.26, 3.16, 3.51, 3.51, 3.3, 3.39, 3.36, 3.35), sulphates = c(0.56, 
0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, 0.57, 0.8), alcohol = c(9.4, 
9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5), final_take = structure(c(1L, 
1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 1L), .Label = c("Bad", "Good"
), class = "factor")), row.names = c(NA, -10L), class = c("spec_tbl_df", 
"tbl_df", "tbl", "data.frame"),
Bhavna
  • Can you please provide the data to make this code reproducible for us? Looks like we need `wine_log`. You can refer to this for help in getting us that data: https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example – Harrison Jones Sep 29 '21 at 14:46

1 Answer


Your line of code here:

result1 = ifelse(result > 0.5, 1, 0)

The ifelse statement should be referencing result1, not result. I'm guessing that result is another object in your environment that isn't 480 rows long, which would explain why you only get 393 values back.

So you should use this instead.

result1 = ifelse(result1 > 0.5, 1, 0)
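One way to confirm this diagnosis is to compare lengths before and after thresholding. The following is only a minimal sketch: toy is a small simulated stand-in for wine_log (the values are made up, not your data), and pred_class is an illustrative name.

```r
# Toy stand-in for wine_log; simulated values for illustration only.
set.seed(1)
toy <- data.frame(
  alcohol    = rnorm(100, mean = 10, sd = 1),
  final_take = factor(sample(c("Bad", "Good"), 100, replace = TRUE))
)
train <- toy[1:70, ]
test  <- toy[71:100, ]

wmodel  <- glm(final_take ~ ., family = binomial, data = train)
result1 <- predict(wmodel, newdata = test, type = "response")

# predict() returns one probability per row of newdata,
# so this check should pass when test has no missing values.
stopifnot(length(result1) == nrow(test))

# Thresholding the same object keeps the length unchanged.
pred_class <- ifelse(result1 > 0.5, 1, 0)
stopifnot(length(pred_class) == nrow(test))
```

If the first check passes but your thresholded vector is shorter, the ifelse call is almost certainly reading from a different object.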

You also asked what this line of code is doing. It's basically a threshold for your predictions from the glm model. If the prediction from the model is greater than 0.50, then you are translating the prediction to a "1". If it's less than or equal to 0.50, then you are translating that prediction to a "0". It's a way to convert a probability to a TRUE/FALSE or 1/0. If you remove the line, result1 stays a vector of predicted probabilities, so table(result1) would count distinct probability values rather than the two classes.
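Since you mentioned wanting a confusion matrix, here is a rough sketch of how the thresholded predictions could be cross-tabulated against the actual labels. It again uses simulated data rather than wine_log, and the names toy, prob_good, pred and conf_mat are illustrative. It relies on the fact that, for a factor response in a binomial glm, the first level ("Bad") is treated as failure, so the predicted probability refers to "Good".

```r
# Simulated data for illustration only; not the real wine data.
set.seed(42)
n <- 200
alcohol <- rnorm(n, mean = 10, sd = 1)
p_good  <- plogis(1.5 * (alcohol - 10))  # higher alcohol -> "Good" more likely
final_take <- factor(ifelse(runif(n) < p_good, "Good", "Bad"),
                     levels = c("Bad", "Good"))
toy <- data.frame(alcohol, final_take)

train <- toy[1:140, ]
test  <- toy[141:200, ]

wmodel <- glm(final_take ~ ., family = binomial, data = train)

# Probability of the second factor level ("Good") for each test row
prob_good <- predict(wmodel, newdata = test, type = "response")

# Apply the 0.5 threshold and label the classes like the response
pred <- factor(ifelse(prob_good > 0.5, "Good", "Bad"),
               levels = c("Bad", "Good"))

# Confusion matrix: rows are predictions, columns are actual labels
conf_mat <- table(predicted = pred, actual = test$final_take)
print(conf_mat)

# Share of test rows classified correctly
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)
print(accuracy)
```

Labelling the predictions with the same factor levels as final_take keeps the table 2 x 2 even if one class is never predicted.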

Harrison Jones