predict() returns the wrong number of predictions and I can't obtain the confusion matrix

Question

In R I have a data frame called dtab; here is a very small part of it:

structure(list(ID = 1:10, X9Profit = c(21L, -6L, -49L, -4L, -61L, 
-38L, -19L, 59L, 493L, -158L), X9Online = c(0L, 0L, 1L, 0L, 0L, 
0L, 0L, 0L, 0L, 0L), X9Age = c(NA, 6L, 5L, NA, 2L, NA, 3L, 5L, 
4L, 6L), X9Inc = c(NA, 3L, 5L, NA, 9L, 3L, 1L, 8L, 9L, 8L), X9Tenure =c(6.33, 
29.5, 26.41, 2.25, 9.91, 2.33, 8.41, 7.33, 15.33, 4.33), X9District =c(1200L, 
1200L, 1100L, 1200L, 1200L, 1300L, 1300L, 1200L, 1200L, 1100L
), X0Profit = c(NA, -32L, -22L, NA, -4L, 14L, 0L, -65L, 855L, 
-20L), X0Online = c(NA, 0L, 1L, NA, 0L, 0L, 0L, 0L, 0L, 0L), 
X9Billpay = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), X0Billpay = c(NA, 
0L, 0L, NA, 0L, 0L, 0L, 0L, 0L, 0L)), .Names = c("ID", "X9Profit", 
"X9Online", "X9Age", "X9Inc", "X9Tenure", "X9District", "X0Profit", 
"X0Online", "X9Billpay", "X0Billpay"), row.names = c(NA, 10L), class ="data.frame")

I renamed some variables and derived a few new ones (this part works):

N=dim(dtab)[1]
Profit9=dtab$X9Profit
Online9=dtab$X9Online
Age=dtab$X9Age
Income=dtab$X9Inc
Tenure=dtab$X9Tenure
District=dtab$X9District
Profit0=dtab$X0Profit
Online0=dtab$X0Online
District1100 = ifelse(District==1100,1,0)
District1200 = ifelse(District==1200,1,0)
AgeGiven = ifelse(is.na(Age),0,1)
AgeZero = ifelse(is.na(Age),0,Age)
IncomeZero = ifelse(is.na(Income),0,Income)
IncomeGiven = ifelse(is.na(Income),0,1)
Retain=ifelse(is.na(Profit0),0,1)
Retain=as.factor(Retain)

Retain is a dummy variable that takes the value 0 or 1. I want the confusion matrix for this logistic model (there are N observations):

Retain~Profit9+Online9+AgeZero+IncomeZero+Tenure

So I did:

set.seed(1)
x=sample(1:N, N/2,replace = FALSE)
training=dtab$ID %in% x

This generates a logical vector that selects the training set, which seems fine.

testing=!(training)

Retain_testing=dtab$Retain[testing]

model=glm(Retain~Profit9+Online9+AgeZero+IncomeZero+Tenure,
data=dtab[training,],family=binomial)

Now I get a warning, but THIS IS NOT A PROBLEM: it only occurs with this very small subset of the data (to avoid it I would need to include about 100 observations).
```
Warning message:
glm.fit: fitted probabilities numerically 0 or 1 occurred 
```
model_pred_probs=predict(model,newdata=dtab[testing,], type='response')

1st PROBLEM: too many predictions

Warning message: 'newdata' had 5 rows but variables found have 10 rows

I need to build the confusion matrix, so I did:

model_pred_Retain=rep(0,N/2)
model_pred_Retain[model_pred_probs>0.5]=1
table(model_pred_Retain, Retain_testing)

2nd PROBLEM (probably linked to the first one):

Error in table(model_pred_Retain, Retain_testing) : all arguments must have the same length

I've checked everywhere but I can't see what the problem is.
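A quick check that may help narrow this down. It assumes, as in the code above, that the renamed variables were created as free-standing vectors rather than as columns of dtab. The two-column data frame is just a cut-down copy of the dput() output, and `model_vars` and `missing_cols` are names introduced here for illustration only.

```r
# Cut-down copy of dtab: two columns taken from the dput() output in the question
dtab <- data.frame(
  ID       = 1:10,
  X9Profit = c(21, -6, -49, -4, -61, -38, -19, 59, 493, -158),
  X0Profit = c(NA, -32, -22, NA, -4, 14, 0, -65, 855, -20)
)

# Free-standing vectors, created the same way as in the question
Profit9 <- dtab$X9Profit
Retain  <- as.factor(ifelse(is.na(dtab$X0Profit), 0, 1))

# Which model variables are NOT columns of dtab? (illustrative names)
model_vars   <- c("Retain", "Profit9")
missing_cols <- setdiff(model_vars, names(dtab))
missing_cols

# `$` returns NULL for a column that does not exist, so subsetting it gives length 0
testing        <- rep(c(TRUE, FALSE), 5)
Retain_testing <- dtab$Retain[testing]
length(Retain_testing)
```

If `missing_cols` is non-empty, predict() can only find those variables in the workspace, where they still have all N rows; that would be consistent with the 'newdata' warning. An empty `Retain_testing` would likewise be consistent with the table() error.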

Check that `model_pred_Retain` is a vector of 0s and 1s and has the same length as `x` (or whatever the observed values are). If so use `with(DATI,table(x,model_pred_Retain))`. The diagonal refers to correct classification and off diagonal misclassification. — User7598, Jul 28 '15 at 18:20
It's a vector of the right length (N/2) but it consists only of 1s. If I do with(DATI,table(x,model_pred_Retain)) I get two long columns of numbers: the first contains the numbers in 1:N, the second only 1s. — alrac-ailem, Jul 28 '15 at 18:40
Once you get the predictions sorted out, to generate the confusion matrix, you could also use `confusionMatrix()` from the `caret` package. — ulfelder, Jul 28 '15 at 20:14
I've also tried to apply that function directly to the data frame in this way `confusionMatrix(data=DATI,reference=DATA[training,],positive=0 )` but it returns `Error in sort.list(y) : 'x' must be atomic for 'sort.list' Have you called 'sort' on a list?` — alrac-ailem, Jul 28 '15 at 20:30
You should update your question with a small amount of your data using `dput()`. It'll be easier to problem solve if the problem is reproducible. — User7598, Jul 28 '15 at 20:47
You should provide a [reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) with sample input so that we can run and test the code as well and test any possible solutions. — MrFlick, Jul 28 '15 at 20:50
Resolution: I created a new data frame with the same data and renamed all the columns, and it works! The problem with the confusion matrix was not a real problem; my data were just really bad, so I increased the threshold and everything works perfectly! — alrac-ailem, Jul 29 '15 at 08:19
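The resolution in the last comment can be sketched roughly as follows. This is only an illustration under stated assumptions: the data are randomly generated stand-ins (not the real dtab), the model uses a reduced set of predictors, and `threshold`, `stayed` and `conf_mat` are hypothetical names. The point is that every variable in the formula, including Retain, lives in the data frame, so predict() and the test labels are subset consistently.

```r
set.seed(1)

# Made-up stand-in for dtab (column names as in the question, random values)
n <- 200
dtab <- data.frame(
  ID       = 1:n,
  X9Profit = round(rnorm(n, mean = 0, sd = 100)),
  X9Online = rbinom(n, size = 1, prob = 0.2),
  X9Tenure = round(runif(n, min = 0, max = 30), 2)
)
# Longer tenure makes staying more likely; NA in X0Profit marks customers who left
stayed <- runif(n) < plogis(-0.5 + 0.08 * dtab$X9Tenure)
dtab$X0Profit <- ifelse(stayed, round(rnorm(n, mean = 0, sd = 100)), NA)

# Keep the renamed/derived variables as columns of the data frame
dtab$Profit9 <- dtab$X9Profit
dtab$Online9 <- dtab$X9Online
dtab$Tenure  <- dtab$X9Tenure
dtab$Retain  <- factor(ifelse(is.na(dtab$X0Profit), 0, 1), levels = c(0, 1))

# Same split logic as in the question
training <- dtab$ID %in% sample(1:n, n / 2, replace = FALSE)
testing  <- !training

model <- glm(Retain ~ Profit9 + Online9 + Tenure,
             data = dtab[training, ], family = binomial)

# All variables are found in newdata, so one probability per test row comes back
model_pred_probs <- predict(model, newdata = dtab[testing, ], type = "response")

threshold <- 0.5  # hypothetical cut-off; adjust to suit the data
model_pred_Retain <- factor(ifelse(model_pred_probs > threshold, 1, 0),
                            levels = c(0, 1))
Retain_testing <- dtab$Retain[testing]

conf_mat <- table(Predicted = model_pred_Retain, Actual = Retain_testing)
conf_mat
```

Fixing the factor levels to c(0, 1) keeps the table 2 x 2 even when every prediction falls in one class. If caret is installed, the same two factors could be passed to the function mentioned in the comments, e.g. `caret::confusionMatrix(data = model_pred_Retain, reference = Retain_testing, positive = "1")`; note that `positive` is a character string and both arguments are factors, not data frames.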
