
I have a dataset with predictors: 35446 rows and 38 columns, including the target.

First I balance the classes (the target is heavily imbalanced in the original data) by undersampling the majority class:

library(dplyr)

data_for_predict <- res
# split by class, then undersample the majority class (target == 0) to a 1:1 ratio
data_good <- data_for_predict %>% filter(target == 1)
data_bad  <- data_for_predict %>% filter(target == 0)
set.seed(789)
size_bad <- floor(1 * nrow(data_good))
data_ind <- sample(seq_len(nrow(data_bad)), size = size_bad)
data_bad <- data_bad[data_ind, ]
data_for_predict <- rbind(data_good, data_bad)
# shuffle the rows
data_for_predict <- data_for_predict[sample(1:nrow(data_for_predict)), ]
goal <- as.data.frame(data_for_predict$target)
data_for_predict <- data_for_predict %>% select(-target)

After that, I want to reduce the dimensionality of the data with PCA.

PCA_fit <- prcomp(data_for_predict, scale. = TRUE)  # keep the fit so new data can be projected later
PCA <- as.data.frame(PCA_fit$x)
# note: this appends the principal components to the original columns,
# so the number of columns grows rather than shrinks
data_for_predict <- cbind(data_for_predict, PCA)
data_for_predict <- as.data.frame(data_for_predict)
data_for_predict$target <- goal$`data_for_predict$target`
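
If the goal is genuine dimensionality reduction, one option (a sketch, not what the code above does) is to keep only the leading components and drop the raw columns; the 0.95 variance cutoff here is an arbitrary illustrative choice:

var_explained <- cumsum(PCA_fit$sdev^2) / sum(PCA_fit$sdev^2)
k <- which(var_explained >= 0.95)[1]              # smallest k explaining ~95% of variance
data_reduced <- as.data.frame(PCA_fit$x[, 1:k])   # PC scores only, original columns dropped
data_reduced$target <- goal$`data_for_predict$target`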

Then I split the data into training and test samples:

smp_size <- floor(0.8 * nrow(data_for_predict))
set.seed(123)
train_ind <- sample(seq_len(nrow(data_for_predict)), size = smp_size)

train <- data_for_predict[train_ind, ]
rownames(train) <- seq_len(nrow(train))
test <- data_for_predict[-train_ind, ]
rownames(test) <- seq_len(nrow(test))
names(test) <- make.names(names(test))
names(train) <- make.names(names(train))

Now I prepare the data for training:

library(data.table)
library(xgboost)

setDT(train)
setDT(test)
labels <- train$target
ts_label <- test$target
# one-hot encode with model.matrix (no intercept), dropping the target column
new_tr <- model.matrix(~ . + 0, data = train[, -c("target"), with = FALSE])
new_ts <- model.matrix(~ . + 0, data = test[, -c("target"), with = FALSE])
dtrain <- xgb.DMatrix(data = new_tr, label = labels)
dtest  <- xgb.DMatrix(data = new_ts, label = ts_label)

And fit:

params <- list(booster = "gbtree", objective = "binary:logistic",
               eta = 0.3, gamma = 0, max_depth = 10,
               min_child_weight = 1, subsample = 1, colsample_bytree = 1)


library(caret)

xgbcv <- xgb.cv(params = params, data = dtrain, nrounds = 1000, nfold = 5,
                showsd = TRUE, stratified = TRUE, print_every_n = 10,
                early_stopping_rounds = 20, maximize = FALSE, eval_metric = "error")
xgb1 <- xgb.train(params = params, data = dtrain, nrounds = 46,
                  watchlist = list(val = dtest, train = dtrain),
                  print_every_n = 10, maximize = FALSE, eval_metric = "error")

# predict() returns probabilities for binary:logistic, so no type argument is needed
xgbpred <- predict(xgb1, dtest)
xgbpred <- ifelse(xgbpred > 0.77, 1, 0)

confusionMatrix(factor(xgbpred), factor(ts_label))

I get a good result:

Confusion Matrix and Statistics

          Reference
Prediction    0    1
         0 1569   90
         1  102 1583

               Accuracy : 0.9426             
                 95% CI : (0.9342, 0.9502)   
    No Information Rate : 0.5003             
    P-Value [Acc > NIR] : <0.0000000000000002

                  Kappa : 0.8852             
 Mcnemar's Test P-Value : 0.4273             

            Sensitivity : 0.9390             
            Specificity : 0.9462             
         Pos Pred Value : 0.9458             
         Neg Pred Value : 0.9395             
             Prevalence : 0.4997             
         Detection Rate : 0.4692             
   Detection Prevalence : 0.4961             
      Balanced Accuracy : 0.9426             

       'Positive' Class : 0    

But when I predict the entire dataset (35446 rows and 38 columns) with the same model, I get the result below (a reconstruction of the scoring step follows the output):

Confusion Matrix and Statistics

          Reference
Prediction     0     1
         0 22386  5328
         1  4701  3031

               Accuracy : 0.7171          
                 95% CI : (0.7123, 0.7217)
    No Information Rate : 0.7642          
    P-Value [Acc > NIR] : 1               

                  Kappa : 0.1941          
 Mcnemar's Test P-Value : 0.000000000408  

            Sensitivity : 0.8264          
            Specificity : 0.3626          
         Pos Pred Value : 0.8078          
         Neg Pred Value : 0.3920          
             Prevalence : 0.7642          
         Detection Rate : 0.6316          
   Detection Prevalence : 0.7819          
      Balanced Accuracy : 0.5945          

       'Positive' Class : 0 
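
The exact scoring code isn't shown above, so this is a reconstruction under assumptions: `res` still holds the original unbalanced data, the new rows are projected with the stored PCA fit via predict(), and the same 0.77 cutoff is applied:

# hypothetical reconstruction of the full-dataset scoring, not the poster's code
full <- res %>% select(-target)
full <- cbind(full, as.data.frame(predict(PCA_fit, newdata = full)))
names(full) <- make.names(names(full))
new_full <- model.matrix(~ . + 0, data = full)
full_pred <- ifelse(predict(xgb1, new_full) > 0.77, 1, 0)
confusionMatrix(factor(full_pred), factor(res$target))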

Why is the accuracy so much lower, given that the model was built on the same data?

  • if you have a binary response {0...1}, why did you set your initial cutoff at 0.77? Are you sure you are not introducing a bias here? `xgbpred <- ifelse(xgbpred > 0.77,1,0)`. Maybe, if you set it at 0.5, you'll get a better result on the training set (overfitted!) and a poorer result on the test set... – Damiano Fantini Aug 29 '17 at 09:46
  • Would you make your example reproducible https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example – csgillespie Aug 29 '17 at 09:46
  • @DamianoFantini I checked this threshold in a loop and chose the best one; of that I'm 100% sure. – AntonCH Aug 29 '17 at 09:48
  • Did you choose the best threshold on the training set or on the test set? You are supposed to pick the threshold on the training set and see if it works nicely on the test set. Please, can you run this: `xgbpred2 <- predict(xgb1, dtrain, type = "response"); xgbpred2 <- ifelse(xgbpred2 > 0.77,1,0); confusionMatrix(xgbpred2, labels)` and report the result? I have the feeling this may explain something... – Damiano Fantini Aug 29 '17 at 10:09
  • @DamianoFantini 100% accuracy – AntonCH Aug 29 '17 at 10:16
  • @DamianoFantini I have a feeling that the poor accuracy comes from balancing the classes for training and prediction while the original dataset is very unbalanced. But I don't know how to fix it. – AntonCH Aug 29 '17 at 10:18
  • can you use weights? – Damiano Fantini Aug 29 '17 at 10:19
  • @DamianoFantini Could you explain in detail? I didn't understand. – AntonCH Aug 29 '17 at 10:22
  • xgb.train accepts weights, I think. This is a way to correct for class imbalance instead of using under/over-sampling techniques. Briefly, you increase the weight of your minority class in the model (see the sketch after this comment thread). I think you can find an example here... https://datascience.stackexchange.com/questions/9488/xgboost-give-more-importance-to-recent-samples – Damiano Fantini Aug 29 '17 at 10:25
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/153119/discussion-between-antonch-and-damiano-fantini). – AntonCH Aug 29 '17 at 10:26
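
Following up on the weighting suggestion in the comments: a minimal sketch (not code from the discussion) of letting xgboost correct for class imbalance instead of undersampling, via its `scale_pos_weight` parameter. It assumes training on the full unbalanced data; `dtrain_full` is a placeholder name for such a DMatrix:

# weight the positive class by the negative/positive ratio instead of undersampling
n_neg <- sum(res$target == 0)
n_pos <- sum(res$target == 1)
params_w <- list(booster = "gbtree", objective = "binary:logistic",
                 eta = 0.3, max_depth = 10,
                 scale_pos_weight = n_neg / n_pos)
# then train as before, e.g. xgb.train(params = params_w, data = dtrain_full, nrounds = ...)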
