I have two separate data sets: one for training (1000000 observations) and one for testing (1000000 observations). I divided the train set into 3 sets (mytrain: 700000 observations, myvalid: 150000 observations, mytest: 150000 observations). The test set with 1000000 observations doesn't include the target variable, so it should be used for the final test. Since some categorical variables have missing values, I need to use mice to impute them, and I should reuse the imputation model fitted on the mytrain set to fill the missing values in the myvalid, mytest and test sets. Based on the answer to this question, I should do this:

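library(mice)

# Stack all four sets so that a single mice run covers every row
data2 <- rbind(mytrain, myvalid, mytest, test)
data2$ST_EMPL <- as.factor(data2$ST_EMPL)
data2$TYP_RES <- as.factor(data2$TYP_RES)

# ignore = TRUE keeps a row out of the imputation model; the row is still imputed.
# Only the first 700000 rows (mytrain) are used to fit the model.
imp <- mice(data2, method = "cart", m = 1, maxit = 1, seed = 123,
            ignore = c(rep(FALSE, 700000), rep(TRUE, 1300000)))
data2.imp <- complete(imp, 1)
summary(imp)

# Split the completed data back into the original sets by row position
mytrainN <- data2.imp[1:700000, ]
myvalidN <- data2.imp[700001:850000, ]
mytestN <- data2.imp[850001:1000000, ]
testN <- data2.imp[1000001:2000000, ]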

However, since the test set does not have the target column, it cannot be combined with mytrain, myvalid, and mytest using rbind. Is it possible to add a hypothetical target column (with a value of, say, 10 for all 1000000 observations) to the test set?
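For illustration, here is a minimal sketch of one way to get matching columns without inventing a target value: drop the target from the labelled sets before stacking, then put it back afterwards. Everything in it is an assumption made for the example: the toy data, the helper `make_set`, the column names `target` and `AGE`, and the set sizes. It also derives the `ignore` vector and the split from `nrow()` instead of hard-coded row ranges. The mice call is guarded so the sketch still runs where mice is not installed, in which case nothing is imputed.

```r
# Toy stand-ins for the real sets; names, columns and sizes are hypothetical.
set.seed(1)
make_set <- function(n, with_target = TRUE) {
  d <- data.frame(
    ST_EMPL = factor(sample(c("A", "B", NA), n, replace = TRUE), levels = c("A", "B")),
    TYP_RES = factor(sample(c("X", "Y", NA), n, replace = TRUE), levels = c("X", "Y")),
    AGE     = round(runif(n, 18, 80))
  )
  if (with_target) d$target <- rbinom(n, 1, 0.3)
  d
}
mytrain <- make_set(70)
myvalid <- make_set(15)
mytest  <- make_set(15)
test    <- make_set(100, with_target = FALSE)  # no target column

target_col <- "target"  # assumed name of the target column
predictors <- setdiff(names(mytrain), target_col)

# Stack predictor columns only, so all four sets have the same columns.
stacked <- rbind(mytrain[predictors], myvalid[predictors],
                 mytest[predictors], test[predictors])

# Only the mytrain rows should inform the imputation model.
ignore <- seq_len(nrow(stacked)) > nrow(mytrain)

# Record which set each row came from.
sizes  <- c(mytrain = nrow(mytrain), myvalid = nrow(myvalid),
            mytest = nrow(mytest), test = nrow(test))
origin <- factor(rep(names(sizes), times = sizes), levels = names(sizes))

if (requireNamespace("mice", quietly = TRUE)) {
  imp <- mice::mice(stacked, method = "cart", m = 1, maxit = 1,
                    seed = 123, ignore = ignore, printFlag = FALSE)
  completed <- mice::complete(imp, 1)
} else {
  completed <- stacked  # mice not available: rows are left unimputed
}

# Split back into the four sets and restore the target where it exists.
parts <- split(completed, origin)
parts$mytrain[[target_col]] <- mytrain[[target_col]]
parts$myvalid[[target_col]] <- myvalid[[target_col]]
parts$mytest[[target_col]]  <- mytest[[target_col]]
```

With this layout `parts$test` has the same predictor columns as the other sets and no target column. Whether the target should be part of the imputation model at all is a separate modelling decision that this sketch does not settle.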

ebrahimi
  • You should remove the target column before doing imputation. It doesn't make sense to use the target for imputation, when you are trying to infer the target! – shadowtalker Nov 19 '22 at 01:31
  • @shadowtalker Thanks a lot. As you said, the target variable should be removed; this is also stated [here](https://www.numpyninja.com/post/mice-algorithm-to-impute-missing-values-in-a-dataset). However, is it correct to use this approach for imputation on the validation and test sets? – ebrahimi Nov 19 '22 at 02:01
  • @shadowtalker Not necessarily. Keeping the output allows for a larger sample size and helps impute missing independent variables as well. See https://stats.stackexchange.com/questions/422876/including-dependent-variables-in-multiple-imputation-model-when-they-have-missin and https://doi.org/10.1186/s12874-016-0281-5 – dcsuka Nov 19 '22 at 02:02
  • How do you make out-of-sample predictions then? – shadowtalker Nov 19 '22 at 05:58
  • @shadowtalker [The answer to this](https://stackoverflow.com/questions/33500047/r-mice-machine-learning-re-use-imputation-scheme-from-train-to-test-set) suggests that for out-of-sample prediction we should set ignore to FALSE, but I am not sure about it. Thanks. – ebrahimi Nov 19 '22 at 06:12

0 Answers