How to apply MICE imputation to a test set?

Question

I have two separate data sets: one for training (1,000,000 observations) and one for testing (1,000,000 observations). I split the training set into three subsets (mytrain: 700,000 observations, myvalid: 150,000 observations, mytest: 150,000 observations). The test set with 1,000,000 observations does not include the target variable, so it is reserved for the final test. Since some categorical variables have missing values, I need to use mice to impute them, and I should reuse the imputation model fitted on mytrain to fill the missing values in the myvalid, mytest and test sets. Based on the answer to this question, I should do this:

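library(mice)

# Stack all four sets so that a single mice() call imputes them together
data2 <- rbind(mytrain, myvalid, mytest, test)
data2$ST_EMPL <- as.factor(data2$ST_EMPL)
data2$TYP_RES <- as.factor(data2$TYP_RES)

# ignore = TRUE marks rows that are imputed but not used to fit the model,
# so only the first 700000 rows (mytrain) build the imputation model
imp <- mice(data2, method = "cart", m = 1, maxit = 1, seed = 123,
            ignore = c(rep(FALSE, 700000), rep(TRUE, 1300000)))
data2.imp <- complete(imp, 1)
summary(imp)

# Split the completed data back into the original sets by row position
mytrainN <- data2.imp[1:700000, ]
myvalN   <- data2.imp[700001:850000, ]
mytestN  <- data2.imp[850001:1000000, ]
testN    <- data2.imp[1000001:2000000, ]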

However, since the test set does not have the target column, it cannot be combined with mytrain, myvalid, and mytest using rbind. Is it possible to add a placeholder target column (say, with the value 10 for all 1,000,000 observations) to the test set?

You should remove the target column before doing imputation. It doesn't make sense to use the target for imputation, when you are trying to infer the target! — shadowtalker, Nov 19 '22 at 01:31
@shadowtalker Thanks a lot. As you said, the target variable should be removed; this is also recommended [here](https://www.numpyninja.com/post/mice-algorithm-to-impute-missing-values-in-a-dataset). However, is it correct to use this approach for imputation on the validation and test sets? — ebrahimi, Nov 19 '22 at 02:01
@shadowtalker Not necessarily, keeping the output allows for a larger sample size and helps impute missing independent variables as well. See https://stats.stackexchange.com/questions/422876/including-dependent-variables-in-multiple-imputation-model-when-they-have-missin and https://doi.org/10.1186/s12874-016-0281-5 — dcsuka, Nov 19 '22 at 02:02
@shadowtalker [The answer to this](https://stackoverflow.com/questions/33500047/r-mice-machine-learning-re-use-imputation-scheme-from-train-to-test-set) suggests that for out-of-sample prediction we should set the ignore as FALSE, but I am not sure about it. Thanks. — ebrahimi, Nov 19 '22 at 06:12
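
One way to sidestep the missing target column, following the suggestion in the comments to leave the target out of the imputation model, might look like the sketch below. It is not a definitive recipe: the toy data, the `make_set` helper, and the `AGE` and `target` column names are assumptions made only so the example runs on its own, while `ST_EMPL`, `TYP_RES`, the set names, and the `mice()` settings are taken from the question. The `ignore` argument requires a version of mice that supports it. Note that one comment argues for keeping the target in the imputation model, so dropping it is a modelling choice rather than a rule.

```r
library(mice)

set.seed(123)

# Hypothetical helper that builds toy data standing in for the real sets.
# AGE and target are assumed column names; ST_EMPL and TYP_RES are from the question.
make_set <- function(n, with_target = TRUE) {
  df <- data.frame(
    AGE     = round(runif(n, 18, 80)),
    ST_EMPL = factor(sample(c("R", "T", "P"), n, replace = TRUE)),
    TYP_RES = factor(sample(c("A", "L", "P"), n, replace = TRUE))
  )
  # Knock out about 10% of each categorical variable
  df$ST_EMPL[sample(n, n %/% 10)] <- NA
  df$TYP_RES[sample(n, n %/% 10)] <- NA
  if (with_target) df$target <- rbinom(n, 1, 0.3)
  df
}

mytrain <- make_set(700)
myvalid <- make_set(150)
mytest  <- make_set(150)
test    <- make_set(1000, with_target = FALSE)  # no target column

# Keep only the predictor columns, so all four sets share the same layout
predictors <- setdiff(names(mytrain), "target")
stacked <- rbind(mytrain[predictors], myvalid[predictors],
                 mytest[predictors], test[predictors])

# Only the mytrain rows are used to fit the imputation model;
# the remaining rows are still imputed, but ignored during fitting
ignore_rows <- c(rep(FALSE, nrow(mytrain)),
                 rep(TRUE, nrow(stacked) - nrow(mytrain)))

imp <- mice(stacked, method = "cart", m = 1, maxit = 1, seed = 123,
            ignore = ignore_rows, printFlag = FALSE)
stacked_imp <- complete(imp, 1)

# Split the completed data back into the original sets
set_id <- rep(c("mytrain", "myvalid", "mytest", "test"),
              times = c(nrow(mytrain), nrow(myvalid), nrow(mytest), nrow(test)))
parts <- split(stacked_imp, factor(set_id, levels = unique(set_id)))

# Re-attach the target to the sets that have one
mytrainN <- cbind(parts$mytrain, target = mytrain$target)
myvalN   <- cbind(parts$myvalid, target = myvalid$target)
mytestN  <- cbind(parts$mytest,  target = mytest$target)
testN    <- parts$test
```

Because the target is removed before stacking and re-attached afterwards, no placeholder value ever has to be invented for the test set, and the target cannot leak into the imputed predictors.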
