I'd like to create a recipe using the recipes package that both imputes missing data and adds indicator columns flagging which values were missing. It would also be nice if there were an option to choose between adding an indicator column for every column in the original data frame and adding one only for columns that had missing data. I know I can easily impute missing values with recipes, but is there a built-in way to add missing-indicator columns?
For example, if I had a data frame like this:
> data.frame(x = c(1, NA, 3), y = 4:6)
   x y
1  1 4
2 NA 5
3  3 6
I would expect that the output after imputation and adding a missing indicator column would look something like this:
  x y x_missing
1 1 4     FALSE
2 2 5      TRUE
3 3 6     FALSE
Of course, for a simple example like that, I could do it by hand. But when working with a large data set in a machine learning pipeline, it would be helpful to have an automated way to do it.
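To make the by-hand version concrete, here is a minimal base R sketch on the toy data frame above. It assumes mean imputation (which happens to give the 2 shown in the expected output) and only flags columns that actually contain NA. The `_missing` suffix is just the naming used in this question, not anything from recipes.

```r
# Toy data frame from the example above.
df <- data.frame(x = c(1, NA, 3), y = 4:6)

# Only columns that actually contain missing values get an indicator.
# (Use names(df) instead to add an indicator for every column.)
cols_with_na <- names(df)[colSums(is.na(df)) > 0]

for (col in cols_with_na) {
  missing <- is.na(df[[col]])

  # Record which rows were missing before imputing.
  df[[paste0(col, "_missing")]] <- missing

  # Assumption: impute with the column mean; any other rule could go here.
  df[[col]][missing] <- mean(df[[col]], na.rm = TRUE)
}

print(df)
```

This works for a handful of columns, but it does not remember the training-set means for new data, which is the part I would want a recipe step to handle.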
According to the docs for recipes::check_missing, there is a columns argument:
columns A character string of variable names that will be populated (eventually) by the terms argument.
but I'm not sure what that means, since there is no terms argument to check_missing.
For reference, the functionality I'm looking for is implemented in scikit-learn by the MissingIndicator class.