I'm having a problem with partykit weighted conditional tree models trained on data with missing values.
I'm manually creating a bagged tree model by giving different integer weights to observations at each cycle.
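For readers who want to picture the setup, here is a minimal sketch of the weighting scheme, not my exact code. It assumes the built-in `iris` data as a stand-in for the real data frame; `bootstrap_weights()` and `n_trees` are hypothetical names used only for illustration. The weights are passed through the `weights` argument of `partykit::ctree()`.

```r
# Stand-in data: the built-in iris set replaces the real 299-row data frame.
dat <- iris
n <- nrow(dat)
n_trees <- 5  # hypothetical number of bagging cycles

# Hypothetical helper: draw a bootstrap sample of size n and count how often
# each row was picked. The counts are non-negative integers that sum to n.
bootstrap_weights <- function(n) {
  tabulate(sample.int(n, size = n, replace = TRUE), nbins = n)
}

set.seed(42)
weights_list <- replicate(n_trees, bootstrap_weights(n), simplify = FALSE)

# Fit one weighted tree per cycle (skipped if partykit is not installed).
if (requireNamespace("partykit", quietly = TRUE)) {
  trees <- vector("list", n_trees)
  for (i in seq_len(n_trees)) {
    w <- weights_list[[i]]
    trees[[i]] <- partykit::ctree(Species ~ ., data = dat, weights = w)
  }
  # Each tree is expected to return one prediction per input row.
  pred_lengths <- sapply(trees, function(tr) length(predict(tr, newdata = dat)))
  print(pred_lengths)
}
```

Rows with weight 0 are the out-of-bag observations for that cycle.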
But when I used the bootstrapped models to make predictions, I noticed that some of them returned fewer values than there are rows in the input data. Interestingly, for 299 rows of input data, the length of the predictions was either 299 or 289, and 289 is the number of rows left after removing the rows with missing predictor values.
Digging into the problem, I found that it arises from the interaction of three components:
- Using weights in the model;
- Having missing data in the predictors;
- Using character variables instead of factors in the input data passed to predict().
If any one of these three conditions is absent, the problem doesn't arise and all trees return 299 values.
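Since removing any one condition makes the problem go away, one possible stopgap is to convert character columns to factors before calling predict(), and to check the prediction length explicitly. The sketch below uses base R only and assumes the training data holds the corresponding columns as factors; `chars_to_factors()` and `same_length()` are hypothetical helpers, not partykit functions, and the tiny data frames are made up for illustration.

```r
# Hypothetical helper: convert character columns of `newdata` to factors,
# reusing the levels of the matching factor column in `train`.
chars_to_factors <- function(newdata, train) {
  for (nm in intersect(names(newdata), names(train))) {
    if (is.character(newdata[[nm]]) && is.factor(train[[nm]])) {
      newdata[[nm]] <- factor(newdata[[nm]], levels = levels(train[[nm]]))
    }
  }
  newdata
}

# Hypothetical helper: TRUE if there is one prediction per input row.
same_length <- function(pred, newdata) {
  NROW(pred) == nrow(newdata)
}

# Made-up example data with a missing value in each predictor.
train <- data.frame(g = factor(c("a", "b", "a")), x = c(1, NA, 3))
newdata <- data.frame(g = c("a", "b", NA), x = c(1, NA, 3),
                      stringsAsFactors = FALSE)

fixed <- chars_to_factors(newdata, train)
str(fixed)  # g is now a factor; missing values stay NA and no rows are dropped
```

With a fitted tree, `same_length(predict(tree, newdata = fixed), fixed)` would then flag any model that silently drops rows.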
Here is the data: https://www.dropbox.com/s/98oriv2msce4wu5/anonym_data.rds?dl=0
Here is a script to reproduce the problem: https://www.dropbox.com/s/5y7g2dwt2838pbp/test.R?dl=0