While studying Spark, I keep running into application crashes with stack traces like "key not found: xxx". After struggling with the unclear message in the crash trace, I found the cause: the test data contains a value that the training data does not.
For example:
A categorical feature has 4 unique values (1, 2, 3, 4). After splitting the data into training and test sets, the training set contains only (1, 2, 3) for this feature, while the test set contains (..., 4). After training the model, the application crashes when evaluating the model against the test set.
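The failure mode is easy to reproduce. Here is a simplified Scala sketch of the kind of pipeline that hits it (my real job is more involved; VectorIndexer, the column name, and the toy data here are just placeholders to illustrate):

```scala
import org.apache.spark.ml.feature.VectorIndexer
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("repro").getOrCreate()

// Simulate an unlucky split: category 4.0 appears only in the test set.
val train = spark.createDataFrame(
  Seq(1.0, 2.0, 3.0, 1.0, 2.0).map(v => Tuple1(Vectors.dense(v)))
).toDF("features")
val test = spark.createDataFrame(
  Seq(2.0, 4.0).map(v => Tuple1(Vectors.dense(v)))
).toDF("features")

// The indexer learns its category -> index map from the training split
// only, so 4.0 never makes it into the map.
val indexerModel = new VectorIndexer()
  .setInputCol("features")
  .setOutputCol("indexedFeatures")
  .setMaxCategories(10)
  .fit(train)

// Throws java.util.NoSuchElementException: key not found: 4.0,
// because 4.0 is missing from the learned category map.
indexerModel.transform(test).show()
```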
Is there a best practice for handling this situation during data pre-processing, or some other way to avoid it?
Update with more details:

I am training a decision tree with a couple of categorical features and some numerical features. With a 70/30 training/test split, evaluation fails with "Caused by: java.util.NoSuchElementException: key not found: 5.0". Then I changed the training/test split to 100/30 and the errors were gone.

So I think the issue comes from categorical values that are missing from the training data, and I need an approach that avoids this kind of situation.
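The only workaround I can think of is to fit the indexer on the full dataset before splitting, so the learned category map covers every value. A minimal sketch continuing the toy data above (again, VectorIndexer and the names are just my placeholders):

```scala
import org.apache.spark.ml.feature.VectorIndexer
import org.apache.spark.ml.linalg.Vectors

// The full dataset, before any split (same toy data as above).
val fullData = spark.createDataFrame(
  Seq(1.0, 2.0, 3.0, 1.0, 2.0, 4.0).map(v => Tuple1(Vectors.dense(v)))
).toDF("features")

// Fit the indexer on the FULL dataset so the category -> index map
// covers every value, including 4.0 ...
val indexerModel = new VectorIndexer()
  .setInputCol("features")
  .setOutputCol("indexedFeatures")
  .setMaxCategories(10)
  .fit(fullData)

// ... then split the already-indexed data 70/30.
val Array(trainingData, testData) =
  indexerModel.transform(fullData).randomSplit(Array(0.7, 0.3), seed = 42)

// Both splits now share one category map, so evaluating a model trained
// on trainingData against testData no longer hits "key not found".
```

If I understand the docs correctly, newer Spark releases also add a handleInvalid option to the indexers, which might be a cleaner fix, but I am not sure which approach counts as best practice here.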