
During my Spark studies, I keep running into application crashes with a trace like "key not found: xxx". After struggling with the unclear crash message, I found the cause: the test data contains a value that the training data does not.
For example:
There is a categorical feature with 4 unique values (1, 2, 3, 4). After splitting the data into training/test sets, the training set contains only (1, 2, 3) for this feature, while the test set contains (..., 4). After training the model, the application crashes when evaluating the model on the test set.

Is there any best practice for handling this situation during data pre-processing, or is there a way to avoid it?


Update with more details:

  1. I am training a decision tree with a couple of categorical features and some numerical features.

  2. The training/test split is 70/30, and evaluation fails with "Caused by: java.util.NoSuchElementException: key not found: 5.0".

  3. When I changed the training/test split to 100/30, the errors were gone.

So I think the issue comes from categorical values missing from the training data, and I need an approach to avoid this kind of situation.
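The failure mode is easy to reproduce outside Spark. The following is a minimal plain-Python sketch (the `index` map and `encode` helper are hypothetical stand-ins for the category-to-index mapping a model builds from training data):

```python
# Sketch of the failure mode: an index built only from training data
# raises on a category that appears solely in the test split.
train_values = [1, 2, 3]          # categories seen during training
test_values = [2, 4]              # 4 never appeared in training

# Build a category -> index map from the training data only.
index = {v: i for i, v in enumerate(sorted(set(train_values)))}

def encode(value):
    return index[value]           # raises KeyError for unseen values

encoded = []
for v in test_values:
    try:
        encoded.append(encode(v))
    except KeyError as e:
        encoded.append(f"crash: key not found: {e}")

print(encoded)  # the unseen value 4 triggers the lookup failure
```

This is exactly the shape of the `NoSuchElementException: key not found: 5.0` above: the model's internal mapping has no entry for the unseen category.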

asked by Ryan
  • Have you considered stratified sampling? Here is my answer about it https://stackoverflow.com/a/32241887/3415409 – eliasah Jun 07 '17 at 07:38
  • Stratified sampling can only be a solution in a purely lab environment. When you let a solution loose in the wild, you can and will come across the problem the poster describes, and no sampling will fix it. – Ben Horsburgh Jun 07 '17 at 09:07
  • Ryan, can you give us a slightly more specific example? Are you using a model in Spark, or is this related to a scala issue like using a Map? If a category is not defined by a finite known set of values, then you need to look at the problem slightly differently. If it is a known finite set of values, these should be input to the model description from a dictionary / taxonomy / etc, rather than be data driven. – Ben Horsburgh Jun 07 '17 at 09:10
  • Thanks @eliasah, but does it support multiple features? – Ryan Jun 07 '17 at 09:34
  • It doesn't. That's not even possible: when you partition data, you always order the keys (primary, secondary, etc.), so at some point you can't guarantee the distribution of the non-primary keys across your data set. – eliasah Jun 07 '17 at 09:39
  • @eliasah, you mean stratified sampling doesn't support multiple features? – Ryan Jun 07 '17 at 09:50
  • This is what I'm trying to say https://stats.stackexchange.com/questions/22662/stratified-sampling-with-multiple-variables – eliasah Jun 07 '17 at 09:51

1 Answer


Use stratified sampling.

You split your data set by label, then sample within each label.

Then you join and shuffle all the labels again afterwards.
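The three steps above can be sketched in plain Python (the `stratified_split` helper and its parameters are illustrative, not a Spark API; Spark itself offers `sampleBy` on DataFrames for per-key sampling fractions):

```python
import random
from collections import defaultdict

def stratified_split(rows, label_of, test_frac=0.3, seed=42):
    """Split rows into train/test by sampling within each label group,
    so every label is represented in both splits (when possible)."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for row in rows:
        by_label[label_of(row)].append(row)   # split by label

    train, test = [], []
    for label, group in by_label.items():
        rng.shuffle(group)                    # sample within each label
        k = int(round(len(group) * test_frac))
        test.extend(group[:k])
        train.extend(group[k:])

    rng.shuffle(train)                        # join and shuffle afterwards
    rng.shuffle(test)
    return train, test

data = [(x, x % 3) for x in range(30)]        # (feature, label) pairs
train, test = stratified_split(data, label_of=lambda r: r[1])
```

With 10 rows per label and a 30% test fraction, each label contributes rows to both splits, so no label is missing from training.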

You can try the same for categorical attributes. But of course you may eventually observe a value you have never seen before. A good implementation should not crash on that!
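One defensive pattern is to reserve an explicit "unknown" bucket for values never seen at training time (in Spark ML, `StringIndexer` offers a `handleInvalid` parameter with a "keep" option for the same purpose). The `CategoryEncoder` below is a hypothetical plain-Python sketch of that idea, not Spark's implementation:

```python
class CategoryEncoder:
    """Encoder that reserves index 0 for unseen ('unknown') values
    instead of raising on them."""
    UNKNOWN = 0

    def __init__(self, training_values):
        # Known categories start at index 1; 0 is the unknown bucket.
        self.index = {v: i + 1
                      for i, v in enumerate(sorted(set(training_values)))}

    def encode(self, value):
        # dict.get with a default never raises on an unseen value.
        return self.index.get(value, self.UNKNOWN)

enc = CategoryEncoder([1, 2, 3])
print([enc.encode(v) for v in [1, 2, 3, 4]])  # → [1, 2, 3, 0]
```

Evaluation then proceeds even when the test split contains a category missing from training; the model simply treats it as "unknown".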

answered by Has QUIT--Anony-Mousse
  • If I split the data set by label, say A, B, C, and sample, there is still a chance that I get a value of a categorical feature that was not in the sample, so I still won't pass the test evaluation. – Ryan Jun 08 '17 at 00:48
  • Well, your code must be able to handle this anyway. Because eventually a new category may appear... – Has QUIT--Anony-Mousse Jun 08 '17 at 07:48