I'm training a model on a dataset with the caret package. My class variable has 7 levels, whose labels I created from the dataset documentation. It turns out that one of the levels has no data whatsoever in the dataset, and I'm getting the following error... Error in train.default(x, y, weights = w, ...) : One or more factor levels in the outcome has no data: 'vwnfp'. The easy way should be to just get rid of that level, and that should work. But I'm wondering whether the caret package has any parameter that can handle this type of situation. I did try adding na.action = 'na.omit'. I also wonder whether the preProcess argument can handle this, but I have never used preProcess before and my attempts were unsuccessful. Here is my code to train the model...

fit.control <- trainControl(method = 'cv', number = 10)
grid <- expand.grid(cp = seq(0, 0.05, 0.005))
trained.tree <- train(Type_of_glass ~ ., data = data.train, method = 'rpart',
                  trControl = fit.control, metric = 'Accuracy', maximize = TRUE,
                  tuneGrid = grid, na.action = 'na.omit')

The dataset is in the following url... http://archive.ics.uci.edu/ml/machine-learning-databases/glass/glass.data

This is the code I'm utilizing to manipulate the dataset...

# Load the dataset (the file has no header row) and transform it
data <- read.csv(file = 'data.csv',
             header = FALSE)
colnames(data) <- c('Id', 'Ri', 'Na', 'Ma', 'Al', 
                'Si', 'K', 'Ca', 'Ba', 'Fe', 
                'Type_of_glass')
str(data)
data <- subset(data, select = -Id)
data$Type_of_glass <- factor(data$Type_of_glass, 
                         levels = c(1, 2, 3, 4, 5, 6, 7), 
                         labels = c('bwfp', 'bwnfp', 'vwfp', 'vwnfp', 
                                    'c', 't', 'h'))
str(data)

# Splitting into training and test datasets
set.seed(2)
sample.train <- sample(1:nrow(data), nrow(data) * .8)
sample.test <- setdiff(1:nrow(data), sample.train)
data.train <- data[sample.train, ]
data.test <- subset(data[sample.test, ], select = -Type_of_glass)

I don't want to manually get rid of the level because in production, after training, the unseen dataset is passed through the model as is. How can I handle this situation in the dataset?
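
For reference, a minimal sketch of the "get rid of that level" option in base R, without hard-coding which level is empty. The vector `codes` is a made-up stand-in for the Type_of_glass column (an assumption for illustration, not the real data), and `type_of_glass_clean` is a hypothetical name. Since data.test above already has Type_of_glass removed, dropping the level touches only the training outcome, not the unseen predictors.

```r
# Toy stand-in for the Type_of_glass column: code 4 ('vwnfp') never occurs.
codes <- c(1, 2, 3, 5, 6, 7, 1, 2, 2, 7)
type_of_glass <- factor(codes,
                        levels = 1:7,
                        labels = c('bwfp', 'bwnfp', 'vwfp', 'vwnfp',
                                   'c', 't', 'h'))

# Count observations per level to find the levels that have no data.
level_counts <- table(type_of_glass)
empty_levels <- names(level_counts)[level_counts == 0]

# droplevels() leaves the observations untouched and removes only the
# levels that are unused, whatever they happen to be.
type_of_glass_clean <- droplevels(type_of_glass)
```

Applied to the question's data, this would presumably be `data.train$Type_of_glass <- droplevels(data.train$Type_of_glass)` just before the call to train().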

redeemefy
  • That's a bad training/test split. If your data isn't large enough to make the chance of such misfortune negligible you should either check post-hoc to make sure it doesn't happen (and re-split if it does) or do a stratified split to guarantee it doesn't happen. – Gregor Thomas Oct 28 '16 at 17:20
  • See the discussion [here](http://stackoverflow.com/questions/4285214/predict-lm-with-an-unknown-factor-level-in-test-data). Basically, delete the unknown level. You do not have data to train on it, and as long as it is a factor level most models will require some data for it. – phiver Oct 29 '16 at 09:09
  • I did just that and it worked. I'm wondering if this is the approach to take when the model is in production with millions of rows. – redeemefy Oct 29 '16 at 13:15
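
The stratified split and post-hoc check suggested in the first comment could look roughly like the sketch below. It uses base R only, and the outcome `y` is invented toy data (an assumption, not the glass dataset); caret's createDataPartition() is designed for the same kind of class-balanced split. Note that stratifying only guards against a class going missing by chance. It cannot help with a level that has no rows at all.

```r
set.seed(2)

# Hypothetical toy outcome with one rare class.
y <- factor(rep(c('bwfp', 'bwnfp', 'c'), times = c(40, 50, 10)))

# Sample about 80% of the row indices within each class separately.
idx_by_class <- split(seq_along(y), y)
train_idx <- unlist(lapply(idx_by_class, function(idx) {
  idx[sample.int(length(idx), size = ceiling(0.8 * length(idx)))]
}), use.names = FALSE)
test_idx <- setdiff(seq_along(y), train_idx)

# Post-hoc check: list any level that ended up with no training rows.
missing_in_train <- levels(y)[table(y[train_idx]) == 0]
```

If `missing_in_train` is non-empty, the split can be redone with a different seed or the empty levels dropped before training.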
