I'm training a model on a dataset with the caret package. My class variable has 7 levels, whose labels I took from the dataset documentation. It turns out that one of the levels has no observations at all in the dataset, and I get the following error:

Error in train.default(x, y, weights = w, ...) : One or more factor levels in the outcome has no data: 'vwnfp'

The easy fix would be to drop that level, and that should work. But I'm wondering whether the caret package has any parameter that can handle this kind of situation. I did try adding na.action = 'na.omit'. I also wonder whether the preProcess argument can handle this, but I have never used preProcess before and my attempts were unsuccessful. Here is my code to train the model:
fit.control <- trainControl(method = 'cv', number = 10)
grid <- expand.grid(cp = seq(0, 0.05, 0.005))
trained.tree <- train(Type_of_glass ~ ., data = data.train, method = 'rpart',
                      trControl = fit.control, metric = 'Accuracy', maximize = TRUE,
                      tuneGrid = grid, na.action = 'na.omit')
The dataset is available at the following URL: http://archive.ics.uci.edu/ml/machine-learning-databases/glass/glass.data
This is the code I use to prepare the dataset:
# Load the dataset and transform it
data <- read.csv(file = 'data.csv',
                 header = FALSE)
colnames(data) <- c('Id', 'Ri', 'Na', 'Ma', 'Al',
                    'Si', 'K', 'Ca', 'Ba', 'Fe',
                    'Type_of_glass')
str(data)
data <- subset(data, select = -Id)
data$Type_of_glass <- factor(data$Type_of_glass,
                             levels = c(1, 2, 3, 4, 5, 6, 7),
                             labels = c('bwfp', 'bwnfp', 'vwfp', 'vwnfp',
                                        'c', 't', 'h'))
str(data)
# Splitting into training and test datasets
set.seed(2)
sample.train <- sample(1:nrow(data), nrow(data) * .8)
sample.test <- setdiff(1:nrow(data), sample.train)
data.train <- data[sample.train, ]
data.test <- subset(data[sample.test, ], select = -Type_of_glass)
I don't want to remove the level manually because in production, after training, the unseen dataset is passed through the model as is. How can I handle this situation in the dataset?
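For reference, the "easy way" mentioned above can be sketched in base R without hard-coding which level to remove. This is only a minimal illustration on a made-up toy data frame (the `toy` object, its `x` column and the `codes` values are hypothetical and not taken from the glass data): it detects outcome levels with zero observations and drops them with `droplevels()`, leaving all rows in place.

```r
# Toy outcome with the same 7 labels as in the question; code 4 ('vwnfp')
# never occurs. The values are invented purely for illustration.
codes <- c(1, 2, 3, 5, 6, 7, 1, 2, 7, 5)
toy <- data.frame(
  x = seq_along(codes),
  Type_of_glass = factor(codes,
                         levels = 1:7,
                         labels = c('bwfp', 'bwnfp', 'vwfp', 'vwnfp',
                                    'c', 't', 'h'))
)

# Outcome levels that have no observations (the condition train() reports)
empty.levels <- names(which(table(toy$Type_of_glass) == 0))
print(empty.levels)

# Drop unused levels from the outcome only; no rows are removed
toy$Type_of_glass <- droplevels(toy$Type_of_glass)
print(levels(toy$Type_of_glass))
```

Because the outcome column is removed from data.test in the code above, dropping an unused outcome level before training should not require any change to the unseen data itself; the fitted model simply would never predict the empty class.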