Error using R caret package (train) with C5.0 decision tree to do K-fold cross validation

Question

NOW SOLVED. The problem was the data=OneT.train argument, which was copied over from the original code. It needs to be data=OneT in the caret train() function. The current OneT.train came from reloading and splitting the data without the step that fills missing values, so it had missing values in an attribute field, not in the target (that split was never used).

I'm trying to use the caret package to do repeated k-fold cross validation with C5.0 decision trees.

The following code generates a working C5.0 decision tree (68% accuracy on confusion matrix):

> model <- C5.0(as.factor(OneGM) ~., data=OneT.train)  
> results <- predict(object=model, newdata=OneT.test, type="class")

The caret package code gives these errors (there are no missing values):

> train_control <- trainControl(method="repeatedcv", number=10, repeats=10)  
> model <- train(as.factor(OneGM) ~., data=OneT.train, trControl=train_control, method="C5.0")  
Error in na.fail.default(list(`as.factor(OneGM)` = c(1L, 1L, 1L, 1L,  : missing values in object  
> model <- train(OneGM ~., data=OneT.train, trControl=train_control, method="C5.0")  
Error in na.fail.default(list(OneGM = c(FALSE, FALSE, FALSE, FALSE,  : missing values in object

The data is loaded from a .csv file. OneGM is either TRUE or FALSE (a text column in the .csv), and it does not have any missing values.
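
A complete target column is not enough to rule this error out: the `na.fail` call named in the error message checks the whole model frame built from the formula, predictors included. The sketch below uses only base R and a toy data frame (`toy_train` and its columns `x1` and `x2` are made up for illustration; the real column names are not shown in this post) to count missing values per column and reproduce the same message.

```r
# Toy stand-in for OneT.train (hypothetical columns and values)
toy_train <- data.frame(
  x1    = c(0.259, NA, 0.143, 0.229),
  x2    = c(0.33, 0.24, 0.34, 0.35),
  OneGM = c(TRUE, TRUE, TRUE, FALSE)
)

# Count missing values in each column; the target is complete, x1 is not
na_per_column <- colSums(is.na(toy_train))
print(na_per_column)

# Build the model frame without dropping or rejecting NAs ...
mf <- model.frame(OneGM ~ ., data = toy_train, na.action = na.pass)

# ... then apply na.fail() to it and capture the error message
fail_msg <- tryCatch(
  {
    na.fail(mf)
    "no missing values"
  },
  error = function(e) conditionMessage(e)
)
print(fail_msg)
```

On the real data, `colSums(is.na(OneT.train))` would be the equivalent check for finding which column carries the missing values.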

I would like to use the one-line caret package approach above (which I've seen used in multiple places), and I'm not looking for solutions that do cross validation manually.

There are no missing values (OneT[is.na(OneT)] <- 0). Here is a sample of the data:

27255   0.259   0.333737266 0.308428966 0   0.017311609 TRUE  
37630   0.258   0.244679265 0.490752807 0   0.024630542 TRUE  
174019  0.143   0.343331217 0.439601992 0.005839996 0.075093867 TRUE  
97817   0.229   0.352818839 0.430965134 0   0.044375645 FALSE  
1293189 0.158   0.248084815 0.620642943 0.007529383 0.081914031 FALSE  
19652   0.259   0.17180665  0.176233943 0   0.02372035  TRUE  
141966  0.13    0.41610721  0.546760618 0.014796511 0.052060738 FALSE  
48990   0.225   0.061461912 0.56626295  0.019634793 0.062931034 TRUE
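
One caveat about the `OneT[is.na(OneT)] <- 0` fill mentioned above: it changes only the object named `OneT`. A subset created before the fill (or from a fresh reload) keeps its missing values. A minimal base R sketch, with `full` and `train_part` as hypothetical stand-ins for `OneT` and `OneT.train`:

```r
# Toy stand-in for the full data set (hypothetical values)
full <- data.frame(
  x1    = c(0.1, NA, 0.3, 0.4),
  OneGM = c(TRUE, FALSE, TRUE, FALSE)
)

# Split first ...
train_part <- full[1:3, ]

# ... then fill missing values in the full data frame only
full[is.na(full)] <- 0

# The filled object is clean; the earlier split is an independent copy
full_has_na  <- anyNA(full)
train_has_na <- anyNA(train_part)
print(c(full = full_has_na, train_part = train_has_na))
```

Filling before splitting, or re-running the fill on each split, would avoid the mismatch.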

Thanks for any help.

Hi Larry, the error says there are missing values in the `OneGM` column of `OneT.train`. You may find it helpful to address that issue. If you aren't sure how to do that, it will be much easier to help if you provide at least a sample of your data with `dput(OneT.train)`. See [How to make a reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) for more info. — Ian Campbell, Apr 07 '20 at 14:36
The head() and tail() don't show any gaps in OneGM values either. — , Apr 07 '20 at 14:46
The problem was the data=OneT.train argument, which was copied over from the original code. It needs to be data=OneT in the caret train() function. The current OneT.train came from reloading and splitting the data without the step that fills missing values, so it had missing values in an attribute field, not in the target (that split was never used). — , Apr 07 '20 at 15:46
