
NOW SOLVED. The problem was data=OneT.train in the caret train() call, which had been copied over from the original code. It needs to be data=OneT. The current OneT.train had missing values in an attribute field (not the target), left over from a reload and split of the data that skipped the missing-value fill.


I'm trying to use the caret package to do repeated k-fold cross validation with C5.0 decision trees.

The following code generates a working C5.0 decision tree (68% accuracy on confusion matrix):

> model <- C5.0(as.factor(OneGM) ~., data=OneT.train)  
> results <- predict(object=model, newdata=OneT.test, type="class")

The caret package code gives these errors (there are no missing values):

> train_control <- trainControl(method="repeatedcv", number=10, repeats=10)  
> model <- train(as.factor(OneGM) ~., data=OneT.train, trControl=train_control, method="C5.0")  
Error in na.fail.default(list(`as.factor(OneGM)` = c(1L, 1L, 1L, 1L,  : missing values in object  
> model <- train(OneGM ~., data=OneT.train, trControl=train_control, method="C5.0")  
Error in na.fail.default(list(OneGM = c(FALSE, FALSE, FALSE, FALSE,  : missing values in object  
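
For context, a formula interface with na.fail checks every column of the model frame, so an NA in any predictor produces this message even when the target is complete. A minimal base-R sketch with made-up data (the frame `toy` and the columns `x1`, `x2` are hypothetical, not the real data):

```r
# Hypothetical toy data: the target OneGM is complete, one predictor has an NA.
toy <- data.frame(
  x1    = c(0.259, NA, 0.143, 0.229),
  x2    = c(0.334, 0.245, 0.343, 0.353),
  OneGM = c(TRUE, TRUE, TRUE, FALSE)
)

# Build the model frame with na.fail and capture the error message, if any.
msg <- tryCatch(
  {
    model.frame(OneGM ~ ., data = toy, na.action = na.fail)
    "no error"
  },
  error = function(e) conditionMessage(e)
)
print(msg)

# The target itself has no missing values; the NA sits in x1.
print(anyNA(toy$OneGM))
print(colSums(is.na(toy)))
```

So a clean target column does not rule out this error; the per-column counts show where the gap actually is.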

The data is loaded from a .csv file, and OneGM is either TRUE or FALSE (a text column in the .csv) - and it does not have any missing values.

I would like to use the one-line caret package approach above (which I've seen used in multiple places), and I'm not looking for solutions that do cross validation manually.

There are no missing values (OneT[is.na(OneT)] <- 0). Here is a sample of the data:

27255   0.259   0.333737266 0.308428966 0   0.017311609 TRUE  
37630   0.258   0.244679265 0.490752807 0   0.024630542 TRUE  
174019  0.143   0.343331217 0.439601992 0.005839996 0.075093867 TRUE  
97817   0.229   0.352818839 0.430965134 0   0.044375645 FALSE  
1293189 0.158   0.248084815 0.620642943 0.007529383 0.081914031 FALSE  
19652   0.259   0.17180665  0.176233943 0   0.02372035  TRUE  
141966  0.13    0.41610721  0.546760618 0.014796511 0.052060738 FALSE  
48990   0.225   0.061461912 0.56626295  0.019634793 0.062931034 TRUE  
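
One way to verify the no-missing-values claim on the exact object passed to train() is to count NAs per column in each data frame. The sketch below uses a small made-up stand-in (`raw` and the column names `size`, `rate` are hypothetical) and assumes the split frame came from a fresh reload rather than from the filled OneT:

```r
# Hypothetical stand-in for a freshly reloaded file with one gap in a predictor.
raw <- data.frame(
  size  = c(27255, 37630, 174019, 97817, 1293189, 19652),
  rate  = c(0.259, 0.258, NA, 0.229, 0.158, 0.259),
  OneGM = c(TRUE, TRUE, TRUE, FALSE, FALSE, TRUE)
)

# Filling one copy does not touch frames derived from the unfilled data.
OneT <- raw
OneT[is.na(OneT)] <- 0
OneT.train <- raw[1:4, ]  # split taken from the unfilled reload (assumption)

# Count missing values per column in each frame.
na_full  <- colSums(is.na(OneT))
na_train <- colSums(is.na(OneT.train))
print(na_full)
print(na_train)

# Show the training rows that contain at least one NA.
print(OneT.train[!complete.cases(OneT.train), ])
```

Checking `OneT.train` directly, rather than `OneT`, is what matters here, because the fill only changes the object it is applied to.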

Thanks for any help.

  • Hi Larry, the error says there are missing values in the `OneGM` column of `OneT.train`. You may find it helpful to address that issue. If you aren't sure how to do that, it will be much easier to help if you provide at least a sample of your data with `dput(OneT.train)`. See [How to make a reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) for more info. – Ian Campbell Apr 07 '20 at 14:36
  • I generated that column in Excel using an "=A2 –  Apr 07 '20 at 14:43
  • The head() and tail() don't show any gaps in OneGM values either. –  Apr 07 '20 at 14:46
  • The problem was data=OneT.train. This code was copied over from the original. It needs to be data=OneT in the caret train() function. The current OneT.train had missing values in an attribute field, not the target, from a reloading and splitting of the data, that omitted filling missing values - which was never used. –  Apr 07 '20 at 15:46
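
The fix described in the last comment can be sketched as: fill the gaps first, then pass the filled frame to train(). The helper name `fit_c50_cv` and the toy data are hypothetical; the trainControl() and train() arguments are the ones from the question. The caret call is wrapped in a function and left uncalled, since it needs the caret and C50 packages installed.

```r
# Hypothetical stand-in for the loaded .csv (one predictor value is missing).
OneT <- data.frame(
  rate  = c(0.259, 0.258, NA, 0.229, 0.158, 0.259, 0.130, 0.225),
  OneGM = c(TRUE, TRUE, TRUE, FALSE, FALSE, TRUE, FALSE, TRUE)
)

# Fill missing values before any modelling, as in the question.
OneT[is.na(OneT)] <- 0

# Make the target a factor up front so it is treated as a class label.
OneT$OneGM <- as.factor(OneT$OneGM)

# Hypothetical helper: repeated 10-fold CV with C5.0 via caret.
# Requires the caret and C50 packages; it is defined here but not called.
fit_c50_cv <- function(data) {
  train_control <- caret::trainControl(method = "repeatedcv",
                                       number = 10, repeats = 10)
  caret::train(OneGM ~ ., data = data,
               trControl = train_control, method = "C5.0")
}

# Usage on the real data (assumes enough rows for 10 folds):
# model <- fit_c50_cv(OneT)
# print(model)
```

Because caret does the resampling itself, the full filled frame goes in as `data`, and no separate train/test split is needed for the cross-validated estimate.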
