I would like to use the sbf function of the caret package in R to perform feature selection and classification with method "ranger", because training with method "rf" takes a very long time.

When I get to the point of performing model training with sbf, I always encounter the error message:

Error in { : task 1 failed - "undefined columns selected" 

For background: My original data set consists of approx. 6200 observations and approx. 15200 features with binary feature representation, which should be reduced to approx. 1700 features. The classification problem is binary.

I made a reproducible sample similar to my original data set and it ends with the same error message. I also added the output and the session info.

Can someone please help me figure out how this problem can be circumvented?

Source code

library(doSNOW)
library(caret)
library(entropy)
library(ranger)

# setup elements for sbf functions
igfit <- caretSBF

# score function
multiigScore <- function(x, y) {
  uniigScore <- function (x, y) {
    library(entropy)
    # make x binary
    xbinary <- as.numeric(x>0)
    ybinary <- as.numeric(y==levels(y)[1])
    # make a joint frequency table
    disc <- discretize2d(xbinary, ybinary, 2, 2, r1=c(0,1), r2=c(0,1))
    # calculate ig score
    ig_score<-mi.empirical(disc)
    as.numeric(ig_score)
  }
  apply(x, 2, uniigScore, y=y)
}

igfit$score <- multiigScore

# filter function
igfit$filter <- function (score, x, y) rank(score, ties.method = "first") <= 5

# data
x <- 0:1
y <- c("a", "b")
train_y <- as.factor(sample(y, 100, replace = TRUE))
train_x <- data.frame(sample(x, 100, replace = TRUE), 
                      sample(x, 100, replace = TRUE), 
                      sample(x, 100, replace = TRUE), 
                      sample(x, 100, replace = TRUE), 
                      sample(x, 100, replace = TRUE), 
                      sample(x, 100, replace = TRUE))
names(train_x) <-c("c", "d", "e", "f", "g", "h")

# control objects
custom_ctrl <- trainControl(method = "none")
sbf_ctrl <- sbfControl(functions = igfit, 
                       method = "cv", number = 10, 
                       multivariate = TRUE, allowParallel = TRUE, 
                       saveDetails = TRUE, returnResamp = "final", verbose = TRUE)

sbf_fit <- sbf(train_x, train_y, 
               trControl = custom_ctrl,
               sbfControl = sbf_ctrl,
               method = "ranger",
               tuneGrid = expand.grid(mtry=c(2)))

Output

Error in { : task 1 failed - "undefined columns selected"

Session info

R version 3.2.5 (2016-04-14)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

locale:
[1] LC_COLLATE=German_Germany.1252 
[2] LC_CTYPE=German_Germany.1252   
[3] LC_MONETARY=German_Germany.1252
[4] LC_NUMERIC=C                   
[5] LC_TIME=German_Germany.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods  
[7] base     

other attached packages:
 [1] randomForest_4.6-12 e1071_1.6-7         ranger_0.5.0       
 [4] entropy_1.2.1       caret_6.0-71        ggplot2_2.1.0      
 [7] lattice_0.20-33     doSNOW_1.0.14       snow_0.4-1         
[10] iterators_1.0.8     foreach_1.4.3      

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.7        magrittr_1.5       splines_3.2.5     
 [4] MASS_7.3-45        munsell_0.4.3      colorspace_1.2-6  
 [7] minqa_1.2.4        stringr_1.1.0      car_2.1-3         
[10] plyr_1.8.4         tools_3.2.5        parallel_3.2.5    
[13] nnet_7.3-12        pbkrtest_0.4-6     grid_3.2.5        
[16] gtable_0.2.0       nlme_3.1-125       mgcv_1.8-12       
[19] quantreg_5.29      class_7.3-14       MatrixModels_0.4-1
[22] lme4_1.1-12        Matrix_1.2-4       nloptr_1.0.4      
[25] reshape2_1.4.1     codetools_0.2-14   stringi_1.1.1     
[28] compiler_3.2.5     scales_0.4.0       stats4_3.2.5      
[31] SparseM_1.72   
wrongturn
  • Perhaps this will help you: http://stackoverflow.com/questions/18402016/error-when-i-try-to-predict-class-probabilities-in-r-caret I think your error is similar, because you do a lot of transformation between levels in your uniigScore function. – J_F Sep 19 '16 at 14:07
  • @J_F Thanks for your suggestion. I think it is something different, though. The output of the filter function is a named vector, as required. It seems to work fine if I use method = "rf" instead of method = "ranger" in sbf. – wrongturn Sep 19 '16 at 14:40

1 Answer

I think I found the solution myself:

For sbf to work with "ranger", change custom_ctrl <- trainControl(method = "none") to custom_ctrl <- trainControl(method = "none", classProbs = TRUE). The default for classProbs is FALSE, which causes problems when using "ranger".
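
As a minimal sketch of that change: only the control object passed through to train() differs, while sbf_ctrl and the sbf() call from the question stay as they are. The second half is an extra precaution that goes beyond what I tested: with classProbs = TRUE, caret expects the class levels to be valid R variable names. The levels "a" and "b" in the sample are fine; y_example is a hypothetical stand-in for labels coded as "0"/"1".

```r
library(caret)

# Only change compared to the question: classProbs = TRUE.
custom_ctrl <- trainControl(method = "none", classProbs = TRUE)

# Quick check that the flag is set before re-running sbf().
custom_ctrl$classProbs

# Hypothetical labels, only for illustration: "0"/"1" are not valid R names,
# so make.names() is used to convert the factor levels.
y_example <- factor(c("0", "1", "1", "0"))
levels(y_example) <- make.names(levels(y_example))
levels(y_example)
```

After this, the sbf() call from the question can be re-run unchanged with the new custom_ctrl.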

wrongturn