
I have a dataset that looks like this:

    ID  885038  885039  885040  885041  885042  885043  885044  Class
1267359 2       0       0       0       0       1       0      0
1295720 0       0       0       0       0       1       0      0
1295721 0       0       0       0       0       1       0      0
1295723 0       0       0       0       0       1       0      0
1295724 0       0       0       1       0       1       0      0
1295725 0       0       0       1       0       1       0      0
1295726 2       0       0       0       0       1       0      1
1295727 2       0       0       0       0       1       0      1
1295740 0       0       0       0       0       1       0      1
1295742 0       0       0       0       0       1       0      1
1295744 0       0       0       0       0       1       0      1
1295745 0       0       0       0       0       1       0      1
1295746 0       0       0       0       0       1       0      1
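
For a copy-pasteable version of the table above (the answer below refers to this data frame as mydf), something like the following reconstructs it:

mydf <- read.table(header = TRUE, check.names = FALSE, text = "
ID      885038 885039 885040 885041 885042 885043 885044 Class
1267359 2      0      0      0      0      1      0      0
1295720 0      0      0      0      0      1      0      0
1295721 0      0      0      0      0      1      0      0
1295723 0      0      0      0      0      1      0      0
1295724 0      0      0      1      0      1      0      0
1295725 0      0      0      1      0      1      0      0
1295726 2      0      0      0      0      1      0      1
1295727 2      0      0      0      0      1      0      1
1295740 0      0      0      0      0      1      0      1
1295742 0      0      0      0      0      1      0      1
1295744 0      0      0      0      0      1      0      1
1295745 0      0      0      0      0      1      0      1
1295746 0      0      0      0      0      1      0      1
")  # check.names = FALSE keeps the numeric column names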

With the intention of doing recursive feature elimination, I followed these steps:

  1. Train the SVM classifier.
  2. Compute the ranking criterion for all features (for a linear SVM, the squared components of the weight vector).
  3. Remove the feature(s) with the smallest ranking values.
  4. Go to 1.

Following is the R code I have written for this. It runs without any error, and at each pass through the loop it prints the number of remaining columns and the AUC.

data <- read.csv("dummy - Copy.csv", header = TRUE)
rownames(data) <- data[,1]
data<-data[,-1]

for (k in 1:length(data)){  # length() of a data frame = number of columns

  inTraining <- createDataPartition(data$Class, p = .70, list = FALSE)
  training <- data[ inTraining,]
  testing  <- data[-inTraining,]

  ## Building the model ####
  # metric = "ROC" is a caret::train() argument; e1071::svm() would silently ignore it
  svm.model <- svm(Class ~ ., data = training, cross = 10,
                   type = "eps-regression", kernel = "linear",
                   na.action = na.omit, probability = TRUE)

  ###### auc  measure #######

  #prediction on the held-out test set
  svm.pred <- predict(svm.model, testing, probability = TRUE)

  #calculating auc
  score <- as.numeric(svm.pred) - 1  # renamed from `c`, which masks base::c
  pred <- prediction(score, testing$Class)
  perf <- performance(pred,"tpr","fpr")
  plot(perf,fpr.stop=0.1)
  auc <- performance(pred, measure = "auc")
  auc <- auc@y.values[[1]]

  #compute the weight vector of the linear SVM
  w <- t(svm.model$coefs) %*% svm.model$SV

  #compute ranking criteria (squared weights)
  weight_matrix <- w * w

  #rank the features
  w_transpose <- t(weight_matrix)
  w2 <- as.matrix(w_transpose[order(w_transpose[,1], decreasing = FALSE),])
  a <- as.matrix(w2[which(w2 == min(w2)),]) # rows (features) with the minimum criterion
  remove <- row.names(a)
  data <- data[, setdiff(colnames(data), remove)]  # drop the lowest-ranked feature(s)
  print(length(data))
  n_cols <- length(data)  # the original used `length`, which masks base::length
  cols_names <- colnames(data)
  print(auc)
  output <- paste(n_cols, auc, sep = ";")
  write(output, file = "output.txt", append = TRUE)
  write(cols_names, file = paste(n_cols, "cols_selected", ".txt", sep = ""))
}

The printed output (the number of remaining columns, then the AUC, for each iteration) is like

[1] 3
[1] 0.5
[1] 2
[1] 0.5
[1] 2
[1] 0.5
[1] 2
[1] 0.75
[1] 2
[1] 1
[1] 2
[1] 0.75
[1] 2
[1] 0.5
[1] 2
[1] 0.75

But when I pick any of the feature subsets, e.g. the 3-column subset, and build an SVM model using the above code (without the loop), I don't get the same AUC value; I get 0.75 instead.

data <- read.csv("3.csv", header = TRUE)
rownames(data) <- data[,1]
data<-data[,-1]

  inTraining <- createDataPartition(data$Class, p = .70, list = FALSE)
  training <- data[ inTraining,]
  testing  <- data[-inTraining,]

  ## Building the model ####
  svm.model <- svm(Class ~ ., data = training, cross = 10,
                   type = "eps-regression", kernel = "linear",
                   na.action = na.omit, probability = TRUE)

  ###### auc  measure #######

  #prediction on the held-out test set
  svm.pred <- predict(svm.model, testing, probability = TRUE)

  #calculating auc
  score <- as.numeric(svm.pred) - 1
  pred <- prediction(score, testing$Class)
  perf <- performance(pred,"tpr","fpr")
  plot(perf,fpr.stop=0.1)
  auc <- performance(pred, measure = "auc")
  auc <- auc@y.values[[1]]

  print(auc)

This prints
    [1] 3
    [1] 0.75 (instead of 0.5)

Both pieces of code are the same (one runs inside the elimination loop, the other without it), yet there is a difference in AUC values for the same feature subset.

The 3 columns (885041, 885043 and Class) are the same in both cases, yet the two runs give different AUC values.
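
As the comments below note, createDataPartition() draws a fresh random 70/30 split on every call, so the two runs are not scored on the same test set; fixing the RNG seed before partitioning makes the comparison reproducible. A minimal sketch, with an arbitrary seed value:

set.seed(123)  # any fixed value; makes the partition reproducible across runs
inTraining <- createDataPartition(data$Class, p = .70, list = FALSE)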

  • I don't understand why this question is voted down. I have updated the post with what I have tried, and I did a quick search to make sure it is not a duplicate. The question is about programming, which is why I asked it on Stack Overflow. – sp2 Jan 02 '19 at 16:53
  • I agree with you; this question doesn't seem unfit. – user1083096 Jan 02 '19 at 16:57
  • If you can provide a reproducible example of your dataset, I believe it would be a good post. – www Jan 02 '19 at 16:58
  • 'It doesn't work' is also a downvote magnet. – Hatt Jan 02 '19 at 17:00
  • I have given a dataset example and also the code I have written, for which I get no error but also no output. How else can I make it reproducible? – sp2 Jan 02 '19 at 17:08
  • @Hatt Since it is not working, I have asked this question. – sp2 Jan 02 '19 at 17:13
  • I like that you researched the question, since many, many questions don't bother beforehand. @www is referring to your image: to reproduce your error, we need the same data (that we don't have to retype ourselves) to run your code. Also, if nothing happens, it is better to report that than 'it doesn't work'; many people actually receive errors but just say 'it doesn't work'. Your answer to www was better than telling me again that it's not working. – Hatt Jan 02 '19 at 17:19
  • If SO provided an automatic image reader that could take tabular bitmapped images and convert them to text, there would be no problem, but as it is you are implying that we need to redo that data entry in order to test any modifications or improvements to your code. Most of us are not sufficiently motivated to redo your data entry. This is the same problem that readers of the R-help mailing list faced when someone (you?) posted a similarly titled question a couple of days ago. Referencing a csv file isn't reproducible, and posting an image file is a slight improvement, but not quite good enough. – IRTFM Jan 02 '19 at 17:25
  • Thanks for the suggestion; I have incorporated it in the post. I hope the code is reproducible now. – sp2 Jan 02 '19 at 18:05
  • @RLave I have edited the question again to make it reproducible. – sp2 Jan 08 '19 at 13:45
  • @sp2 Take a look at https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example for ways to share reproducible data. Also, in your code there are two different .csv file names, so maybe that is the cause. – Marek Jan 09 '19 at 10:11
  • @Marek I agree with you; set.seed is the key. – sp2 Jan 13 '19 at 02:18

1 Answer


I think using cross-validation only is fine. In your code you already use 10-fold CV for the test error, so splitting the data set again seems needless.
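
For reference, with cross = 10 the fitted e1071 object already carries the fold-wise results; a minimal sketch, assuming the training data frame and the eps-regression fit from the question:

library(e1071)

svm.model <- svm(Class ~ ., data = training, cross = 10,
                 type = "eps-regression", kernel = "linear")
svm.model$MSE      # per-fold cross-validated mean squared error
svm.model$tot.MSE  # overall 10-fold CV MSE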

Since you did not mention tuning, the parameters cost and gamma are left at their defaults.
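
If you do want to tune them, e1071's tune() does a grid search; a sketch, where the cost grid is only an illustrative example (gamma is irrelevant for a linear kernel):

library(e1071)

tuned <- tune(svm, Class ~ ., data = training,
              kernel = "linear", type = "eps-regression",
              ranges = list(cost = 2^(-2:4)))  # illustrative grid, not a recommendation
tuned$best.parameters
tuned$best.model  # refit with the best cost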

library(tidyverse)
library(e1071)   # svm
library(caret)   # createFolds
library(ROCR)    # prediction, performance
library(foreach)

The feature names are numeric, and it seems that svm() changes them during the fitting process. To be able to match them afterwards, I would change the column names first.

Second, the folds can be assigned with caret::createFolds() instead of createDataPartition().

set.seed(1)
k <- 5 # 5-fold CV
mydf3 <-
  mydf %>% 
  rename_at(.vars = vars(-ID, -Class), .funs = function(x) str_c("X.", x, ".")) %>% 
  mutate(fold = createFolds(1:n(), k = k, list = FALSE)) # fold id column

# the number of features-------------------------------
x_num <-
  mydf3 %>% 
  select(-ID, -Class, -fold) %>% 
  ncol()

To iterate, foreach() can be another option.

cl <- parallel::makeCluster(2)
doParallel::registerDoParallel(cl, cores = 2)
parallel::clusterExport(cl, c("mydf3", "x_num"))
parallel::clusterEvalQ(cl, c(library(tidyverse), library(ROCR)))
#---------------------------------------------------------------
svm_rank <-
  foreach(j = seq_len(x_num), .combine = rbind) %do% { # one feature is eliminated per outer step
    mod <-
      foreach(cv = 1:k, .combine = bind_rows, .inorder = FALSE) %dopar% { # parallelization over folds
        tr <-
          mydf3 %>% 
          filter(fold != cv) %>% # train
          select(-fold, -ID) %>% 
          e1071::svm( # fitting svm
           Class ~ .,
           data = .,
           kernel = "linear",
           type = "eps-regression",
           probability = TRUE,
           na.action = na.omit
          )
        # auc
        te <-
          mydf3 %>% 
          filter(fold == cv) %>% 
          predict(tr, newdata = ., probability = TRUE)
        predob <- prediction(te, mydf3 %>% filter(fold == cv) %>% select(Class))
        auc <- performance(predob, measure = "auc")@y.values[[1]]
        # ranking - your formula
        w <- t(tr$coefs) %*% tr$SV
        if (is.null(names(w))) colnames(w) <- attr(tr$terms, "term.labels") # when only one feature left
        (w * w) %>%
          tbl_df() %>%
          mutate(auc = auc)
      }
    auc <- mean(mod %>% select(auc) %>% pull()) # aggregate cv auc
    w_mat <- colMeans(mod %>% select(-auc)) # aggregate cv ranking
    remove <- names(which.min(w_mat)) # minimum rank
    used <-
      mydf3 %>% 
      select(-ID, -Class, -fold) %>% 
      names() %>% 
      str_c(collapse = " & ")
    mydf3 <-
      mydf3 %>%
      select(-remove) # remove feature for next step
    tibble(used = used, delete = remove, auc = auc)
  }
#---------------------------------------------------
parallel::stopCluster(cl)

For each elimination step you get the features used, the feature deleted, and the aggregated CV AUC:

svm_rank
#> # A tibble: 7 x 3
#>   used                                                      delete     auc
#>   <chr>                                                     <chr>    <dbl>
#> 1 X.885038. & X.885039. & X.885040. & X.885041. & X.885042… X.88503…   0.7
#> 2 X.885038. & X.885040. & X.885041. & X.885042. & X.885043… X.88504…   0.7
#> 3 X.885038. & X.885041. & X.885042. & X.885043. & X.885044. X.88504…   0.7
#> 4 X.885038. & X.885041. & X.885043. & X.885044.             X.88504…   0.7
#> 5 X.885038. & X.885041. & X.885043.                         X.88504…   0.7
#> 6 X.885038. & X.885041.                                     X.88503…   0.7
#> 7 X.885041.                                                 X.88504…   0.7
younggeun