
Following up on the post "Filtering out columns in R": the columns consisting entirely of 1's or entirely of 0's were successfully eliminated from the training data. However, the classification algorithm still complains about columns where almost all of the values are 0 (every value in the column is 0 except one or two).

I am using the penalizedSVM R package to perform feature selection. Looking more closely at the data set, it is the function svm.fs that complains about the columns where all but one or two of the values are 0.

How can one modify (or add something to) the following code so that these columns are filtered out as well?

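library(plyr)          # provides colwise()
library(penalizedSVM)  # provides svm.fs()

# Candidate values for the SCAD penalty; only the 2nd and 3rd are kept
lambda1.scad <- c(seq(0.01, 0.05, 0.01), seq(0.1, 0.5, 0.2), 1)
lambda1.scad <- lambda1.scad[2:3]
seed <- 123

# Keep numeric columns that are neither all 1's nor all 0's
f0 <- function(x) is.numeric(x) && any(x != 1) && any(x != 0)
trainingdata <- lapply(trainingdata, function(data) cbind(label = data$label,
                            colwise(identity, f0)(data)))

# Split the first data set into a predictor matrix and a label vector
datax <- trainingdata[[1]]
levels(datax$label) <- c(-1, 1)
train_x <- datax[, -1]
train_x <- data.matrix(train_x)
trainy <- datax[, 1]

# Replace missing and infinite values with 0
idx <- is.na(train_x) | is.infinite(train_x)
train_x[idx] <- 0

# Run SCAD feature selection; on failure scad.fix holds the error object
scad.fix <- tryCatch(svm.fs(train_x, y = trainy, fs.method = "scad",
                            cross.outer = 0, grid.search = "discrete",
                            lambda1.set = lambda1.scad, parms.coding = "none",
                            show = "none", maxIter = 1000, inner.val.method = "cv",
                            cross.inner = 5, seed = seed, verbose = FALSE),
                     error = function(e) e)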

Or one may propose an entirely different solution.

Shahzad
  • Please include what you have tried up to this point in future questions. Also, include information such as the classification algorithm you've chosen, maybe there is a parameter you're missing, but we can't help unless we know more! – Justin Mar 05 '13 at 17:29

1 Answer


Use the fact that boolean values can be summed and define some tolerance of zeros:

sum(x == 0) / length(x) >= tolerance

This becomes your condition for dropping a column. However, zeros are often not only valid data but also critical to the phenomenon being studied. You should think carefully about your algorithm choice and the decision to drop columns before going forward with this approach.
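As a minimal sketch of how that condition could be applied column by column: the helper name mostly_zero, the toy data frame, and the 0.9 threshold below are all made up for illustration and are not from the question. mean(x == 0) is used as shorthand for sum(x == 0) / length(x); the two agree when the column has no missing values.

```r
# Hypothetical helper: TRUE when the share of zeros in x is at least `tolerance`.
# isTRUE() guards against NA/NaN results (e.g. a column that is entirely NA).
mostly_zero <- function(x, tolerance = 0.95) {
  is.numeric(x) && isTRUE(mean(x == 0, na.rm = TRUE) >= tolerance)
}

# Toy data standing in for one element of trainingdata (values are invented).
toy <- data.frame(
  label  = factor(rep(c("a", "b"), each = 10)),
  dense  = seq_len(20) / 20,        # no zeros at all
  sparse = c(rep(0, 19), 1),        # 19 of 20 values are zero
  mixed  = rep(c(0, 1), 10)         # half of the values are zero
)

# Test every predictor column, but never the label column.
predictors <- setdiff(names(toy), "label")
to_drop <- vapply(toy[predictors], mostly_zero, logical(1), tolerance = 0.9)

# Keep the label plus every predictor that was not flagged.
filtered <- toy[, c("label", predictors[!to_drop]), drop = FALSE]
names(filtered)
```

With this toy data only the sparse column is flagged. In the question's code, a test like this could presumably be folded into f0 (for example by adding !mostly_zero(x) to the condition), but what tolerance is appropriate depends on the data and on what svm.fs actually objects to.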

Justin