
I have a dataset containing 8 parameters (4 continuous, 4 categorical), and I am trying to eliminate features using the RFECV class in scikit-learn.

This is the code I am using:

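from sklearn.svm import SVC
from sklearn.feature_selection import RFECV
from sklearn.model_selection import StratifiedKFold

# Older scikit-learn releases wrote the CV splitter as StratifiedKFold(y, 2).
svc = SVC(kernel="linear")
rfecv = RFECV(estimator=svc, step=1, cv=StratifiedKFold(n_splits=2),
              scoring='accuracy')
rfecv.fit(X, y)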

As I also have categorical data, I converted it to dummy variables using dmatrices (Patsy).
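To make the encoding step concrete, here is a minimal sketch of what dummy coding does to a categorical column. It uses pandas.get_dummies as a stand-in for Patsy's dmatrices (a substitution for illustration only), and the column names and values are made up.

```python
import pandas as pd

# Hypothetical toy data: one continuous and one categorical column.
df = pd.DataFrame({
    "age": [23.0, 35.0, 41.0, 29.0],
    "color": ["red", "green", "blue", "green"],
})

# Each level of the categorical column becomes its own 0/1 column,
# so a single variable turns into several features.
X = pd.get_dummies(df, columns=["color"], dtype=float)
print(list(X.columns))
```

After this step, any feature selector sees one column per level rather than one column per original variable.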

I want to try different classification models, in addition to SVC, on the data after feature selection to improve the model.
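One possible way to do this is sketched below: fit RFECV once, reduce the matrix with its transform method, and cross-validate other classifiers on the reduced matrix. The synthetic data from make_classification stands in for the already-encoded dataset, and the choice of LogisticRegression and RandomForestClassifier is an assumption made purely for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in for the already-encoded feature matrix (assumption).
X, y = make_classification(n_samples=200, n_features=10,
                           n_informative=4, random_state=0)

cv = StratifiedKFold(n_splits=2)

# Select features with a linear SVC, as in the question.
rfecv = RFECV(estimator=SVC(kernel="linear"), step=1, cv=cv,
              scoring="accuracy")
rfecv.fit(X, y)

# Keep only the columns RFECV retained.
X_selected = rfecv.transform(X)

# Candidate models to compare on the reduced matrix (illustrative choices).
models = {
    "svc": SVC(kernel="linear"),
    "logreg": LogisticRegression(max_iter=1000),
    "forest": RandomForestClassifier(n_estimators=50, random_state=0),
}

# Mean cross-validated accuracy per model.
scores = {
    name: cross_val_score(model, X_selected, y, cv=cv,
                          scoring="accuracy").mean()
    for name, model in models.items()
}
print(scores)
```

Note that the features were chosen using all of the data and a linear SVC, so these scores are likely optimistic and the selected subset may not suit every model equally well. Wrapping the selector and the classifier in a Pipeline would keep the selection inside each cross-validation fold.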

I ran RFECV after transforming the data, and I think I am doing it wrong.
Do we run RFECV before transforming the categorical data or after?

I can't find any clear indication in the documentation.

Kara
Hitesh

1 Answer


It depends on whether you want to select individual values of the categorical variable or the whole variable. You are currently selecting single settings (aka levels) of the categorical variable. To select whole variables, you would probably need to do a bit of hackery, defining your own estimator based on SVC. You could do make_pipeline(OneHotEncoder(categorical_features), SVC()), but then you need to set the coef_ of the pipeline to something that reflects the input shape.
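The difference between selecting levels and selecting variables can be seen in a small sketch. It is not the custom-estimator approach described above; it only runs RFECV on dummy-coded data and then maps the surviving columns back to the variables they came from. The data, the column names, and the reliance on the default "_" separator of pandas.get_dummies are all assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import RFECV
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import SVC

# Hypothetical toy data: one continuous and two categorical variables.
rng = np.random.RandomState(0)
n = 120
df = pd.DataFrame({
    "x1": rng.normal(size=n),
    "color": rng.choice(["red", "green", "blue"], size=n),
    "shape": rng.choice(["round", "square"], size=n),
})
# The target depends on x1 and on one level of "color" only.
y = ((df["x1"] + (df["color"] == "red").astype(float)) > 0.5).astype(int)

# Dummy-code the categorical variables: one column per level.
X = pd.get_dummies(df, columns=["color", "shape"], dtype=float)

rfecv = RFECV(estimator=SVC(kernel="linear"), step=1,
              cv=StratifiedKFold(n_splits=2), scoring="accuracy")
rfecv.fit(X.to_numpy(), y.to_numpy())

# RFECV keeps or drops individual dummy columns, i.e. single levels.
selected_columns = list(X.columns[rfecv.support_])

# Map each dummy column back to its source variable. This relies on the
# "<variable>_<level>" naming and on variable names without underscores.
kept_variables = sorted({col.split("_")[0] for col in selected_columns})

print(selected_columns)
print(kept_variables)
```

Here a variable counts as kept if any of its levels survives, which is one possible convention rather than a built-in scikit-learn behavior.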

Andreas Mueller
  • Thanks, Andreas. Actually, I haven't tried the OneHotEncoder method and was using dmatrices to transform the categorical data. I ran RFECV after transforming the categorical data and it worked fine (it gave me the optimal number of features), but when I try running it before transforming the categorical data I get the error "Cant convert String to Float". So I got confused about whether it is possible to run recursive feature selection on categorical data before transforming it. Thanks again – Hitesh Apr 10 '15 at 17:34
  • One could argue that this is too strict input validation in RFE. However, as we currently don't really support feature selection on pipelines, I'm not sure there is a good reason to change that. – Andreas Mueller Apr 10 '15 at 18:01
  • Thanks again. So will it be correct to assume that RFECV can and should be run only after the categorical data transformation, whether by pipeline or some other method? It makes sense now – Hitesh Apr 11 '15 at 02:41
  • Not really. Both are possible, but do different things. But doing it before the transformation doesn't really work out of the box in scikit-learn right now. – Andreas Mueller Apr 13 '15 at 16:36
  • Thanks for clarifying. Appreciate it – Hitesh Apr 14 '15 at 17:49