
I am trying to obtain a ranking of features from a rather large set of features (~6,100,000) in sklearn. Here's the code I have thus far:

from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.feature_selection import RFE

# Each row holds an ID in column 0 (judging by the slicing below), the
# features, and the label in the last column.
train, test = train_test_split(rows, test_size=0.2, random_state=310)
train, val = train_test_split(train, test_size=0.25, random_state=310)
train_target = [i[-1] for i in train]

svc = SVC(verbose=5, random_state=310, kernel='linear')
svc.fit([i[1:-1] for i in train], train_target)

# Recursively eliminate one feature per iteration until a full ranking is produced.
rfe = RFE(svc, verbose=5, step=1, n_features_to_select=1)
rfe.fit([i[1:-1] for i in train], train_target)
rank = rfe.ranking_

Each training of the model takes ~10 minutes. For 6,100,000 features, that means decades of computation time; 115.9 years, actually. Any better way to do this? I know RFE requires the results of the last elimination, but is there any way I can speed this up through parallelization, or obtain a ranking differently? I can use thousands of nodes (thanks, company I work for!), so any kind of parallelism would be awesome!

I do have the list of coefficients of the linear SVM's hyperplane. Ordering those is easy enough, but the paper this is being done for will be reviewed by a Stanford data science professor, and he has strong reservations against using non-ranking algorithms for ranking... and against non-Stanford alums like me. :P
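
For reference, this is roughly what ordering the coefficients would look like (a minimal sketch; it assumes a binary problem, so svc.coef_ has shape (1, n_features)):

import numpy as np

# Hyperplane weights of the fitted linear SVC from above.
importance = np.abs(svc.coef_[0])

# Feature indices sorted from most to least influential.
coef_rank = np.argsort(importance)[::-1]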

I can take a larger step, but that would remove the ability to actually rank all features; rather, I would rank groups of 100,000 or 10,000 features, which isn't super helpful.
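
For concreteness, a larger step would look like this (a sketch; the step size is illustrative):

from sklearn.feature_selection import RFE

# step=10000 drops 10,000 features per iteration: ~610 fits instead of ~6.1M,
# but all features eliminated in the same iteration share a rank.
rfe = RFE(svc, verbose=5, step=10000, n_features_to_select=1)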

EDIT: nSV might be useful, so I've included it below:

obj = -163.983323, rho = -0.999801
nSV = 182, nBSV = 148
Total nSV = 182
Joe B
  • How about PCA or random projections? – Lukasz Tracewski Feb 21 '19 at 22:00
  • @LukaszTracewski How does one use PCA to rank features? – Joe B Feb 21 '19 at 22:18
  • I'd check which features contribute most: https://stackoverflow.com/questions/40295888/how-to-find-most-contributing-features-to-pca – Lukasz Tracewski Feb 22 '19 at 07:19
  • You could check the correlation of each feature with your output using sklearn's correlation matrix. Sort it and select the N most correlated features? – Achintha Ihalage Feb 22 '19 at 11:16
  • I have 130 features and RFE is taking more than 30 minutes. Is there a way to use RFE to determine the best number of features? – taga Aug 18 '19 at 10:12
  • @taga I would consider letting it run to completion, as 30 minutes isn't too much. Alternatively, you can remove more than one feature with each recursion; this speeds up the runtime roughly n-fold, where n is the number of features you remove at each recursion. – Joe B Aug 19 '19 at 20:49

1 Answer

You should use a different algorithm. There has been a lot of research on how to speed up feature selection, and RFE's computational complexity is prohibitive for a large set of features. Consider approaches designed for high-dimensional data, such as FBED (Forward-Backward selection with Early Dropping), OMP (Orthogonal Matching Pursuit), SES (Statistically Equivalent Signatures), or LASSO; a LASSO-style sketch follows the links below.

FBED: https://arxiv.org/abs/1705.10770

OMP: https://arxiv.org/abs/2004.00281

SES: https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-018-2023-7

LASSO: https://ieeexplore.ieee.org/document/7887916
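
Since the question already uses a linear SVM, here is a minimal sketch of the LASSO idea via an L1-penalized linear SVM (X and y stand in for your feature matrix and labels; the C value is illustrative):

import numpy as np
from sklearn.svm import LinearSVC

# The L1 penalty drives most coefficients to exactly zero (LASSO-style sparsity).
# dual=False is required with penalty='l1'; tune C for the desired sparsity.
clf = LinearSVC(penalty='l1', dual=False, C=0.01)
clf.fit(X, y)

# For a binary problem coef_ has shape (1, n_features); rank by absolute weight.
# Features zeroed out by the penalty tie at the bottom of the ranking.
l1_rank = np.argsort(np.abs(clf.coef_[0]))[::-1]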

Stefanos
  • Thanks for the suggestions, but we were looking to obtain a ranking. Any ideas for better algorithms on that end? – Joe B Jun 16 '20 at 19:00
  • If you would like a ranking of the most significant (i.e. selected) features, you can get it from the suggested algorithms. But if you would like a ranking over all features, it is not trivial; you will have to define the desired output of your experiment more precisely. For example, you could test each single feature for unconditional independence with your target variable and then rank the features by the test statistic (a sketch of this follows below). This method, though, will not test for conditional dependence with the target. – Stefanos Jun 17 '20 at 10:37
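
A minimal sketch of the univariate ranking described in the last comment, using sklearn's ANOVA F-test (X and y again stand in for your feature matrix and labels):

import numpy as np
from sklearn.feature_selection import f_classif

# Per-feature F-statistic and p-value, each feature tested independently
# against the target; the tests are embarrassingly parallel across features.
f_scores, p_values = f_classif(X, y)

# Rank all 6.1M features by the test statistic, strongest association first.
univariate_rank = np.argsort(f_scores)[::-1]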