I am trying to obtain a ranking for a rather large set of features (~6,100,000) in sklearn. Here's the code I have thus far:
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.feature_selection import RFE

# 60/20/20 train/val/test split
train, test = train_test_split(rows, test_size=0.2, random_state=310)
train, val = train_test_split(train, test_size=0.25, random_state=310)

# features are every column except the first and the last; the label is the last column
train_data = [i[1:-1] for i in train]
train_target = [i[-1] for i in train]

svc = SVC(kernel='linear', verbose=5, random_state=310)
svc.fit(train_data, train_target)

rfe = RFE(svc, n_features_to_select=1, step=1, verbose=5)
rfe.fit(train_data, train_target)
rank = rfe.ranking_
Each training of the model takes ~10 minutes. For 6,100,000 features, eliminating one per step means decades of computation time: about 115.9 years. Is there a better way to do this? I know RFE requires the results of the previous elimination, but is there any way I can speed it up through parallelization, or obtain a ranking differently? I can use thousands of nodes (thanks, company I work for!), so any kind of parallelism would be awesome!
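For example, something embarrassingly parallel like a per-feature univariate score would scale across nodes. A minimal sketch of the idea, assuming the data fits in memory as a dense array, with f_classif as just one possible scoring function and joblib standing in for a real cluster:

import numpy as np
from joblib import Parallel, delayed
from sklearn.feature_selection import f_classif

X = np.asarray([i[1:-1] for i in train], dtype=np.float64)
y = np.asarray(train_target)

def score_chunk(cols):
    # ANOVA F-statistic for each feature in this column slice
    f_scores, _ = f_classif(X[:, cols], y)
    return f_scores

# split the 6.1M feature indices into chunks, one per worker/node
chunks = np.array_split(np.arange(X.shape[1]), 1000)
scores = np.concatenate(Parallel(n_jobs=-1)(delayed(score_chunk)(c) for c in chunks))
univariate_rank = np.argsort(scores)[::-1]  # feature indices, best score first

This gives a full ranking of all features in a single pass, though unlike RFE it ignores interactions between features.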
I do have the list of coefficients of the linear SVM's hyperplane. Ordering those is easy enough, but the paper this is being done for will be reviewed by a Stanford data science professor, and he has a strong reservation against using non-ranking algorithms for ranking... and against non-Stanford alums like me. :P
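For reference, ordering by those coefficients is a one-liner; a minimal sketch, assuming a binary problem so that coef_ has shape (1, n_features):

import numpy as np

# feature indices sorted by absolute hyperplane weight, largest first
coef_rank = np.argsort(np.abs(svc.coef_[0]))[::-1]

Note that (squared) coefficient magnitude is exactly the importance criterion RFE applies at each elimination step; RFE just refits the model and recomputes it after every elimination.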
I can take a larger step, but that would remove the ability to actually rank all features individually; instead I would rank groups of 100,000 or 10,000 features, which isn't super helpful.
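The best compromise I can think of along those lines is coarse-to-fine: use a big step to cut the feature set down, then step=1 on the survivors, so the top features get exact ranks and everything eliminated early keeps its group rank. A rough sketch (the step size and the cutoff of 1,000 survivors are arbitrary; the coarse stage is ~600 fits instead of 6.1 million):

import numpy as np
from sklearn.feature_selection import RFE
from sklearn.svm import SVC

X = [i[1:-1] for i in train]
y = train_target

# stage 1: coarse elimination, dropping 10,000 features per fit
coarse = RFE(SVC(kernel='linear', random_state=310),
             n_features_to_select=1000, step=10000, verbose=5)
coarse.fit(X, y)

# stage 2: exact one-at-a-time ranking of the 1,000 survivors
fine = RFE(SVC(kernel='linear', random_state=310),
           n_features_to_select=1, step=1, verbose=5)
fine.fit(coarse.transform(X), y)

# merge: survivors get exact ranks 1..1000; everything else keeps
# its coarse group rank, shifted to start after the survivors
full_rank = coarse.ranking_.astype(int) + 999
full_rank[coarse.support_] = fine.ranking_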
EDIT: The number of support vectors (nSV) might be useful, so I've included the libsvm output below:
obj = -163.983323, rho = -0.999801
nSV = 182, nBSV = 148
Total nSV = 182