
I have a dataset of 150 samples and almost 10000 features. I have clustered the samples into 6 clusters. I used sklearn.feature_selection.RFECV to reduce the number of features. The method estimates the number of important features as 3000, with ~95% accuracy using 10-fold CV. However, I can get ~92% accuracy using only around 250 features (I plotted this using grid_scores_). Therefore, I would like to extract those 250 features.

I have checked the question Getting features in RFECV scikit-learn and found that you can compute the importances of the selected features with:

np.absolute(rfecv.estimator_.coef_)

which, for binary classification, returns an array whose length equals the number of selected features. As I indicated before, I have 6 clusters, and sklearn.feature_selection.RFECV does one-vs-one classification, so I get a (15, 3000) ndarray (15 = 6·5/2 pairwise classifiers). I do not know how to proceed. I was thinking of taking the dot product for each feature, like this:

cofs = rfecv.estimator_.coef_   # shape (15, 3000): one row per one-vs-one classifier

coeffs = []
for x in range(cofs.shape[1]):
    vec = cofs[:, x]            # coefficients of feature x across all 15 classifiers
    weight = vec @ vec          # squared L2 norm (vec is 1-D, so transpose is a no-op)
    coeffs.append(weight)

This gives me 3000 scores, one per feature. I can sort these and get the result I want. But I am not sure whether it is right and makes sense. I would really appreciate any other solutions.
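For reference, the loop collapses into one vectorized NumPy expression; a minimal sketch, with a random matrix standing in for the fitted `rfecv.estimator_.coef_`:

```python
import numpy as np

# Stand-in for rfecv.estimator_.coef_: 15 one-vs-one classifiers x 3000 features
rng = np.random.default_rng(0)
cofs = rng.normal(size=(15, 3000))

# Squared L2 norm of each feature's coefficient column across all classifiers;
# equivalent to computing vec @ vec for every column in a loop
coeffs = np.square(cofs).sum(axis=0)   # shape (3000,)

# Feature indices ordered from most to least important
order = np.argsort(coeffs)[::-1]
```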

arta

1 Answer


Well, I delved into the source code. Here is what I found; they are doing pretty much the same thing:

# Get ranks
if coefs.ndim > 1:
    ranks = np.argsort(safe_sqr(coefs).sum(axis=0))
else:
    ranks = np.argsort(safe_sqr(coefs))

So for a multi-class problem, they sum the squared coefficients across the classifiers, which is exactly the weight my loop computes. Hope that helps others.
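Given that ranking, pulling out, say, the 250 highest-scoring features is one more `argsort` slice. A sketch, using a random matrix in place of the fitted (15, 3000) coefficient matrix:

```python
import numpy as np

# Stand-in for the fitted rfecv.estimator_.coef_
coefs = np.random.default_rng(1).normal(size=(15, 3000))

# Same aggregation the sklearn source uses: sum of squared coefficients per feature
importances = np.square(coefs).sum(axis=0)

# Indices of the 250 most important features, best first
top_k = 250
top_idx = np.argsort(importances)[::-1][:top_k]
```

Note that these indices refer to the 3000 features RFECV already selected; `rfecv.get_support(indices=True)` maps the selected features back to the original columns.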

arta