
I am working on a classification problem. I have around 1000 features, and the target variable has 2 classes. All 1000 features take the values 0 or 1. When I compute feature importances, the values range only from 0.0 to 0.003, and I am not sure whether such low values are meaningful.

Is there a way I can increase the feature importance values?

# Variable importance
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(min_samples_split=10, random_state=1)
rf.fit(X, Y)
print("Features sorted by their score:")
# Pair each importance with its feature name (assuming X is a DataFrame)
# and sort, highest first
a = sorted(zip(map(lambda x: round(x, 3), rf.feature_importances_), X.columns),
           reverse=True)
print(a)

I would really appreciate any help! Thanks

TigSh

1 Answer


Since you only have two target classes, you can perform an unequal-variance (Welch's) t-test, which I have found useful for identifying important features in a binary classification task when all other feature-ranking methods failed me. You can implement this with the scipy.stats.ttest_ind function. It is a statistical test that checks whether two distributions are different: if the returned p-value is less than 0.05, they can be assumed to be different distributions. To apply it to each feature, follow these steps:

  1. Extract the feature's values for class 1 and class 2 respectively.
  2. Run ttest_ind on these two samples, passing equal_var=False so the variances are not assumed equal, and make sure it is a two-tailed test (the default).
  3. If the p-value is less than 0.05, treat the feature as important.

Alternatively, you can run the test on all your features and use the p-value itself as the measure of feature importance: the lower the p-value, the more important the feature.
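A minimal sketch of the per-feature procedure above, using a small synthetic binary feature matrix in place of your real X and Y (the data here is purely illustrative):

```python
import numpy as np
from scipy.stats import ttest_ind

# Hypothetical stand-in data: 200 samples, 5 binary features, binary target
rng = np.random.default_rng(1)
X = rng.integers(0, 2, size=(200, 5))
y = rng.integers(0, 2, size=200)

p_values = []
for j in range(X.shape[1]):
    # Step 1: extract this feature's values for each class
    group0 = X[y == 0, j]
    group1 = X[y == 1, j]
    # Step 2: Welch's t-test (equal_var=False); two-tailed by default
    t_stat, p = ttest_ind(group0, group1, equal_var=False)
    p_values.append(p)

# Step 3 / ranking: lower p-value -> more important feature
ranking = np.argsort(p_values)
print("Feature indices, most to least important:", ranking)
```

With your actual 1000 features this loop stays cheap, and you can threshold at p < 0.05 or simply sort by p-value as described.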

Cheers!

Manny