Questions tagged [feature-selection]

In machine learning, feature selection is the process of selecting a subset of the most relevant features for constructing your model.

Feature selection is an important step to remove irrelevant or redundant features from our data. For more details, see Wikipedia.

1533 questions
143
votes
7 answers

How are feature_importances in RandomForestClassifier determined?

I have a classification task with a time-series as the data input, where each attribute (n=23) represents a specific point in time. Besides the absolute classification result, I would like to find out which attributes/dates contribute to the result…
user2244670
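A minimal sketch of how these importances are usually read off a fitted model, assuming scikit-learn's impurity-based `feature_importances_` (the data here is a hypothetical stand-in for the 23-attribute input described above):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Toy stand-in for the 23-attribute time-series data
X, y = make_classification(n_samples=200, n_features=23, random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# feature_importances_ is the mean decrease in impurity per feature,
# averaged over all trees and normalized to sum to 1
ranked = sorted(enumerate(clf.feature_importances_),
                key=lambda t: t[1], reverse=True)
for idx, imp in ranked[:5]:
    print(f"attribute {idx}: {imp:.3f}")
```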
82
votes
9 answers

The easiest way to get feature names after running SelectKBest in scikit-learn

I'm trying to conduct a supervised machine-learning experiment using the SelectKBest feature of scikit-learn, but I'm not sure how to create a new dataframe after finding the best features: Let's assume I would like to conduct the experiment…
Aviade
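One common approach, sketched on a hypothetical dataframe: `SelectKBest.get_support()` gives a boolean mask over the input columns, which can be used to rebuild a dataframe with only the selected features.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)

selector = SelectKBest(f_classif, k=2).fit(df, iris.target)
mask = selector.get_support()        # boolean mask of kept columns
selected = df.columns[mask]          # names of the best features
df_new = df[selected]                # new dataframe with only those columns
print(list(selected))
```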
77
votes
4 answers

Linear regression analysis with string/categorical features (variables)?

Regression algorithms seem to be working on features represented as numbers. For example: This data set doesn't contain categorical features/variables. It's quite clear how to do regression on this data and predict price. But now I want to do a…
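The usual answer is to one-hot encode the string column before fitting; a minimal sketch on made-up housing data (column names and values are hypothetical):

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical toy data: 'city' is a string/categorical feature
df = pd.DataFrame({
    "sqft": [700, 900, 1200, 1500],
    "city": ["A", "B", "A", "B"],
    "price": [100, 150, 180, 240],
})

# One-hot encode the categorical column into numeric indicator columns
X = pd.get_dummies(df[["sqft", "city"]], columns=["city"])
model = LinearRegression().fit(X, df["price"])
print(X.columns.tolist())
print(model.predict(X))
```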
75
votes
3 answers

Feature/Variable importance after a PCA analysis

I have performed a PCA analysis over my original dataset and from the compressed dataset transformed by the PCA I have also selected the number of PCs I want to keep (they explain almost 94% of the variance). Now I am struggling with the…
fbm
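One common way to relate principal components back to the original variables is to inspect `pca.components_` (the loadings); a minimal sketch on a substitute dataset:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data
pca = PCA(n_components=2).fit(X)

# components_ has shape (n_components, n_features); the absolute value of
# each entry is the loading of that original feature on the component
loadings = np.abs(pca.components_)
for i, row in enumerate(loadings):
    print(f"PC{i + 1} is dominated by original feature index {row.argmax()}")
print(f"explained variance kept: {pca.explained_variance_ratio_.sum():.2%}")
```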
52
votes
8 answers

Random Forest Feature Importance Chart using Python

I am working with RandomForestRegressor in python and I want to create a chart that will illustrate the ranking of feature importance. This is the code I used: from sklearn.ensemble import RandomForestRegressor MT= pd.read_csv("MT_reduced.csv") df…
user348547
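A minimal sketch of one way to draw such a ranking chart with matplotlib (the data, feature names, and output filename here are hypothetical, standing in for the CSV mentioned above):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the figure can be saved to disk
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=200, n_features=8, random_state=0)
names = np.array([f"f{i}" for i in range(X.shape[1])])  # hypothetical names
reg = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

order = np.argsort(reg.feature_importances_)  # ascending, for a barh chart
plt.barh(names[order], reg.feature_importances_[order])
plt.xlabel("feature importance")
plt.tight_layout()
plt.savefig("importance.png")
```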
47
votes
2 answers

How is the feature score (importance) in the XGBoost package calculated?

The command xgb.importance returns a graph of feature importance measured by an f score. What does this f score represent and how is it calculated? Output: Graph of feature importance
ishido
43
votes
1 answer

Understanding the `ngram_range` argument in a CountVectorizer in sklearn

I'm a little confused about how to use ngrams in the scikit-learn library in Python, specifically, how the ngram_range argument works in a CountVectorizer. Running this code: from sklearn.feature_extraction.text import CountVectorizer vocabulary…
tumultous_rooster
40
votes
2 answers

Correlated features and classification accuracy

I'd like to ask everyone a question about how correlated features (variables) affect the classification accuracy of machine learning algorithms. With correlated features I mean a correlation between them and not with the target class (i.e the…
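One practical step often paired with this question is detecting near-duplicate features before training; a minimal sketch of a correlation-based filter on synthetic data (the 0.95 threshold is an arbitrary, hypothetical choice):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=500)
df = pd.DataFrame({
    "a": a,
    "b": a + rng.normal(scale=0.01, size=500),  # ~perfectly correlated with a
    "c": rng.normal(size=500),                  # independent
})

corr = df.corr().abs()
# keep only the upper triangle so each pair is considered once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]
print(to_drop)  # features highly correlated with an earlier feature
```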
39
votes
2 answers

Feature selection using scikit-learn

I'm new to machine learning. I'm preparing my data for classification using Scikit Learn SVM. In order to select the best features I have used the following method: SelectKBest(chi2, k=10).fit_transform(A1, A2) Since my dataset consists of negative…
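The chi-squared test requires non-negative features, so with negative values there are two common workarounds, sketched below on synthetic stand-ins for `A1`/`A2`: rescale the data first, or switch to a score function that accepts negative inputs.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2, f_classif
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
A1 = rng.normal(size=(100, 20))          # contains negative values
A2 = rng.integers(0, 2, size=100)

# Option 1: rescale to [0, 1] so chi2 is applicable
A1_pos = MinMaxScaler().fit_transform(A1)
X_chi = SelectKBest(chi2, k=10).fit_transform(A1_pos, A2)

# Option 2: use a score function (ANOVA F-test) that allows negative values
X_f = SelectKBest(f_classif, k=10).fit_transform(A1, A2)
print(X_chi.shape, X_f.shape)
```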
39
votes
3 answers

Understanding max_features parameter in RandomForestRegressor

While constructing each tree in the random forest using bootstrapped samples, for each terminal node, we select m variables at random from p variables to find the best split (p is the total number of features in your data). My questions (for…
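For reference, `max_features` is the m in that description, sampled anew at each split (every internal node, not only terminal nodes); it accepts an int, a float fraction of p, or a string rule. A minimal sketch:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=100, n_features=9, random_state=0)

# m candidate features are drawn at random at each split
for mf in (3, 0.5, "sqrt"):  # 3 features, 50% of p, or sqrt(p)
    reg = RandomForestRegressor(n_estimators=10, max_features=mf,
                                random_state=0).fit(X, y)
    print(f"max_features={mf!r}: R^2 on train = {reg.score(X, y):.3f}")
```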
32
votes
3 answers

Information Gain calculation with Scikit-learn

I am using Scikit-learn for text classification. I want to calculate the Information Gain for each attribute with respect to a class in a (sparse) document-term matrix. The Information Gain is defined as H(Class) - H(Class | Attribute), where H is…
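H(Class) - H(Class | Attribute) is exactly the mutual information between attribute and class, so one common route is `mutual_info_classif`; a minimal sketch on a synthetic discrete matrix where the class equals the first attribute:

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(500, 3)).astype(float)
y = X[:, 0].astype(int)  # class is an exact copy of attribute 0

# Information Gain = H(Class) - H(Class | Attribute) = mutual information;
# discrete_features=True treats the columns as discrete counts
ig = mutual_info_classif(X, y, discrete_features=True, random_state=0)
print(ig)  # attribute 0 should score near H(y) ~ ln 2; the others near 0
```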
32
votes
2 answers

TypeError: only integer arrays with one element can be converted to an index

I'm getting the following error when performing recursive feature selection with cross-validation: Traceback (most recent call last): File "/Users/.../srl/main.py", line 32, in argident_sys.train_classifier() File…
feralvam
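For context, a minimal working sketch of recursive feature elimination with cross-validation (`RFECV`), the technique the traceback above comes from, on a synthetic substitute for the SRL data:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=10, n_informative=3,
                           random_state=0)

# RFECV repeatedly drops the weakest features, picking the count by CV score
selector = RFECV(LogisticRegression(max_iter=1000), cv=5).fit(X, y)
print(selector.n_features_)   # number of features kept
print(selector.support_)      # boolean mask over the original columns
```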
30
votes
8 answers

Scikit-Learn Linear Regression how to get coefficient's respective features?

I'm trying to perform feature selection by evaluating my regression's coefficient outputs, and select the features with the highest magnitude coefficients. The problem is, I don't know how to get the respective features, as only coefficients are…
jeffrey
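`coef_` follows the column order of the training matrix, so pairing coefficients with names is a matter of zipping them back together; a minimal sketch on a bundled dataset:

```python
import pandas as pd
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression

data = load_diabetes()
X = pd.DataFrame(data.data, columns=data.feature_names)
model = LinearRegression().fit(X, data.target)

# coef_ is aligned with X's columns, so index it by the column names
coef = pd.Series(model.coef_, index=X.columns)
print(coef.abs().sort_values(ascending=False).head())  # largest-magnitude first
```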
29
votes
2 answers

scikit learn - feature importance calculation in decision trees

I'm trying to understand how feature importance is calculated for decision trees in scikit-learn. This question has been asked before, but I am unable to reproduce the results the algorithm is providing. For example: from StringIO import…
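The computation can be reproduced by hand from the fitted `tree_` attributes: each split contributes its weighted impurity decrease to the splitting feature, and the totals are normalized to sum to 1. A sketch that checks this against `feature_importances_`:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
t = clf.tree_

# importance[f] = sum over f's split nodes of
#   N_node/N * (impurity - N_left/N_node*impurity_L - N_right/N_node*impurity_R)
imp = np.zeros(X.shape[1])
for node in range(t.node_count):
    if t.children_left[node] == -1:  # leaf node: no split, no contribution
        continue
    l, r = t.children_left[node], t.children_right[node]
    n = t.weighted_n_node_samples[node]
    nl = t.weighted_n_node_samples[l]
    nr = t.weighted_n_node_samples[r]
    imp[t.feature[node]] += (n / t.weighted_n_node_samples[0]) * (
        t.impurity[node] - (nl / n) * t.impurity[l] - (nr / n) * t.impurity[r])
imp /= imp.sum()  # normalize, matching scikit-learn's convention
print(imp)
```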
28
votes
2 answers

Should Feature Selection be done before Train-Test Split or after?

There is an apparent contradiction between two possible answers to this question: the conventional answer is to do it after splitting, since doing it before can leak information from the test set. The contradicting answer is…
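The standard way to sidestep the leakage problem is to put the selector inside a Pipeline, so it is re-fit on each training fold only; a minimal sketch:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=50, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Selection inside the Pipeline is fit only on each training fold, so no
# information from held-out data leaks into the chosen feature set
pipe = make_pipeline(SelectKBest(f_classif, k=10), SVC())
print(cross_val_score(pipe, X_tr, y_tr, cv=5).mean())

pipe.fit(X_tr, y_tr)
print(pipe.score(X_te, y_te))  # unbiased estimate on untouched test data
```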