
I'm new to machine learning. I'm preparing my data for classification using scikit-learn's SVM. To select the best features, I have used the following method:

SelectKBest(chi2, k=10).fit_transform(A1, A2)

Since my dataset consists of negative values, I get the following error:

ValueError                                Traceback (most recent call last)

/media/5804B87404B856AA/TFM_UC3M/test2_v.py in <module>()
----> 1 
      2 
      3 
      4 
      5 

/usr/local/lib/python2.6/dist-packages/sklearn/base.pyc in fit_transform(self, X, y, **fit_params)
    427         else:
    428             # fit method of arity 2 (supervised transformation)

--> 429             return self.fit(X, y, **fit_params).transform(X)
    430 
    431 

/usr/local/lib/python2.6/dist-packages/sklearn/feature_selection/univariate_selection.pyc in fit(self, X, y)
    300         self._check_params(X, y)
    301 
--> 302         self.scores_, self.pvalues_ = self.score_func(X, y)
    303         self.scores_ = np.asarray(self.scores_)
    304         self.pvalues_ = np.asarray(self.pvalues_)

/usr/local/lib/python2.6/dist-packages/sklearn/feature_selection/univariate_selection.pyc in chi2(X, y)
    190     X = atleast2d_or_csr(X)
    191     if np.any((X.data if issparse(X) else X) < 0):
--> 192         raise ValueError("Input X must be non-negative.")
    193 
    194     Y = LabelBinarizer().fit_transform(y)

ValueError: Input X must be non-negative.

Can someone tell me how I can transform my data?

sara

  • You could normalise the values to between 0 and 1 or take absolute values perhaps – EdChum Sep 11 '14 at 15:55
  • If your data is not non-negative, maybe chi2 is not a good method. You can use f_score. What is the nature of your data? – Andreas Mueller Sep 12 '14 at 17:07
  • Thank you EdChum and Andreas. My data consist of min, max, mean, median and FFT of accelerometer signal – sara Sep 12 '14 at 22:14

2 Answers


The error message Input X must be non-negative says it all: Pearson's chi-square test (goodness of fit) does not apply to negative values. This makes sense because the chi-square test assumes a frequency distribution, and a frequency can't be negative. Consequently, sklearn.feature_selection.chi2 asserts that the input is non-negative.

You say that your features are "min, max, mean, median and FFT of accelerometer signal". In many cases it may be quite safe to simply shift each feature so that it becomes non-negative, or even to normalize it to the [0, 1] interval as suggested by EdChum.
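For illustration, a minimal sketch of the shifting idea; the A1/A2 arrays below are made-up placeholders for your features and labels, and with real data you would compute the per-column minima on the training set and reuse them on the test set:

import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

# toy feature matrix with negative values and toy labels (placeholders)
A1 = np.array([[-1.0,  2.0,  0.5],
               [ 3.0, -4.0,  1.5],
               [ 0.0,  1.0, -2.5],
               [ 2.0,  0.5,  1.0]])
A2 = np.array([0, 1, 0, 1])

# shift each column so its minimum becomes 0; chi2 then accepts the input
A1_shifted = A1 - A1.min(axis=0)

A1_new = SelectKBest(chi2, k=2).fit_transform(A1_shifted, A2)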

If data transformation is for some reason not possible (e.g. a negative value is an important factor), you should pick another statistic to score your features, for example sklearn.feature_selection.f_classif (ANOVA F-value) or sklearn.feature_selection.mutual_info_classif (mutual information), both of which accept negative inputs.

Since the whole point of this procedure is to prepare the features for another method, it's not a big deal to pick any of them; the end result is usually the same or very close.
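For instance, a rough sketch using f_classif on synthetic data; the make_classification dataset here is only a stand-in (its features are roughly standardized, so they contain negative values), so swap in your own features and labels:

from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# synthetic stand-in data: 100 samples, 10 features, some of them negative
X, y = make_classification(n_samples=100, n_features=10, random_state=0)

# ANOVA F-test works directly on the original (possibly negative) features
X_new = SelectKBest(f_classif, k=5).fit_transform(X, y)
print(X_new.shape)  # (100, 5)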

Maxim
    Just use `sklearn.preprocessing.MinMaxScaler().fit_transform(YOUR_TRAINING_FEATURES_HERE)` with the default values to scale your training features to be from 0 to 1 – DonieM Dec 07 '19 at 22:43
  • "it's not a big deal to pick anyone", just wanted to check that I'm reading you correctly here - do you mean that it's not a bit deal to choose any of `f_classif` , `mutual_info_classif` , or `SelectKBest` – baxx Nov 15 '20 at 14:26
  • @DonieM I am using that right now, but I have the same error: ...scaler = MinMaxScaler() df1[self.num_features] = scaler.fit_transform(df1[self.num_features]) return df1 – spacedustpi Mar 11 '21 at 18:39
  • @Maxim - I am encountering similar error. But I filtered my dataframe to include only columns with positive values but I still get the same error. Can you help me please? https://stackoverflow.com/questions/71338163/chi2-score-error-input-x-must-be-non-negative – The Great Mar 03 '22 at 13:42

As others have mentioned, to get around the error you can scale the data to be between 0 and 1, select features from the scaled data, and use them to train your model.

import numpy as np
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.preprocessing import MinMaxScaler
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(random_state=0)
topk = 5

# scale the data to be between 0 and 1
sc = MinMaxScaler()
X_sc = sc.fit_transform(X)

# select from the scaled data
skb = SelectKBest(chi2, k=topk)
X_sc_selected = skb.fit_transform(X_sc, y)

# build model using (X_sc_selected, y)
lr = LogisticRegression(random_state=0)
lr.fit(X_sc_selected, y)

lr.score(X_sc_selected, y)  # 0.87

If the original data is important (i.e. you want to keep the negative values), you can also select columns using the top-k scores from SelectKBest: instead of transforming the data, slice it.

# fit feature selector with the scaled data
skb = SelectKBest(chi2, k=topk)
skb.fit(X_sc, y)

# column index of top-k features
cols = np.sort(skb.scores_.argsort()[-topk:])
# index the top-k features from X
X_selected = X[:, cols]

# build model using (X_selected, y)
lr = LogisticRegression(random_state=0)
lr.fit(X_selected, y)

lr.score(X_selected, y)  # 0.92

Note that skb.transform() effectively performs the same column indexing. For example, (X_sc[:, cols] == X_sc_selected).all() returns True.
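For what it's worth, the same column indices can also be obtained with get_support; this is just an alternative to the argsort line, assuming the fitted skb from above:

# indices of the selected columns, equivalent to the argsort approach (ignoring ties)
cols = skb.get_support(indices=True)
X_selected = X[:, cols]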

cottontail