I was trying out automated feature engineering and feature selection, and for that I used the Boston house price dataset available in sklearn.

from sklearn.datasets import load_boston
import pandas as pd
data = load_boston()
x = data.data
y= data.target
y = pd.DataFrame(y)

Then I applied the autofeat feature transformation library to the dataset.

import autofeat as af
clf = af.AutoFeatRegressor()
df = clf.fit_transform(x,y)
df = pd.DataFrame(df)

After this, I used SelectKBest with chi2 to score each feature in relation to the label.

from sklearn.feature_selection import SelectKBest, chi2
X_new = SelectKBest(chi2, k=20)
X_new_done = X_new.fit_transform(df,y)
dfscores = pd.DataFrame(X_new.scores_)
dfcolumns = pd.DataFrame(X_new_done.columns)
featureScores = pd.concat([dfcolumns,dfscores],axis=1)
featureScores.columns = ['Specs','Score']
print(featureScores.nlargest(10,'Score'))

This gave the following error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-16-b0fa1556bdef> in <module>()
      1 from sklearn.feature_selection import SelectKBest, chi2
      2 X_new = SelectKBest(chi2, k=20)
----> 3 X_new_done = X_new.fit_transform(df,y)
      4 dfscores = pd.DataFrame(X_new.scores_)
      5 dfcolumns = pd.DataFrame(X_new_done.columns)

ValueError: Input X must be non-negative.

My dataset has a few negative numbers. How can I overcome this problem?

Note: df has no transformations of y; it only has transformations of x.

Samar Pratap Singh

1 Answer

You have a feature with all negative values:

df['exp(x005)*log(x000)']

returns

0     -3630.638503
1     -2212.931477
2     -4751.790753
3     -3754.508972
4     -3395.387438
          ...
501   -2022.382877
502   -1407.856591
503   -2998.638158
504   -1973.273347
505   -1267.482741
Name: exp(x005)*log(x000), Length: 506, dtype: float64
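
To see which generated columns are affected, a quick check along these lines may help. This is a minimal sketch: the `demo` frame and its column names are hypothetical stand-ins for the autofeat output, not the actual `df` from the question.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the autofeat output (not the real df)
demo = pd.DataFrame({
    "x000": np.linspace(0.1, 1.0, 50),          # strictly positive
    "x001": np.linspace(-1.0, 1.0, 50),         # mixed signs
    "neg_feature": -np.linspace(1.0, 5.0, 50),  # strictly negative
})

# Columns containing at least one negative value (these make chi2 fail)
has_negative = (demo < 0).any()
negative_columns = has_negative[has_negative].index.tolist()

# Columns where every value is negative
all_negative_columns = [c for c in demo.columns if (demo[c] < 0).all()]

print(negative_columns)
print(all_negative_columns)
print(demo[negative_columns].min())  # most negative value per offending column
```

Running the same two checks on the real `df` should list every feature that triggers the error.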

Quoting another answer (https://stackoverflow.com/a/46608239/5025009):

The error message Input X must be non-negative says it all: Pearson's chi square test (goodness of fit) does not apply to negative values. It's logical because the chi square test assumes a frequency distribution and a frequency can't be a negative number. Consequently, sklearn.feature_selection.chi2 asserts the input is non-negative.

In many cases, it may be quite safe to simply shift each feature to make it all positive, or even normalize to [0, 1] interval as suggested by EdChum.
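
The rescaling idea can be sketched as follows, assuming `MinMaxScaler` from sklearn.preprocessing is an acceptable way to map each feature to [0, 1]. The data here is synthetic and the column names are made up for illustration. Note that chi2 expects a discrete class label, so the sketch uses a synthetic binary label rather than a continuous target such as a house price.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
n = 200
label = rng.integers(0, 2, size=n)  # synthetic binary class label (assumption)

# Synthetic features; "informative" is mostly negative and depends on the label
features = pd.DataFrame({
    "informative": -5.0 + 3.0 * label + rng.normal(0, 0.5, size=n),
    "noise_a": rng.normal(0, 1, size=n),
    "noise_b": rng.normal(-2, 1, size=n),
})

# Rescale every column to [0, 1] so that chi2 accepts the input
scaler = MinMaxScaler()
scaled = pd.DataFrame(scaler.fit_transform(features), columns=features.columns)

selector = SelectKBest(chi2, k=2)
selector.fit(scaled, label)

# scores_ has one entry per input column, so pair it with the input column names
feature_scores = pd.DataFrame({"Specs": scaled.columns, "Score": selector.scores_})
print(feature_scores.nlargest(2, "Score"))
```

Min-max scaling is sensitive to outliers, since a single extreme value compresses the rest of the column; whether that matters depends on the data.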

If data transformation is for some reason not possible (e.g. a negative value is an important factor), you should pick another statistic to score your features:

Since the whole point of this procedure is to prepare the features for another method, it's not a big deal to pick any one of them; the end result is usually the same or very close.
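
For a continuous target like a house price, one option is to swap the score function for `f_regression`, which accepts negative inputs. The sketch below is illustrative only: it uses synthetic data and made-up column names, and `f_regression` is one possible choice (`mutual_info_regression` is another), not necessarily the statistic the quoted answer had in mind. Note that `fit_transform` returns a NumPy array by default, so the column names are taken from the input frame rather than from the transformed output.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_regression

rng = np.random.default_rng(0)
n = 200

# Synthetic features; "strong" takes mostly negative values
features = pd.DataFrame({
    "strong": rng.normal(-3, 1, size=n),
    "weak": rng.normal(0, 1, size=n),
    "noise": rng.normal(0, 1, size=n),
})
# Continuous target driven mainly by "strong"
target = 4.0 * features["strong"] + 0.5 * features["weak"] + rng.normal(0, 1, size=n)

# f_regression has no non-negativity requirement, so no rescaling is needed
selector = SelectKBest(f_regression, k=2)
selected = selector.fit_transform(features, target)  # NumPy array by default

# Build the score table from the input frame's column names
feature_scores = pd.DataFrame({"Specs": features.columns, "Score": selector.scores_})
print(feature_scores.nlargest(3, "Score"))
```

`selector.get_support()` returns a boolean mask over the input columns, which can be used to recover the names of the selected features.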

seralouk