
Is there any way to remove a specific feature from a scikit-learn dataset? For example, I know it is possible to remove features using sklearn.feature_selection, but those are all automated procedures that remove features they decide are useless. Is there any way to implement a custom feature removal algorithm without going into the dirty insides of the data? For example, say I have a function that scores features; a toy example is provided here:

def score(feature_index):
    return 0 if feature_index == 1 else 1

Now say I want to remove all those features in the iris dataset that score less than 0.5. I want to do something like this:

from sklearn import datasets
iris = datasets.load_iris()
#this is the function I want:
iris.filter_features(score, threshold=0.5)

after which I would like the iris dataset to have one less feature. Right now, I can do it like so:

import numpy as np
from sklearn import datasets

iris = datasets.load_iris()
# Iterate in reverse so that popping a feature does not shift the
# indices of the features that are still to be checked.
for feature_index in reversed(range(len(iris.feature_names))):
    if score(feature_index) < 0.5:
        iris.feature_names.pop(feature_index)
        iris.data = np.delete(iris.data, feature_index, 1)

but this looks... dirty.
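For comparison, here is a sketch of the kind of helper I have in mind; filter_features is my own hypothetical function, not something that exists in sklearn, but it at least keeps data and feature_names in sync:

import numpy as np
from sklearn import datasets

def filter_features(dataset, score, threshold):
    """Keep only the features whose score reaches the threshold,
    updating data and feature_names together."""
    keep = [i for i in range(dataset.data.shape[1]) if score(i) >= threshold]
    dataset.data = dataset.data[:, keep]
    dataset.feature_names = [dataset.feature_names[i] for i in keep]

iris = datasets.load_iris()
filter_features(iris, score, threshold=0.5)
print(iris.data.shape)   # (150, 3)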

5xum
    See related: http://stackoverflow.com/questions/23405739/ignore-a-column-while-building-a-model-with-sklearn/23407329#23407329 this uses pandas to store the data but the principle is the same, you need to just define some list either as your feature selection or exclusion and then train again, nothing wrong with your current approach IMO. In pandas it would be easy to do the column selection/exclusion – EdChum Feb 03 '15 at 10:33
  • @EdChum I think there is nothing wrong with my approach, so long as it is done with care, but it can lead to trouble if one, for example, forgets to delete the appropriate feature name along with the column of the `data` array. – 5xum Feb 03 '15 at 12:28

2 Answers


There is no such thing as a scikit-learn dataset: the common data structures scikit-learn uses are just numpy arrays (or scipy sparse matrices):

>>> from sklearn.datasets import load_iris
>>> iris = load_iris()
>>> type(iris.data)
<class 'numpy.ndarray'>

You can use regular numpy array indexing to generate a new version of the data. For instance, to drop the second feature with a boolean mask:

>>> import numpy as np
>>> X = iris.data
>>> mask = np.array([True, False, True, True])
>>> X_masked = X[:, mask]

Note that the `:` in the first position means "all the rows".

To check, you can print the first 5 rows of each array:

>>> print(X[:5])
[[ 5.1  3.5  1.4  0.2]
 [ 4.9  3.   1.4  0.2]
 [ 4.7  3.2  1.3  0.2]
 [ 4.6  3.1  1.5  0.2]
 [ 5.   3.6  1.4  0.2]]
>>> print(X_masked[:5])
[[ 5.1  1.4  0.2]
 [ 4.9  1.4  0.2]
 [ 4.7  1.3  0.2]
 [ 4.6  1.5  0.2]
 [ 5.   1.4  0.2]]

You can also use integer-based fancy indexing to get the same result:

>>> index = np.array([0, 2, 3])
>>> X_indexed = X[:, index]
>>> print(X_indexed[:5])
[[ 5.1  1.4  0.2]
 [ 4.9  1.4  0.2]
 [ 4.7  1.3  0.2]
 [ 4.6  1.5  0.2]
 [ 5.   1.4  0.2]]
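To tie this back to the question's scoring function: you can build the mask from score and apply the same mask to feature_names so the two stay consistent (a small sketch; none of this is sklearn-specific):

>>> score = lambda i: 0 if i == 1 else 1   # the toy score from the question
>>> mask = np.array([score(i) >= 0.5 for i in range(X.shape[1])])
>>> X_selected = X[:, mask]
>>> [n for n, keep in zip(iris.feature_names, mask) if keep]
['sepal length (cm)', 'petal length (cm)', 'petal width (cm)']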

To learn more about basic numpy operations, have a look at a tutorial such as:

http://scipy-lectures.github.io/

ogrisel
  • Thanks for the answer, I am quite familiar with both scipy and numpy, so finding the solution (using `numpy.delete`, which is equivalent to your solution) was not hard. I was just hoping to find a nice tool with which the (for me) problematic part (making sure that the feature name is deleted as well as the feature) would be automated. – 5xum Feb 04 '15 at 20:28
  • You have to update your own copy of the `feature_names` array yourself. To avoid introducing bugs, just don't modify the original arrays and instead work explicitly on copies. – ogrisel Feb 05 '15 at 10:52

While there is no built-in class in sklearn to do this, you can easily create one with the standard fit and transform methods:

from sklearn.base import BaseEstimator, TransformerMixin

class ManualFeatureSelector(BaseEstimator, TransformerMixin):
    """
    Transformer for manual selection of features using the sklearn-style
    fit/transform interface.
    """

    def __init__(self, features):
        # Column indices (or a boolean mask) of the features to keep.
        self.features = features

    def fit(self, X, y=None):
        # Nothing to learn: the selection is fixed at construction time.
        return self

    def transform(self, X):
        return X[:, self.features]

Generally it is better to do manual feature selection like this outside of the sklearn framework, but I have come across situations where doing it as part of a Pipeline is helpful.

For example, if some code passes the same array to both a classifier and some other object, such as a display function, you may want to pass only certain columns on to the classifier. This is most easily done by replacing the classifier with a Pipeline that combines the above transformer with the original classifier, as sketched below.
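A minimal sketch of that pattern (the LogisticRegression classifier and the selected column indices are arbitrary choices for illustration):

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

iris = load_iris()

# The pipeline behaves like an ordinary classifier, but the
# LogisticRegression step only ever sees columns 0, 2 and 3.
clf = Pipeline([
    ('select', ManualFeatureSelector([0, 2, 3])),
    ('logreg', LogisticRegression()),
])
clf.fit(iris.data, iris.target)
predictions = clf.predict(iris.data)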

Hope that helps!