I have a feature union which uses some custom transformers to select text and parts of a dataframe. I would like to understand which features it's using.

The pipeline selects and transforms columns and then selects k best. I'm able to pull out the features from k best using the following code:

mask = union.named_steps['select_features'].get_support()

However, I am unable to apply this mask to the feature union output, because I can't work out how to retrieve the feature names of the final transformation. I think I need to define a 'get_feature_names' function within the custom transformer - see related post.

The pipeline is as follows:

union = Pipeline([
('feature_union', FeatureUnion([

    ('pipeline_1', Pipeline([
        ('selector', TextSelector(key='notes_1')),
        ('vectorise', CountVectorizer())
    ])),

    ('pipeline_2', Pipeline([
        ('selector', TextSelector(key='notes_2')),
        ('vectorise', CountVectorizer())
    ])),

    ('pipeline_3', Pipeline([
        ('selector', TextSelector(key='notes_3')),
        ('vectorise', CountVectorizer())
    ])),

    ('pipeline_4', Pipeline([
        ('selector', TextSelector(key='notes_4')),
        ('vectorise', CountVectorizer())
    ])),

    ('tf-idf_pipeline', Pipeline([
        ('selector', TextSelector(key='notes_5')),
        ('Tf-idf', TfidfVectorizer())
    ])),

    ('categorical_pipeline', Pipeline([
        ('selector', DataFrameSelector(['area', 'type', 'age'], True)),
        ('one_hot_encoding', OneHotEncoder(handle_unknown='ignore'))
    ]))
], n_jobs=-1)),
('select_features', SelectKBest(k='all')),
('classifier', MLPClassifier())
])

The custom transformers are as follows. NB: I've tried including a 'get_feature_names' function within each transformer, but it isn't working correctly:

class TextSelector(BaseEstimator, TransformerMixin):
   def __init__(self, key):
       self.key = key

   def fit(self, X, y=None):
       return self

   def transform(self, X):
       return X[self.key]

   def get_feature_names(self):
       return X[self.key].columns.tolist()


class DataFrameSelector(BaseEstimator, TransformerMixin):
   def __init__(self, attribute_names, factorize=False):
       self.attribute_names = attribute_names
       self.factorize = factorize

   def transform(self, X):
       selection = X[self.attribute_names]
       if self.factorize:
           selection = selection.apply(lambda p: pd.factorize(p)[0] + 1)
       return selection.values

   def fit(self, X, y=None):
       return self

   def get_feature_names(self):
       return X.columns.tolist()

Thanks for help.

shbfy
  • The post you link also suggests subclassing Pipeline and adding a get_feature_names() method. Did you try that as well? – Zouzias Feb 01 '18 at 08:29

3 Answers

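This worked for me. As advised in the comment, subclass Pipeline, add get_feature_names(), and use the subclass for each sub-pipeline inside the FeatureUnion:

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.pipeline import FeatureUnion, Pipeline

class MyPipeline(Pipeline):
    def get_feature_names(self):
        # Return the vocabulary of the first vectorizer step in this sub-pipeline.
        # Note: get_feature_names() was removed from the vectorizers in
        # scikit-learn 1.2; use get_feature_names_out() there instead.
        for name, step in self.steps:
            if isinstance(step, (CountVectorizer, TfidfVectorizer)):
                return step.get_feature_names()

union = Pipeline([
    ('feature_union', FeatureUnion([
        ('pipeline_1', MyPipeline([
            ('selector', TextSelector(key='notes_1')),
            ('vectorise', CountVectorizer())
        ])),
        # remaining sub-pipelines as in the question
    ])),
])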
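
The get_feature_names() methods in the question fail because X is not in scope there: the method receives no data, so any names have to be recorded during fit. A minimal sketch of that idea for the column selector follows. The `feature_names_` attribute and the toy DataFrame are my own assumptions, not something from the question, and newer scikit-learn releases use the get_feature_names_out(input_features) convention instead.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class DataFrameSelector(BaseEstimator, TransformerMixin):
    """Select DataFrame columns and remember their names."""

    def __init__(self, attribute_names):
        self.attribute_names = attribute_names

    def fit(self, X, y=None):
        # Record the selected column names while X is available.
        # `feature_names_` is a hypothetical attribute name chosen here.
        self.feature_names_ = X[self.attribute_names].columns.tolist()
        return self

    def transform(self, X):
        return X[self.attribute_names].values

    def get_feature_names(self):
        # No X here: return what was stored during fit.
        return self.feature_names_


# Toy data, invented for this example.
df = pd.DataFrame({
    "area": ["north", "south"],
    "type": ["a", "b"],
    "age": [30, 40],
})

selector = DataFrameSelector(["area", "type"]).fit(df)
names = selector.get_feature_names()
values = selector.transform(df)
print(names)
```

Note that if a OneHotEncoder follows the selector, as in the question, the final names come from the encoder rather than from the selector, so this only covers the selection step.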
nlyf

So far, the best way I have found to reach a nested transformer (thanks, edesz):

pipeline = Pipeline(steps=[
     ("union", FeatureUnion(
      transformer_list=[
        ("descriptor", Pipeline(steps=[
            ("selector", ItemSelector(column="Description")),
            ("tfidf", TfidfVectorizer(min_df=5, analyzer=u'word'))
        ]))
    ],...

pvect= dict(pipeline.named_steps['union'].transformer_list).get('descriptor').named_steps['tfidf']

This gives you the fitted TfidfVectorizer() instance, which you can pass to another function:

Show_most_informative_features(pvect,
           pipeline.named_steps['classifier'], n=MostIF)
Max Kleiner

If you know the name of the step (e.g. pipeline_1) and the name of the substep that holds the vectorizer (e.g. vectorise), then you can refer directly to the steps and substeps by name:

fnames = (
    dict(union.named_steps['feature_union'].transformer_list)
    .get('pipeline_1')
    .named_steps['vectorise']
    .get_feature_names()
)
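
To tie this back to the mask from the question, the lookup above can be repeated for every sub-pipeline and the combined names indexed with get_support(). The following is only a sketch under some assumptions: the toy data, the `vectorizer_names` helper, and the use of `chi2` with `k=2` are mine (the question uses `k='all'` with the default scoring function), and the classifier step is left out. Each name is prefixed with its sub-pipeline name, mirroring FeatureUnion's own `name__feature` convention.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.pipeline import FeatureUnion, Pipeline


class TextSelector(BaseEstimator, TransformerMixin):
    """Select a single text column from a DataFrame (as in the question)."""

    def __init__(self, key):
        self.key = key

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X[self.key]


def vectorizer_names(vectorizer):
    # Hypothetical helper: newer scikit-learn releases expose
    # get_feature_names_out(), older ones only get_feature_names().
    if hasattr(vectorizer, "get_feature_names_out"):
        return list(vectorizer.get_feature_names_out())
    return list(vectorizer.get_feature_names())


# Toy data, invented for this example.
df = pd.DataFrame({
    "notes_1": ["red apple", "green apple", "red car", "green car"],
    "notes_2": ["fast", "slow", "fast", "slow"],
})
y = [0, 0, 1, 1]

union = Pipeline([
    ("feature_union", FeatureUnion([
        ("pipeline_1", Pipeline([
            ("selector", TextSelector(key="notes_1")),
            ("vectorise", CountVectorizer()),
        ])),
        ("pipeline_2", Pipeline([
            ("selector", TextSelector(key="notes_2")),
            ("vectorise", TfidfVectorizer()),
        ])),
    ])),
    ("select_features", SelectKBest(chi2, k=2)),
])
union.fit(df, y)

# Collect names in the same order in which FeatureUnion stacks its outputs.
all_names = []
for name, sub_pipeline in union.named_steps["feature_union"].transformer_list:
    vec = sub_pipeline.named_steps["vectorise"]
    all_names.extend(f"{name}__{feat}" for feat in vectorizer_names(vec))

# Boolean mask over the stacked columns, one entry per feature name.
mask = union.named_steps["select_features"].get_support()
selected = np.array(all_names)[mask].tolist()
print(selected)
```

This relies on the name list having exactly one entry per column of the FeatureUnion output, so a sub-pipeline whose last step is not a vectorizer (such as the one-hot pipeline in the question) would need its own way of producing names.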

Source used

edesz