I have a feature union which uses some custom transformers to select text and parts of a dataframe. I would like to understand which features it's using.
The pipeline selects and transforms columns, then selects the k best features. I can pull the selected-feature mask out of the k-best step with the following code:
mask = union.named_steps['select_features'].get_support()
However, I am unable to apply this mask to the feature union output, because I'm struggling to get the feature names of the final transformation. I think I need to define a 'get_feature_names' function within each custom transformer - see related post.
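To make the goal concrete, here is a minimal sketch of what I would like to end up doing once I have a flat list of names. The names and mask below are made up for illustration; in the real pipeline the mask would come from `get_support()` and the names from the feature union.

```python
from itertools import compress

# Hypothetical feature names, in the same column order as the union output
feature_names = ['notes_1__apple', 'notes_1__pear', 'area_1', 'type_2']

# Hypothetical boolean mask standing in for get_support()
mask = [True, False, True, True]

# Keep only the names whose mask entry is True
selected = list(compress(feature_names, mask))
print(selected)
```

This only works if the list of names lines up one-to-one with the columns that reach the selection step, which is the part I can't get right.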
The pipeline is as follows:
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.preprocessing import OneHotEncoder
from sklearn.feature_selection import SelectKBest
from sklearn.neural_network import MLPClassifier

union = Pipeline([
    ('feature_union', FeatureUnion([
        ('pipeline_1', Pipeline([
            ('selector', TextSelector(key='notes_1')),
            ('vectorise', CountVectorizer())
        ])),
        ('pipeline_2', Pipeline([
            ('selector', TextSelector(key='notes_2')),
            ('vectorise', CountVectorizer())
        ])),
        ('pipeline_3', Pipeline([
            ('selector', TextSelector(key='notes_3')),
            ('vectorise', CountVectorizer())
        ])),
        ('pipeline_4', Pipeline([
            ('selector', TextSelector(key='notes_4')),
            ('vectorise', CountVectorizer())
        ])),
        ('tf-idf_pipeline', Pipeline([
            ('selector', TextSelector(key='notes_5')),
            ('Tf-idf', TfidfVectorizer())
        ])),
        ('categorical_pipeline', Pipeline([
            ('selector', DataFrameSelector(['area', 'type', 'age'], True)),
            ('one_hot_encoding', OneHotEncoder(handle_unknown='ignore'))
        ]))
    ], n_jobs=-1)),
    ('select_features', SelectKBest(k='all')),
    ('classifier', MLPClassifier())
])
The custom transformers are as follows. NB: I've tried including a 'get_feature_names' function within each transformer, but it isn't working correctly:
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class TextSelector(BaseEstimator, TransformerMixin):
    def __init__(self, key):
        self.key = key

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X[self.key]

    def get_feature_names(self):
        return X[self.key].columns.tolist()


class DataFrameSelector(BaseEstimator, TransformerMixin):
    def __init__(self, attribute_names, factorize=False):
        self.attribute_names = attribute_names
        self.factorize = factorize

    def transform(self, X):
        selection = X[self.attribute_names]
        if self.factorize:
            selection = selection.apply(lambda p: pd.factorize(p)[0] + 1)
        return selection.values

    def fit(self, X, y=None):
        return self

    def get_feature_names(self):
        return X.columns.tolist()
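One direction I have been experimenting with, shown here only as a rough sketch on toy data, is to skip 'get_feature_names' on the selectors altogether and read the fitted `vocabulary_` of the last step in each text sub-pipeline. The helper `vectoriser_names`, the toy DataFrame and the `name__term` prefix are my own assumptions for illustration, and this does not cover the one-hot encoded branch.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import FeatureUnion, Pipeline


class TextSelector(BaseEstimator, TransformerMixin):
    """Select a single text column from a DataFrame."""

    def __init__(self, key):
        self.key = key

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X[self.key]


def vectoriser_names(union):
    """Hypothetical helper: collect prefixed terms from each fitted sub-pipeline."""
    names = []
    for name, pipe in union.transformer_list:
        # The last step is assumed to be a fitted Count/Tfidf vectoriser
        vocab = pipe.steps[-1][1].vocabulary_
        # vocabulary_ maps term -> column index; sort by index to match column order
        terms = sorted(vocab, key=vocab.get)
        names.extend(f"{name}__{term}" for term in terms)
    return names


# Toy data standing in for the real DataFrame
df = pd.DataFrame({
    'notes_1': ['red apple', 'green apple'],
    'notes_2': ['fast car', 'slow car'],
})

toy_union = FeatureUnion([
    ('pipeline_1', Pipeline([
        ('selector', TextSelector(key='notes_1')),
        ('vectorise', CountVectorizer()),
    ])),
    ('pipeline_2', Pipeline([
        ('selector', TextSelector(key='notes_2')),
        ('vectorise', CountVectorizer()),
    ])),
])
toy_union.fit(df)

feature_names = vectoriser_names(toy_union)
print(feature_names)
```

On the toy data the number of names matches the number of columns in the union output, so the mask from the selection step could in principle be applied to this list. I'm not sure this is the intended way to do it, or how best to extend it to the categorical pipeline.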
Thanks for any help.