
I've built a pipeline in Scikit-Learn with two steps: one to construct features, and a second that is a RandomForestClassifier.

While I can save that pipeline, inspect the various steps, and see the parameters set in each step, I'd like to be able to examine the feature importances from the resulting model.

Is that possible?


2 Answers


Ah, yes it is.

You first identify the step containing the estimator you want to inspect. For instance:

pipeline.steps[1]

Which returns:

('predictor',
 RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
             max_depth=None, max_features='auto', max_leaf_nodes=None,
             min_samples_leaf=1, min_samples_split=2,
             min_weight_fraction_leaf=0.0, n_estimators=50, n_jobs=2,
             oob_score=False, random_state=None, verbose=0,
             warm_start=False))

You can then access the model step directly:

pipeline.steps[1][1].feature_importances_
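
For completeness, here is a minimal, self-contained sketch of the same access pattern; the feature step, data, and parameters below are stand-ins rather than the OP's actual setup:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in data and feature step, just to have something to fit.
X, y = make_classification(n_samples=100, n_features=5, random_state=0)
pipeline = Pipeline([
    ("features", StandardScaler()),
    ("predictor", RandomForestClassifier(n_estimators=50, random_state=0)),
])
pipeline.fit(X, y)

# steps[1] is the ("predictor", RandomForestClassifier(...)) tuple,
# so [1][1] picks out the fitted estimator itself.
print(pipeline.steps[1][1].feature_importances_)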
  • And to get the name of the features, you'd look at pipe.steps[0][1].get_feature_names() – Devon Jul 13 '17 at 23:27
  • This is an incomplete answer. Preprocessing and feature engineering are usually part of a pipeline. Therefore, you'd need to take this into account. – ben26941 Mar 21 '18 at 14:05
  • If there is more than 1 step, then one approach is to [use the name of the step to retrieve the estimator](https://stackoverflow.com/a/28837740). For the OP's case, this could be `pipeline.named_steps['predictor'].feature_importances_`. – edesz Sep 14 '18 at 01:11
  • how can you change the feature importance type? – Maths12 Nov 10 '20 at 16:29

I wrote an article on doing this in general, which you can find here.

In general, for a pipeline you can access the named_steps attribute. This gives you each transformer in the pipeline by name. For example, for this pipeline:

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

model = Pipeline([
    ("vectorizer", CountVectorizer()),
    ("transformer", TfidfTransformer()),
    ("classifier", classifier),  # any estimator defined earlier
])

we could access the individual feature steps with model.named_steps["vectorizer"].get_feature_names(). This returns the list of feature names from the CountVectorizer (the TfidfTransformer preserves them one-to-one, and has no get_feature_names of its own). This is all fine and good, but it doesn't cover many real use cases, since we normally want to combine a few features. Take this model for example:
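
As a quick, self-contained check (the documents, labels, and choice of classifier here are my own stand-ins):

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

model = Pipeline([
    ("vectorizer", CountVectorizer()),
    ("transformer", TfidfTransformer()),
    ("classifier", LogisticRegression()),
])
model.fit(["the best movie", "the worst movie"], [1, 0])

print(model.named_steps["vectorizer"].get_feature_names())
# ['best', 'movie', 'the', 'worst']
# (newer scikit-learn versions rename this to get_feature_names_out)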

from sklearn.pipeline import FeatureUnion
from sklearn.feature_extraction.text import TfidfVectorizer

model = Pipeline([
    ("union", FeatureUnion(transformer_list=[
        ("h1", TfidfVectorizer(vocabulary={"worst": 0})),
        ("h2", TfidfVectorizer(vocabulary={"best": 0})),
        ("h3", TfidfVectorizer(vocabulary={"awful": 0})),
        ("tfidf_cls", Pipeline([
            ("vectorizer", CountVectorizer()),
            ("transformer", TfidfTransformer()),
        ])),
    ])),
    ("classifier", classifier),
])

Here we combine a few features using a FeatureUnion and a sub-pipeline. To access these features, we'd need to call each named step explicitly, in order. For example, to get the feature names from the internal TF-IDF sub-pipeline, we'd have to do:

model.named_steps["union"].transformer_list[3][1].named_steps["vectorizer"].get_feature_names()
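
A slightly less index-dependent sketch of the same lookup, using only plain Python on top of transformer_list, would be:

# Build a name -> transformer dict so the sub-pipeline is looked up by
# name ("tfidf_cls") rather than by position (index 3).
union = model.named_steps["union"]
tfidf_cls = dict(union.transformer_list)["tfidf_cls"]
print(tfidf_cls.named_steps["vectorizer"].get_feature_names())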

That's kind of a headache, but it is doable. Usually I use a variation of the following snippet to get them. The code below treats nested Pipelines and FeatureUnions as a tree and performs a DFS, combining the feature names as it goes.

from typing import List

from sklearn.pipeline import FeatureUnion, Pipeline

def get_feature_names(model, names: List[str], name: str) -> List[str]:
    """This method extracts the feature names in order from a Sklearn Pipeline.

    This method only works with composed Pipelines and FeatureUnions.  It will
    pull out all names using DFS from a model.

    Args:
        model: The model we are interested in.
        names: The list of names of the final featurization steps.
        name: The name of the current step being evaluated.

    Returns:
        feature_names: The list of feature names extracted from the pipeline.
    """
    # Check if the name is one of our feature steps.  This is the base case.
    if name in names:
        # If it has the named_steps attribute it's a pipeline, and we need to
        # access the features of the inner step.
        if hasattr(model, "named_steps"):
            return extract_feature_names(model.named_steps[name], name)
        # Otherwise get the features directly.
        else:
            return extract_feature_names(model, name)
    elif type(model) is Pipeline:
        feature_names = []
        for name in model.named_steps.keys():
            feature_names += get_feature_names(model.named_steps[name], names, name)
        return feature_names
    elif type(model) is FeatureUnion:
        feature_names = []
        for name, new_model in model.transformer_list:
            feature_names += get_feature_names(new_model, names, name)
        return feature_names
    # If it is none of the above, do not add it.
    else:
        return []

You'll also need this method, which operates on individual transformers (things like the TfidfVectorizer) to get the names. Scikit-Learn doesn't have a universal get_feature_names, so you have to fudge it a bit for each different case. This is my attempt at doing something reasonable for most use cases.

def extract_feature_names(model, name) -> List[str]:
    """Extracts the feature names from arbitrary sklearn models.

    Args:
        model: The sklearn model, transformer, clustering algorithm, etc.
            for which we want named features.
        name: The name of the current step in the pipeline.

    Returns:
        The list of feature names.  If the model does not have named features,
        it constructs feature names by appending an index to the provided name.
    """
    if hasattr(model, "get_feature_names"):
        return model.get_feature_names()
    elif hasattr(model, "n_clusters"):
        return [f"{name}_{x}" for x in range(model.n_clusters)]
    elif hasattr(model, "n_components"):
        return [f"{name}_{x}" for x in range(model.n_components)]
    elif hasattr(model, "components_"):
        n_components = model.components_.shape[0]
        return [f"{name}_{x}" for x in range(n_components)]
    elif hasattr(model, "classes_"):
        # Fixed: the original returned a bare `classes_` (a NameError).
        return list(model.classes_)
    else:
        return [name]
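
To tie the two functions together, here is a usage sketch against the FeatureUnion model above. The documents, labels, and classifier are my own assumptions (and classifier must be defined before that model is constructed); the names list marks which steps to treat as featurization leaves.

from sklearn.linear_model import LogisticRegression

# Assumed stand-in; define this before building the `model` above.
classifier = LogisticRegression()

# Toy documents so the vectorizers have vocabularies after fitting.
docs = ["the best movie", "the worst movie", "an awful movie"]
model.fit(docs, [1, 0, 0])

# Treat these step names as the featurization leaves of the tree.
leaves = ["h1", "h2", "h3", "vectorizer"]
print(get_feature_names(model, names=leaves, name=""))
# ['worst', 'best', 'awful', 'an', 'awful', 'best', 'movie', 'the', 'worst']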