'Pipeline' object has no attribute 'get_feature_names' in scikit-learn

Question

I am clustering some of my documents using the mini-batch k-means and k-means algorithms. I followed the tutorial on the scikit-learn website: http://scikit-learn.org/stable/auto_examples/text/document_clustering.html

The tutorial uses several vectorizers, one of which is HashingVectorizer. For the HashingVectorizer case, it builds a pipeline with TfidfTransformer().

# Perform an IDF normalization on the output of HashingVectorizer
hasher = HashingVectorizer(n_features=opts.n_features,
                               stop_words='english', non_negative=True,
                               norm=None, binary=False)
vectorizer = make_pipeline(hasher, TfidfTransformer())

After doing so, the vectorizer I get back does not have the method get_feature_names(). But since I am using it for clustering, I need to get the "terms" using get_feature_names():

terms = vectorizer.get_feature_names()
for i in range(true_k):
    print("Cluster %d:" % i, end='')
    for ind in order_centroids[i, :10]:
        print(' %s' % terms[ind], end='')
    print()

How do I solve this error?

My whole code is shown below:

X_train_vecs, vectorizer = vector_bow.count_tfidf_vectorizer(_contents)
mini_kmeans_batch = MiniBatchKmeansTechnique()
# MiniBatchKmeans without the LSA dimensionality reduction
mini_kmeans_batch.mini_kmeans_technique(number_cluster=8, X_train_vecs=X_train_vecs,
                                                vectorizer=vectorizer, filenames=_filenames, contents=_contents, is_dimension_reduced=False)

The count vectorizer piped with tfidf:

def count_tfidf_vectorizer(self,contents):
    count_vect = CountVectorizer()
    vectorizer = make_pipeline(count_vect,TfidfTransformer())
    X_train_vecs = vectorizer.fit_transform(contents)
    print("The count of bow : ", X_train_vecs.shape)
    return X_train_vecs, vectorizer

and the mini_batch_kmeans class is as below:

class MiniBatchKmeansTechnique():
    def mini_kmeans_technique(self, number_cluster, X_train_vecs, vectorizer,
                              filenames, contents, svd=None, is_dimension_reduced=True):
        km = MiniBatchKMeans(n_clusters=number_cluster, init='k-means++', max_iter=100, n_init=10,
                         init_size=1000, batch_size=1000, verbose=True, random_state=42)
        print("Clustering sparse data with %s" % km)
        t0 = time()
        km.fit(X_train_vecs)
        print("done in %0.3fs" % (time() - t0))
        print()
        cluster_labels = km.labels_.tolist()
        print("List of the cluster names is : ",cluster_labels)
        data = {'filename':filenames, 'contents':contents, 'cluster_label':cluster_labels}
        frame = pd.DataFrame(data=data, index=[cluster_labels], columns=['filename', 'contents', 'cluster_label'])
        print(frame['cluster_label'].value_counts(sort=True,ascending=False))
        print()
        grouped = frame['cluster_label'].groupby(frame['cluster_label'])
        print(grouped.mean())
        print()
        print("Top Terms Per Cluster :")

        if is_dimension_reduced:
            if svd != None:
                original_space_centroids = svd.inverse_transform(km.cluster_centers_)
                order_centroids = original_space_centroids.argsort()[:, ::-1]
        else:
            order_centroids = km.cluster_centers_.argsort()[:, ::-1]

        terms = vectorizer.get_feature_names()
        for i in range(number_cluster):
            print("Cluster %d:" % i, end=' ')
            for ind in order_centroids[i, :10]:
                print(' %s' % terms[ind], end=',')
            print()
            print("Cluster %d filenames:" % i, end='')
            for file in frame.ix[i]['filename'].values.tolist():
                print(' %s,' % file, end='')
            print()

Have you fitted the pipeline? Please post the complete code. — Vivek Kumar, Jun 21 '17 at 12:07
First thing is, what are you using? HashingVectorizer or CountVectorizer? — Vivek Kumar, Jun 21 '17 at 12:35
Second, there is no need to make pipeline for CountVectorizer and TfidfTransformer. Use TfidfVectorizer instead. — Vivek Kumar, Jun 21 '17 at 12:36
Third, in the tutorial, they are calling `get_feature_names()` only on non hashing pipeline. See the `if block` just above where they use `get_feature_names()`. — Vivek Kumar, Jun 21 '17 at 12:37

Mikhail Korobov · Accepted Answer · 2017-06-21T15:25:57.867

Pipeline doesn't have a get_feature_names() method, as it is not straightforward to implement this method for Pipeline - one needs to consider all pipeline steps to get feature names. See https://github.com/scikit-learn/scikit-learn/issues/6424, https://github.com/scikit-learn/scikit-learn/issues/6425, etc. - there are a lot of related tickets and several attempts to fix it.

If your pipeline is simple (TfidfVectorizer followed by MiniBatchKMeans) then you can get feature names from TfidfVectorizer.
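
As a minimal sketch of that simple case (the toy corpus and variable names below are hypothetical, not taken from the question), the vectorizer can be kept as its own object next to the clusterer, so its fitted vocabulary stays reachable. One assumption worth flagging: get_feature_names() was removed in scikit-learn 1.2 in favour of get_feature_names_out(), so the sketch uses whichever of the two the installed version provides.

```python
from sklearn.cluster import MiniBatchKMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus (hypothetical) with two obvious topics.
docs = [
    "cats purr and cats meow",
    "kittens and cats chase mice",
    "cats nap while kittens play",
    "stocks rallied as markets rose",
    "markets fell and stocks dropped",
    "investors sold stocks as markets slid",
]

# Keep the vectorizer as its own object instead of hiding it in a Pipeline,
# so the fitted vocabulary stays reachable.
tfidf = TfidfVectorizer(stop_words="english")
X = tfidf.fit_transform(docs)

km = MiniBatchKMeans(n_clusters=2, n_init=10, random_state=42)
km.fit(X)

# get_feature_names() was removed in scikit-learn 1.2; newer releases
# expose get_feature_names_out(). Use whichever this install provides.
get_names = getattr(tfidf, "get_feature_names_out", None) or tfidf.get_feature_names
terms = list(get_names())

# Column j of X and of cluster_centers_ corresponds to terms[j].
order_centroids = km.cluster_centers_.argsort()[:, ::-1]
top_terms = [[terms[j] for j in order_centroids[i, :3]] for i in range(km.n_clusters)]
for i, words in enumerate(top_terms):
    print("Cluster %d: %s" % (i, ", ".join(words)))
```

The same indexing works for the code in the question once `vectorizer` there is a plain TfidfVectorizer rather than a Pipeline.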

If you want to use HashingVectorizer, it is more complicated, as HashingVectorizer doesn't provide feature names by design. HashingVectorizer doesn't store a vocabulary and uses hashes instead. This means it can be applied in an online setting and needs no memory for a vocabulary, but the tradeoff is exactly that you don't get feature names.

It is still possible to get feature names from HashingVectorizer though; to do this you need to apply it to a sample of documents, store which hashes correspond to which words, and this way learn what these hashes mean, i.e. what the feature names are. There may be collisions, so it is not possible to be 100% sure the feature name is correct, but usually this approach works ok. This approach is implemented in the eli5 library; see http://eli5.readthedocs.io/en/latest/tutorials/sklearn-text.html#debugging-hashingvectorizer for an example. You will have to do something like this, using InvertableHashingVectorizer:

from eli5.sklearn import InvertableHashingVectorizer
ivec = InvertableHashingVectorizer(vec)  # vec is a HashingVectorizer instance
# content_sample is a sample from contents; you can use the
# whole contents array, or just e.g. every 10th element
ivec.fit(content_sample)
hashing_feat_names = ivec.get_feature_names()

Then you can use hashing_feat_names as your feature names, as TfidfTransformer doesn't change input vector size and just scales the same features.
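
To make the idea of "learning what the hashes mean" concrete, here is a dependency-free sketch of the technique. It is not eli5's implementation and does not reproduce HashingVectorizer's column assignment (scikit-learn uses MurmurHash3; MD5 is swapped in here only because it ships with Python). The names `bucket`, `learn_bucket_names`, `N_FEATURES` and the `FEATURE[j]` placeholder are all hypothetical.

```python
import hashlib
from collections import defaultdict

N_FEATURES = 16  # hypothetical, deliberately small so collisions can appear


def bucket(token, n_features=N_FEATURES):
    """Map a token to a column index with a stable hash (MD5, for illustration only)."""
    digest = hashlib.md5(token.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "little") % n_features


def learn_bucket_names(sample_docs, n_features=N_FEATURES):
    """Record which tokens from a sample of documents land in each column."""
    seen = defaultdict(set)
    for doc in sample_docs:
        for token in doc.lower().split():
            seen[bucket(token, n_features)].add(token)
    # Columns never hit by the sample get a placeholder name; collisions
    # are reported by joining every token that shares the column.
    return [
        "|".join(sorted(seen[j])) if j in seen else "FEATURE[%d]" % j
        for j in range(n_features)
    ]


sample = ["cats chase mice", "stocks and markets fell"]
names = learn_bucket_names(sample)
for j, name in enumerate(names):
    print(j, name)
```

A column name containing `|` marks a collision: several sampled tokens share that column, which is why the recovered names cannot be trusted 100%.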

mkaran · Answer 2 · 2017-06-21T13:49:29.297

From the make_pipeline documentation:

This is a shorthand for the Pipeline constructor; it does not require, and
    does not permit, naming the estimators. Instead, their names will be set
    to the lowercase of their types automatically.

so, in order to access the feature names, after you have fitted to data, you can:

# Apply TF-IDF weighting to the output of CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import make_pipeline

count_vect = CountVectorizer(stop_words='english')
tfidf = TfidfTransformer()
vectorizer = make_pipeline(count_vect, tfidf)
# ...
# fit to the data
# ...

# the step name is the lowercased class name, here 'countvectorizer'
# (get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() there)
terms = vectorizer.named_steps[count_vect.__class__.__name__.lower()].get_feature_names()

# or to be more precise, as used in `_name_estimators`:
# terms = vectorizer.named_steps[type(count_vect).__name__.lower()].get_feature_names()
# btw TfidfTransformer and HashingVectorizer do not have get_feature_names afaik
# so a HashingVectorizer + TfidfTransformer pipeline has no step to ask for names

Hope this helps, good luck!

Edit: After seeing your updated question with the example you follow, @Vivek Kumar is correct: the line terms = vectorizer.get_feature_names() will not run for the pipeline. In the tutorial it only runs when the vectorizer is a plain TfidfVectorizer:

vectorizer = TfidfVectorizer(max_df=0.5, max_features=opts.n_features,
                                 min_df=2, stop_words='english',
                                 use_idf=opts.use_idf)

Thank you this has helped me a lot. Kindly do note that not all transformers have the get_feature_names function. For my case, the PolynomialFeatures() has the get_feature_names function but the StandardScaler() does not and throws an error. — Jane Kathambi, Jun 28 '21 at 17:18
@JaneKathambi Glad it helped :) Thanks for your note, good to know `StandardScaler` does not have this function too. I do have a comment about the missing get_feature_names function of `TfidfTransformer` and `HashingVectorizer` (`# btw TfidfTransformer and HashingVectorizer do not have get_feature_names afaik`) — mkaran, Jun 30 '21 at 16:43
