Sorting TfidfVectorizer output by tf-idf (lowest to highest and vice versa)

Question

I'm using TfidfVectorizer() from sklearn on part of my text data to get a sense of the term frequency of each feature (word). My current code is the following:

from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer(analyzer='word', stop_words = 'english')

# fit_transform on training data
X_traintfidf = tfidf.fit_transform(X_train)

I want to sort the tf-idf values of each term in 'X_traintfidf' from lowest to highest (and vice versa), keep, say, the top 10, and turn those two rankings into two Series objects. How should I proceed from the last line of my code?

Thank you.

I was reading a similar thread but couldn't figure out how to do it. Maybe someone will be able to connect the tips shown in that thread to my question here.

score 11 · Accepted Answer · edited Dec 06 '18 at 21:43


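After fit_transform(), the fitted vocabulary is available through the get_feature_names_out() method (get_feature_names(), which this answer originally used, was deprecated in scikit-learn 1.0 and removed in 1.2). You can sum each term's tf-idf weight over all documents and sort the result:

import pandas as pd

terms = tfidf.get_feature_names_out()

# sum the tf-idf weights of each term across all documents
sums = X_traintfidf.sum(axis=0)

# pair each term with its summed weight
data = []
for col, term in enumerate(terms):
    data.append((term, sums[0, col]))

ranking = pd.DataFrame(data, columns=['term', 'rank'])
# ascending=False gives highest to lowest; use ascending=True for the reverse
print(ranking.sort_values('rank', ascending=False))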

edited Dec 06 '18 at 21:43

Midimistro


answered Sep 30 '17 at 17:53

Adelson Araújo


Since tf is term frequency, i.e. term count/ total number of terms, I believe the correct thing to do to calculate 'sums' is to find a dot product of X_traintfidf with the vector of the lengths of the document collection in X_train – David Makovoz Nov 05 '18 at 17:29
