
I'm using TfidfVectorizer() from sklearn on part of my text data to get a sense of the term frequency of each feature (word). My current code is the following:

from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer(analyzer='word', stop_words = 'english')

# fit_transform on training data
X_traintfidf = tfidf.fit_transform(X_train)

Suppose I want to sort the tf-idf values of the terms in 'X_traintfidf' from lowest to highest (and vice versa), keep, say, the top 10 of each, and store these two rankings as two Series objects. How should I proceed from the last line of my code?

Thank you.

I was reading a similar thread but couldn't figure out how to do it. Maybe someone will be able to connect the tips shown in that thread to my question here.

Chris T.

1 Answer


After fit_transform(), you have access to the fitted vocabulary through the get_feature_names_out() method (get_feature_names() in scikit-learn versions before 1.0). You can do this:

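import pandas as pd

# get_feature_names() was removed in scikit-learn 1.2; on versions before 1.0, call it instead
terms = tfidf.get_feature_names_out()

# sum the tf-idf score of each term across all documents
sums = X_traintfidf.sum(axis=0)

# pair each term with its summed score
data = []
for col, term in enumerate(terms):
    data.append((term, sums[0, col]))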

ranking = pd.DataFrame(data, columns=['term','rank'])
print(ranking.sort_values('rank', ascending=False))
Midimistro
Adelson Araújo
  • Since tf is term frequency, i.e. term count / total number of terms, I believe the correct way to calculate 'sums' is to take the dot product of X_traintfidf with the vector of document lengths for the documents in X_train. – David Makovoz Nov 05 '18 at 17:29