
There is this code:

import numpy as np

feature_array = np.array(tfidf.get_feature_names())
# scores for a single document, sorted in descending order
tfidf_sorting = np.argsort(response.toarray()).flatten()[::-1]

n = 3
top_n = feature_array[tfidf_sorting][:n]

coming from this answer.

My question is: how can I do this efficiently when my sparse matrix is too big to convert to a dense matrix all at once (with response.toarray())?

Apparently, the general answer is to split the sparse matrix into chunks, convert each chunk to a dense array in a for loop, and then combine the results across all chunks.
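
For example, something along these lines is what I imagine (a rough sketch, untested; chunk_size is arbitrary, and tfidf, response and n are as in the snippet above):

import numpy as np

chunk_size = 1000
feature_array = np.array(tfidf.get_feature_names())
top_n_per_doc = []
for start in range(0, response.shape[0], chunk_size):
    # densify only one chunk of rows at a time
    chunk = response[start:start + chunk_size].toarray()
    # column indices of the n highest-scoring features in each row
    idx = np.argsort(chunk, axis=1)[:, ::-1][:, :n]
    top_n_per_doc.append(feature_array[idx])
top_n_per_doc = np.vstack(top_n_per_doc)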

But I would like to see a complete, verified version of this code.


1 Answer


If you look closely at that question, it is about knowing the top tf-idf scores for a single document.

When you want to do the same thing for a large corpus, you need to sum the scores of each feature across all documents (this still isn't very meaningful, because the scores are L2-normalized per document in TfidfVectorizer(); read here). Instead, I would recommend looking at the .idf_ attribute to find the features with a high inverse document frequency score.

In case you want to know the top features based on their number of occurrences, use CountVectorizer():

from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
corpus = [
    'I would like to check this document',
    'How about one more document',
    'Aim is to capture the key words from the corpus'
]
vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(corpus)
feature_array = vectorizer.get_feature_names()

top_n = 3

print('tf_idf scores: \n', sorted(zip(feature_array,
                                      X.sum(0).getA1()),
                                  key=lambda x: x[1], reverse=True)[:top_n])
# tf_idf scores:
# [('document', 1.4736296010332683), ('check', 0.6227660078332259), ('like', 0.6227660078332259)]

print('idf values: \n', sorted(zip(feature_array, vectorizer.idf_),
                               key=lambda x: x[1], reverse=True)[:top_n])

# idf values: 
#  [('aim', 1.6931471805599454), ('capture', 1.6931471805599454), ('check', 1.6931471805599454)]

vectorizer = CountVectorizer(stop_words='english')
X = vectorizer.fit_transform(corpus)
feature_array = vectorizer.get_feature_names()
print('Frequency: \n', sorted(zip(feature_array,
                                  X.sum(0).getA1()),
                              key=lambda x: x[1], reverse=True)[:top_n])

# Frequency: 
#  [('document', 2), ('aim', 1), ('capture', 1)]
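
Note that X.sum(0) above operates on the sparse matrix directly, so nothing gets densified. If you still want the chunked pattern asked about in the question (for example, to bound memory while accumulating), a minimal sketch (chunk size arbitrary) could look like this:

import numpy as np

def column_sums_in_chunks(X, chunk_size=10000):
    # accumulate the per-feature score sums, chunk_size rows at a time
    totals = np.zeros(X.shape[1])
    for start in range(0, X.shape[0], chunk_size):
        chunk = X[start:start + chunk_size]  # row slicing keeps it sparse
        totals += np.asarray(chunk.sum(axis=0)).ravel()
    return totals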
  • Hey, thank you for your answer (upvote). Yes, I want this across all documents, and I see what you mean about the L2 normalization. In this sense, perhaps it is better to go for a simple count (CountVectorizer). By the way, my question was more about how to do this on a big tf-idf sparse matrix: does your code work in this case too, or will I get a memory error? I think it actually does, since you directly `.sum()`. Moreover, I think that you can also answer this question of mine: https://stackoverflow.com/questions/56703244/find-top-n-terms-with-highest-tf-idf-score-per-class - please do if you can :) . – Outcast Jun 24 '19 at 10:46
  • I think it can work for a big sparse matrix, since I am not using `.toarray()`. – Venkatachalam Jun 24 '19 at 11:27
  • Yes, this is what I think too - I have not tested yet. – Outcast Jun 24 '19 at 11:33
  • By the way, I do not know if you have this in mind, but I think that your code above, and specifically how you use `sorted`, does not sort the words by their (tf-idf, idf, etc.) value but by the name of the word. – Outcast Jun 25 '19 at 15:24
  • You have to use `.sort(key=lambda x: x[1], reverse=True)` or something similar. – Outcast Jun 25 '19 at 16:11