tf-idf - accessing a large sparse scipy matrix & getting the highest values

Question

For a tf-idf result matrix, I want to get the top tf-idf values. I saw that one can set `max_features` on the tf-idf vectorizer, but that keeps the words with the highest term frequency (tf). I still want the highest tf-idf values, which could include words with a low tf. One idea I found is `tf_idf_matrix.sum(axis=0)`, which sums each column. This works in my code, but with 113k columns, print won't show them all. If I could use something like `argsort()` to access the top K column sums, that would be helpful.

This question follows on from my original question, which is here.

The reason is that I want to know which words I should look at more closely, and those are not necessarily the ones with the highest frequency. I would also like to know about the "anomalies", that is, words that might not appear in all or many documents/posts but could have a high tf-idf in one or a few documents. I wanted to explain this in case there are other approaches I should consider.

Thanks

To get the `k` highest column sums: `col_sum = tf_idf_matrix.sum(axis=0).A.squeeze(); idx = np.argsort(col_sum)[-k:][::-1]`. Now `idx` holds the column numbers of the top `k` column sums, and you can get the values from `col_sum[idx]`. - Jaime, Nov 14 '13 at 01:06
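
The comment's one-liner can be expanded into a small runnable sketch. The helper names `top_k_by_column_sum` and `top_k_by_column_max` and the tiny `demo` matrix are made up for illustration; the second helper is not from the comment, it is one possible way to surface the "anomaly" words the question mentions, by ranking columns on their largest single-document value instead of their sum.

```python
import numpy as np
from scipy import sparse


def top_k_by_column_sum(matrix, k):
    """Return (column indices, sums) for the k largest column sums, descending."""
    # Summing a sparse matrix over axis 0 gives a dense 1 x n_cols result;
    # flatten it to a plain 1-D array.
    col_sum = np.asarray(matrix.sum(axis=0)).ravel()
    # argsort is ascending, so reverse it and keep the first k entries.
    idx = np.argsort(col_sum)[::-1][:k]
    return idx, col_sum[idx]


def top_k_by_column_max(matrix, k):
    """Return (column indices, maxima) for the k largest per-column maxima."""
    # max(axis=0) on a sparse matrix returns a sparse row; densify and flatten.
    col_max = matrix.max(axis=0).toarray().ravel()
    idx = np.argsort(col_max)[::-1][:k]
    return idx, col_max[idx]


# Hypothetical stand-in for tf_idf_matrix: 3 documents x 4 terms.
demo = sparse.csr_matrix(np.array([
    [0.1, 0.0, 0.5, 0.0],
    [0.2, 0.0, 0.0, 0.0],
    [0.3, 0.9, 0.0, 0.0],
]))

sum_idx, sum_vals = top_k_by_column_sum(demo, 2)
max_idx, max_vals = top_k_by_column_max(demo, 2)
print("top columns by sum:", sum_idx, sum_vals)
print("top columns by max:", max_idx, max_vals)
```

In the demo, column 2 is nonzero in only one document, so it ranks below column 0 by sum but above it by maximum. If the matrix came from scikit-learn's `TfidfVectorizer`, indexing the array returned by `get_feature_names_out()` (available in recent scikit-learn versions) with these indices should map them back to words.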

