For the tfidf result matrix, I want to get the terms with the top tfidf values. I saw that one can set max_features on the tfidf vectorizer, but that keeps the words with the highest term frequency across the corpus. I still want the highest tfidf values, which could include words with a low tf. One idea I looked up is doing something like tf_idf_matrix.sum(axis=0), which sums each column. This works in my code, but with 113k columns, print won't show them all. If I could use something like argsort() to access the top K column sums, that would be helpful.
This question stems from my original question, which is here.
The reason is that I want to know which words I should look at more closely, and those are not necessarily the ones with the highest frequency. I would also like to know about the "anomalies", that is, words that might not appear in all or many documents/posts but could have a high tfidf in one or a few documents. I am explaining this in case there are other approaches I should consider.
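For the "anomalies", one rough idea is to look at each term's maximum tfidf in any single document together with the number of documents it appears in. The sketch below is only an assumption about how that could be done. The function name rare_but_high_terms and the max_doc_freq threshold are hypothetical, and it assumes the matrix is (or can be converted to) a scipy CSR matrix with no explicitly stored zeros.

```python
import numpy as np
from scipy import sparse


def rare_but_high_terms(matrix, terms, max_doc_freq=2, k=10):
    """Return up to k (term, max tfidf, document count) tuples for terms
    that occur in at most `max_doc_freq` documents, ordered by their
    highest tfidf in any single document."""
    m = sparse.csr_matrix(matrix)
    # Largest value in each column, i.e. the term's best score in one document.
    col_max = m.max(axis=0).toarray().ravel()
    # In CSR format, `indices` holds the column of every stored entry,
    # so counting them gives the number of documents containing each term.
    doc_freq = np.bincount(m.indices, minlength=m.shape[1])
    # Keep only the rare terms, then sort those by their maximum score.
    candidates = np.flatnonzero(doc_freq <= max_doc_freq)
    order = candidates[np.argsort(col_max[candidates])[::-1][:k]]
    return [(terms[i], float(col_max[i]), int(doc_freq[i])) for i in order]
```

With my data I would call it as rare_but_high_terms(tf_idf_matrix, vectorizer.get_feature_names_out()), but I am not sure whether a fixed document-count threshold is a sensible way to define an anomaly.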
Thanks