Creating a TF-IDF matrix in Python

Asked Dec 11 '21 at 19:22

Active Dec 11 '21 at 19:22

Viewed 105 times

I have a list of lists in the form:

[['alice', 'in', 'wonderland', ...], ['the', 'final', 'showdown', ...], ...]

where each inner list holds the word tokens of one document (which I have already processed).

I want to create a term frequency-inverse document frequency (TF-IDF) matrix, but I'm not sure how to go about it.

I'm thinking of using a pandas DataFrame to store the data, but I'm not sure how to iterate over it to compute the TF and IDF values. (I know nltk might have some tools for this.)

Any help would be appreciated!


IVB_CODING

educate me on "frequency and inverse document frequency matrix" – monucool Dec 11 '21 at 19:25
@monucool term frequency counts how many times each word appears in a specific document, and inverse document frequency measures how rare (and therefore how distinctive) a word is across the whole set of documents – IVB_CODING Dec 11 '21 at 19:27
Is this an option? https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html – Ezer K Dec 11 '21 at 20:23
@EzerK I would prefer to use something from the nltk package instead. Do you know of any alternative methods? I'll try this out, though, and see how it goes – IVB_CODING Dec 11 '21 at 20:37
Does this help you? https://stackoverflow.com/questions/29570207/does-nltk-have-tf-idf-implemented – kkgarg Dec 11 '21 at 22:50
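
For reference, the TF and IDF definitions from the comments can be computed directly on a list of token lists using only the standard library. This is a minimal sketch, not the nltk or scikit-learn implementation: the function name `tf_idf_matrix`, the toy documents, and the choice of TF as count divided by document length and IDF as log(N / df) are all assumptions made for illustration. Other libraries use different smoothing and normalization.

```python
import math
from collections import Counter


def tf_idf_matrix(docs):
    """Return (vocab, matrix); matrix[i][j] is the weight of vocab[j] in docs[i].

    Hypothetical helper, assuming tf = count / doc length and idf = log(N / df).
    """
    n_docs = len(docs)
    # One column per distinct token, in sorted order.
    vocab = sorted({tok for doc in docs for tok in doc})
    # Document frequency: in how many documents each token occurs at least once.
    df = Counter(tok for doc in docs for tok in set(doc))
    idf = {tok: math.log(n_docs / df[tok]) for tok in vocab}

    matrix = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc) or 1  # avoid dividing by zero for an empty document
        matrix.append([counts[tok] / total * idf[tok] for tok in vocab])
    return vocab, matrix


# Toy data made up for this example, not the asker's real documents.
docs = [
    ["alice", "in", "wonderland"],
    ["the", "final", "showdown"],
    ["alice", "in", "the", "final", "showdown"],
]
vocab, matrix = tf_idf_matrix(docs)
```

If a pandas DataFrame is wanted, `pd.DataFrame(matrix, columns=vocab)` should give one row per document and one column per term. Note that scikit-learn's `TfidfVectorizer` smooths the IDF and L2-normalizes each row by default, so its numbers will differ from this sketch.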
