tf-idf in python TfidfVectorizer

Question

I am trying to implement tf-idf in Python using sklearn.

Here's what I have so far:

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["This is very strange",
          "This is very nice"]
vectorizer = TfidfVectorizer(min_df=1)
X = vectorizer.fit_transform(corpus)
idf = vectorizer.idf_
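# get_feature_names() was removed in recent scikit-learn releases;
# get_feature_names_out() is its replacement
dic = dict(zip(vectorizer.get_feature_names_out(), idf))
print(dic)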

Now, when I change my corpus to my original dataset, which is like this:

corpus = [["This is very strange"],
          ["This is very nice"]]

and code to this:

vectorizer = TfidfVectorizer(min_df=1)
f = list()
for doc in corpus:
    X = vectorizer.fit_transform(doc)
    idf = vectorizer.idf_
    dic = dict(zip(vectorizer.get_feature_names_out(), idf))
    f.append(dic)
print(f)

It won't work.

So basically, my documents are now in a 2D list, whereas originally I had a 1D list of documents.

After calculating tf-idf, I will run a classifier on the result.

How should I get my tf-idf working?

I thought of that. But I have to apply a classifier to these documents, and I want to maintain the list of documents so I can get the labels for those documents. — nirvair, Jun 12 '17 at 22:38
I would probably first expand the list of lists into `docs = [[doc1, label1], ... [docN, labelN]]`, then select only the documents with `docs_ = [d[0] for d in docs]`, and then use `TfidfVectorizer`. There are multiple ways to explode a list of lists, e.g. in pandas: [1](https://stackoverflow.com/questions/38231591/splitting-dictionary-list-inside-a-pandas-column-into-separate-columns), [2](https://stackoverflow.com/questions/14745022/pandas-dataframe-how-do-i-split-a-column-into-two) — titipata, Jun 12 '17 at 22:52
But I still won't get [['d1_idf1', 'd1_idf2', ... 'd1_idfn'], ['d2_idf1', 'd2_idf2', ... 'd2_idfn']]. With that structure I would know the idf of each term in every document, which would help me further in classification. — nirvair, Jun 13 '17 at 10:27
The corpus argument must be a 1D list: an iterable which yields either str, unicode or file objects. What is your classification task after this step? Maybe it can work with this data structure. — Mahmood Kohansal, Jun 14 '17 at 08:18
Yeah, I got confused. It's a 1D list. And it works. Thanks @titipata — nirvair, Jun 14 '17 at 14:51
