I am trying to implement tf-idf in Python using scikit-learn.

Here's what I have so far:

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["This is very strange",
          "This is very nice"]
vectorizer = TfidfVectorizer(min_df=1)
X = vectorizer.fit_transform(corpus)
idf = vectorizer.idf_
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
dic = dict(zip(vectorizer.get_feature_names_out(), idf))
print(dic)

Now, when I change my corpus to my original dataset, which is like this:

corpus = [["This is very strange"],
          ["This is very nice"]]

and the code to this:

vectorizer = TfidfVectorizer(min_df=1)
f = list()
for doc in corpus:
    # each pass refits the vectorizer on a single one-document list
    X = vectorizer.fit_transform(doc)
    idf = vectorizer.idf_
    dic = dict(zip(vectorizer.get_feature_names_out(), idf))
    f.append(dic)
print(f)

It won't work.

So basically, I now have multiple documents in a 2D list, whereas originally I had a 1D list of documents.

After calculating tf-idf, I will apply a classifier to the result.

How should I get my tf-idf working?

nirvair
  • Can you flatten the corpus that you have to be just a list? – titipata Jun 12 '17 at 22:29
  • I thought of that. But I have to apply a classifier to these documents, and I want to keep the list of documents so I can get the labels for them. – nirvair Jun 12 '17 at 22:38
  • I would probably first expand the list of lists into `docs = [[doc1, label1], ... [docN, labelN]]`, then select only the documents with `docs_ = [d[0] for d in docs]`, and then use `TfidfVectorizer`. There are multiple ways to explode a list of lists, e.g. in pandas: [1](https://stackoverflow.com/questions/38231591/splitting-dictionary-list-inside-a-pandas-column-into-separate-columns), [2](https://stackoverflow.com/questions/14745022/pandas-dataframe-how-do-i-split-a-column-into-two) – titipata Jun 12 '17 at 22:52
  • But I still won't get `[['d1_idf1', 'd1_idf2', ... 'd1_idfn'], ['d2_idf1', 'd2_idf2', ... 'd2_idfn']]`. With that I would know the idf of each term in every document, which would help me later in classification. – nirvair Jun 13 '17 at 10:27
  • The corpus argument must be a 1D list: an iterable that yields str, unicode, or file objects. What is your classification task after this step? Maybe it can work with this data structure. – Mahmood Kohansal Jun 14 '17 at 08:18
  • Yeah, I got confused. It's a 1D list. And it works. Thanks @titipata – nirvair Jun 14 '17 at 14:51
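
The flattening suggested in the comments above could look roughly like the sketch below. It is only an illustration under assumptions: the `labels` list and the helper `flatten_with_labels` are hypothetical names that do not appear in the question, and the sketch uses plain Python so it does not depend on any particular scikit-learn version.

```python
# Nested corpus shaped like the one in the question: each inner list
# wraps one (or possibly several) document strings.
nested_corpus = [["This is very strange"],
                 ["This is very nice"]]

# Hypothetical labels, one per inner list (not part of the original question).
labels = ["negative", "positive"]


def flatten_with_labels(nested, labels):
    """Flatten a 2D list of documents into a 1D list of strings,
    repeating each inner list's label for every document it contains."""
    docs, doc_labels = [], []
    for inner, label in zip(nested, labels):
        for text in inner:
            docs.append(text)
            doc_labels.append(label)
    return docs, doc_labels


docs, doc_labels = flatten_with_labels(nested_corpus, labels)
print(docs)
print(doc_labels)
```

The flat `docs` list can then be passed to a single `TfidfVectorizer().fit_transform(docs)` call, so the idf values are computed across all documents at once. Row `i` of the resulting matrix corresponds to `doc_labels[i]`, which keeps the labels aligned for the classifier.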
