I have implemented TF-IDF in a simple program, but I want to calculate the TF-IDF per line rather than for the whole file.
I used TfidfVectorizer from sklearn.feature_extraction.text
and followed this question as an example: "tf-idf feature weights using sklearn.feature_extraction.text.TfidfVectorizer".
This is my code:
from sklearn.feature_extraction.text import TfidfVectorizer

f1 = open('testDB.txt','r')
a = []
for line in f1:
    a.append(line.strip())
f1.close()

f2 = open('testDB1.txt','r')
b = []
for line in f2:
    b.append(line.strip())
f2.close()

for i in range(min(len(a), len(b))):
    vectorizer = TfidfVectorizer(min_df=1)
    X = vectorizer.fit_transform(a, b)
    idf = vectorizer.idf_
    print dict(zip(vectorizer.get_feature_names(), idf))
The text files include:
testDB.txt =
hello my name is tom
epping is based just outside of london football
epping football club is really bad
testDB1.txt =
hello my name is tom
i live in chelmsford and i play football
chelmsford is a lovely city
The output:
{u'based': 1.6931471805599454, u'name': 1.6931471805599454, u'just': 1.6931471805599454, u'outside': 1.6931471805599454, u'club': 1.6931471805599454, u'of': 1.6931471805599454, u'is': 1.0, u'football': 1.2876820724517808, u'epping': 1.2876820724517808, u'bad': 1.6931471805599454, u'london': 1.6931471805599454, u'tom': 1.6931471805599454, u'my': 1.6931471805599454, u'hello': 1.6931471805599454, u'really': 1.6931471805599454}
{u'based': 1.6931471805599454, u'name': 1.6931471805599454, u'just': 1.6931471805599454, u'outside': 1.6931471805599454, u'club': 1.6931471805599454, u'of': 1.6931471805599454, u'is': 1.0, u'football': 1.2876820724517808, u'epping': 1.2876820724517808, u'bad': 1.6931471805599454, u'london': 1.6931471805599454, u'zain': 1.6931471805599454, u'my': 1.6931471805599454, u'hello': 1.6931471805599454, u'really': 1.6931471805599454}
{u'based': 1.6931471805599454, u'name': 1.6931471805599454, u'just': 1.6931471805599454, u'outside': 1.6931471805599454, u'club': 1.6931471805599454, u'of': 1.6931471805599454, u'is': 1.0, u'football': 1.2876820724517808, u'epping': 1.2876820724517808, u'bad': 1.6931471805599454, u'london': 1.6931471805599454, u'tom': 1.6931471805599454, u'my': 1.6931471805599454, u'hello': 1.6931471805599454, u'really': 1.6931471805599454}
As you can see, it calculates the TF-IDF over the whole document for both text files rather than per line. I have tried to do this per line using the for loop, but I cannot figure out the problem.
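As a sanity check on those numbers, here is a small sketch that does not use sklearn at all. It assumes the values come from the smoothed IDF formula ln((1 + n) / (1 + df)) + 1 (sklearn's documented default), with the n = 3 lines of testDB.txt as the documents. The helper name smoothed_idf is made up for illustration. If the assumption holds, the printed dictionary holds IDF values only, not per-line TF-IDF weights, and no word from testDB1.txt is included. That would fit with the second positional argument of fit_transform being the y parameter, which TfidfVectorizer ignores.

```python
import math

# The three lines of testDB.txt, each treated as one document.
lines = [
    "hello my name is tom",
    "epping is based just outside of london football",
    "epping football club is really bad",
]

def smoothed_idf(term, docs):
    """Smoothed IDF: ln((1 + n) / (1 + df)) + 1 (hypothetical helper)."""
    n = len(docs)
    # df = number of documents (lines) that contain the term at least once
    df = sum(1 for doc in docs if term in doc.split())
    return math.log((1.0 + n) / (1.0 + df)) + 1.0

# 'is' is in all 3 lines, 'football' in 2, 'hello' in 1.
for term in ("is", "football", "hello"):
    print(term, smoothed_idf(term, lines))
```

The three results are approximately 1.0, 1.2877 and 1.6931, which are the same values that appear in the output above.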
Ideally, the output would print the TF-IDF per line, e.g.
u'hello': 0.23123, u'my': 0.3123123, u'name': 0.2313213, u'is': 0.3213132, u'tom': 0.3214344
etc.
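To make the desired output concrete, here is a rough sketch of one way it might be produced. It is not a confirmed fix. It assumes each line is treated as its own document, and that the matrix returned by fit_transform (one row per line, one column per term) holds the per-line weights. The helper name per_line_tfidf is hypothetical, the lines of testDB.txt are inlined instead of read from the file, and the sketch is written for Python 3 (note print() and the missing u'' prefixes).

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# The three lines of testDB.txt, inlined so the sketch is self-contained.
lines = [
    "hello my name is tom",
    "epping is based just outside of london football",
    "epping football club is really bad",
]

def per_line_tfidf(docs):
    """Return one {term: weight} dict per input line (hypothetical helper)."""
    vectorizer = TfidfVectorizer(min_df=1)
    matrix = vectorizer.fit_transform(docs)  # sparse, shape (n_lines, n_terms)

    # Newer scikit-learn releases provide get_feature_names_out();
    # older ones only have get_feature_names().
    if hasattr(vectorizer, "get_feature_names_out"):
        terms = vectorizer.get_feature_names_out()
    else:
        terms = vectorizer.get_feature_names()

    result = [{} for _ in range(matrix.shape[0])]
    coo = matrix.tocoo()  # exposes (row, column, value) triples of nonzero entries
    for i, j, weight in zip(coo.row, coo.col, coo.data):
        result[i][terms[j]] = float(weight)
    return result

weights = per_line_tfidf(lines)
for line_weights in weights:
    print(line_weights)
```

With the default settings each row is L2-normalised, so the weights within a line are relative to that line. The IDF part is still computed across all the lines passed in.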
If anyone can help me or give any advice that would be great.