TF-IDF by string line rather than whole text document

Question

I have implemented TF-IDF in a simple program, but I want to calculate it per line rather than over the whole file.

I used `TfidfVectorizer` from `sklearn.feature_extraction.text` and followed this link as an example: tf-idf feature weights using sklearn.feature_extraction.text.TfidfVectorizer

This is my code:

from sklearn.feature_extraction.text import TfidfVectorizer

f1 = open('testDB.txt','r')
a = []  
for line in f1:
    a.append(line.strip())
f1.close()

f2 = open('testDB1.txt','r')
b = []
for line in f2:
    b.append(line.strip())
f2.close()

for i in range(min(len(a), len(b))):
    vectorizer = TfidfVectorizer(min_df=1)
    X = vectorizer.fit_transform(a, b)
    idf = vectorizer.idf_
    print dict(zip(vectorizer.get_feature_names(), idf))

The text files include:

testDB.txt =
hello my name is tom
epping is based just outside of london football
epping football club is really bad

testDB1.txt = 
hello my name is tom
i live in chelmsford and i play football
chelmsford is a lovely city

The output:

{u'based': 1.6931471805599454, u'name': 1.6931471805599454, u'just': 1.6931471805599454, u'outside': 1.6931471805599454, u'club': 1.6931471805599454, u'of': 1.6931471805599454, u'is': 1.0, u'football': 1.2876820724517808, u'epping': 1.2876820724517808, u'bad': 1.6931471805599454, u'london': 1.6931471805599454, u'tom': 1.6931471805599454, u'my': 1.6931471805599454, u'hello': 1.6931471805599454, u'really': 1.6931471805599454}
{u'based': 1.6931471805599454, u'name': 1.6931471805599454, u'just': 1.6931471805599454, u'outside': 1.6931471805599454, u'club': 1.6931471805599454, u'of': 1.6931471805599454, u'is': 1.0, u'football': 1.2876820724517808, u'epping': 1.2876820724517808, u'bad': 1.6931471805599454, u'london': 1.6931471805599454, u'zain': 1.6931471805599454, u'my': 1.6931471805599454, u'hello': 1.6931471805599454, u'really': 1.6931471805599454}
{u'based': 1.6931471805599454, u'name': 1.6931471805599454, u'just': 1.6931471805599454, u'outside': 1.6931471805599454, u'club': 1.6931471805599454, u'of': 1.6931471805599454, u'is': 1.0, u'football': 1.2876820724517808, u'epping': 1.2876820724517808, u'bad': 1.6931471805599454, u'london': 1.6931471805599454, u'tom': 1.6931471805599454, u'my': 1.6931471805599454, u'hello': 1.6931471805599454, u'really': 1.6931471805599454}

As you can see, it does the TF-IDF for the whole documents of both text files rather than per line. I have tried to implement it per line using the for loop, but I cannot figure out the problem.

Ideally the output would print the TF-IDF per line, e.g.:

u'hello': 0.23123, u'my': 0.3123123, u'name': 0.2313213, u'is': 0.3213132, u'tom': 0.3214344

etc.

If anyone can help me or give any advice that would be great.

You don't seem to have a "following link" (you seem to have pasted a second copy of the `import` statement, which I edited out); could you please edit the question to include the URL you wanted to link to? — tripleee, Apr 08 '15 at 11:04
Passing a line instead of an array of lines should be a no-brainer. But it's not clear why you have two files or how the lines in them relate to each other. Should the program ultimately pair up lines from two files, or read all lines from both files and use that as the database from which to calculate the IDF? — tripleee, Apr 08 '15 at 11:08
the two files are to be compared with each other. So the first line of each file will be paired up, the second line in each file will be paired up etc. The program should pair up lines from the two files and calculate the IDF for each paired line. Eg. 'hello my name is tom' from testDB.txt and 'hello my name is tom' from testDB1.txt will be the first pair. If that makes sense? — Zaino, Apr 08 '15 at 18:22

score 1 · Answer 1 · answered Apr 08 '15 at 11:37

Ehm... here you're passing a and b:

for i in range(min(len(a), len(b))):
    vectorizer = TfidfVectorizer(min_df=1)
    X = vectorizer.fit_transform(a, b)
    idf = vectorizer.idf_
    print dict(zip(vectorizer.get_feature_names(), idf))

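But a and b are both whole lists of strings, and the loop never uses i, so every iteration fits on exactly the same data. (The second positional argument of fit_transform is y, which the vectorizer ignores, so b is never even looked at.) What you could do is pass the two lines of each pair as a two-document list:

for i in range(min(len(a), len(b))):
    vectorizer = TfidfVectorizer(min_df=1)
    # fit_transform expects an iterable of documents, so wrap the pair in a list
    X = vectorizer.fit_transform([a[i], b[i]])
    idf = vectorizer.idf_
    # newer scikit-learn releases rename this method to get_feature_names_out()
    print(dict(zip(vectorizer.get_feature_names(), idf)))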
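
Note that idf_ only holds the IDF part, one value per term for the whole fit. The per-line weights the question asks for are in the matrix X returned by fit_transform, one row per document. A minimal sketch of reading them out for each pair, assuming the lines are paired by position as described in the comments (the helper name pair_weights and the inlined sample lines are mine, for illustration only):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Sample lines copied from the question's two files (inlined instead of read from disk).
a = ["hello my name is tom",
     "epping is based just outside of london football",
     "epping football club is really bad"]
b = ["hello my name is tom",
     "i live in chelmsford and i play football",
     "chelmsford is a lovely city"]

def pair_weights(line_a, line_b):
    """Fit TF-IDF on one pair of lines; return one {term: weight} dict per line."""
    vectorizer = TfidfVectorizer()
    X = vectorizer.fit_transform([line_a, line_b])  # sparse matrix, 2 rows
    # Older releases only have get_feature_names(); newer ones get_feature_names_out().
    if hasattr(vectorizer, "get_feature_names_out"):
        terms = vectorizer.get_feature_names_out()
    else:
        terms = vectorizer.get_feature_names()
    # Keep only the terms that actually occur in each line.
    return [{str(t): float(w) for t, w in zip(terms, row) if w > 0}
            for row in X.toarray()]

results = [pair_weights(x, y) for x, y in zip(a, b)]
for i, (weights_a, weights_b) in enumerate(results):
    print(i, weights_a)
    print(i, weights_b)
```

With this sample data the two lines of the first pair are identical, so each of their five terms ends up with the same weight. Single-character tokens such as "i" and "a" do not appear at all, because the default token pattern only keeps words of two or more characters.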

But as mentioned in the comments, it is not clear what you are trying to do...
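
For what it's worth, the numbers printed in the question are plain IDF values, not TF-IDF weights, which is why they look the same on every iteration. As far as I know, with default settings the vectorizer computes the IDF of a term as ln((1 + n) / (1 + df)) + 1, where n is the number of documents and df is the number of documents containing the term. A dependency-free sketch under that assumption, treating each line of testDB.txt as one document (tokenize and smoothed_idf are hypothetical helper names, and the tokenizer only approximates the default one):

```python
import math
import re

# The three lines of testDB.txt from the question; each line is one document.
lines = [
    "hello my name is tom",
    "epping is based just outside of london football",
    "epping football club is really bad",
]

def tokenize(text):
    # Assumption: lowercase the text and keep words of two or more characters.
    return re.findall(r"\b\w\w+\b", text.lower())

def smoothed_idf(docs):
    """Return {term: ln((1 + n) / (1 + df)) + 1} for every term in docs."""
    n = len(docs)
    df = {}
    for doc in docs:
        for term in set(tokenize(doc)):  # count each term once per document
            df[term] = df.get(term, 0) + 1
    return {term: math.log((1.0 + n) / (1.0 + count)) + 1.0
            for term, count in df.items()}

idf = smoothed_idf(lines)
for term in ("is", "football", "hello"):
    print(term, idf[term])
```

These three terms occur in three, two and one of the lines respectively, and the values come out as 1.0, about 1.2877 and about 1.6931, matching the question's output. That is a property of the whole set of lines, so it cannot change from one loop iteration to the next.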

XapaJIaMnu

This looks like what the OP wants, but it doesn't really make any sense. – tripleee Apr 08 '15 at 18:40
Apologies, I seem to have misunderstood tf-idf. – Zaino Apr 13 '15 at 10:54
