Extracting each line from two separate lists to work out cosine similarity

Question

I am not very experienced with Python but am using it for a project. The project involves measuring the similarity of different texts: first processing (cleaning) the text, and then implementing cosine similarity, Jaccard similarity and tf-idf. I've seen a lot of useful information on Google and on Stack Overflow, but if there are any other links or references that could help me, that would be great.

I'm trying to work out the cosine similarity between each tweet in two different text files. For the cosine implementation I have used the structure shown in "How to calculate cosine similarity given 2 sentence strings? - Python".

The two text files, 'prius.txt' and 'lexus.txt', each contain 100 tweets. I have read the lines of each file into two separate lists and am trying to work out the cosine similarity between the tweets in the two files, as pairs.

f1 = open('prius.txt','r')
a = []  
for line in f1:
    a.append(line.strip())
f1.close()

f2 = open('lexus.txt','r')
b = []
for line in f2:
    b.append(line.strip())
f2.close()

E.g. the first tweet in 'prius.txt' will be compared with the first tweet in 'lexus.txt', and so on, until the last (100th) tweet in 'prius.txt' is compared with the last tweet in 'lexus.txt'.

I am having trouble writing a for loop that iterates over each line in the lists and prints the cosine similarity. I think I am nearly there, but I'm stuck. Below is pseudocode of my attempt.

vector1 = text_to_vector(a)
vector2 = text_to_vector(b)

for file1 in a:
    for file2 in b:
        cosine = get_cosine(vector1, vector2)
        print 'Cosine:', cosine

If anyone can help me or advise me that would be great.

pzp · Accepted Answer · 2015-04-02T14:25:17.010

I think this is what you want:

for i in range(min(len(a), len(b))):
    v1, v2 = text_to_vector(a[i]), text_to_vector(b[i])
    cosine = get_cosine(v1, v2)
    print 'Cosine:', cosine

i is just a number that gets incremented from 0 up to, but not including, the length of the smaller list (so in this case its last value would be 99). Then v1 and v2 are the results of calling text_to_vector() on the ith item of a and b (the lists built from the two files) respectively.
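For reference, here is a self-contained sketch of the same pairwise comparison that can be run on its own. It makes a few assumptions: text_to_vector() and get_cosine() are reconstructed here as simple word-count versions in the spirit of the linked question and may differ from the ones you are actually using; pairwise_cosines() is a hypothetical helper name; the two short lists are made-up stand-ins for the file contents; and it uses the Python 3 print() function rather than the Python 2 print statement above. It uses zip() instead of indexing, which pairs the items up and stops at the end of the shorter list.

```python
import math
import re
from collections import Counter

WORD = re.compile(r"\w+")

def text_to_vector(text):
    """Count how often each word occurs in a single string (assumed implementation)."""
    return Counter(WORD.findall(text))

def get_cosine(vec1, vec2):
    """Cosine similarity of two word-count vectors; 0.0 if either one is empty."""
    shared = set(vec1) & set(vec2)
    numerator = sum(vec1[w] * vec2[w] for w in shared)
    norm1 = math.sqrt(sum(c * c for c in vec1.values()))
    norm2 = math.sqrt(sum(c * c for c in vec2.values()))
    denominator = norm1 * norm2
    return numerator / denominator if denominator else 0.0

def pairwise_cosines(a, b):
    """Compare a[0] with b[0], a[1] with b[1], ...; zip stops at the shorter list."""
    return [get_cosine(text_to_vector(x), text_to_vector(y)) for x, y in zip(a, b)]

# Made-up sample data standing in for the lines of the two files.
a = ["the prius is a hybrid car", "great mileage"]
b = ["the lexus is a luxury car", "terrible traffic today"]

for score in pairwise_cosines(a, b):
    print("Cosine:", score)
```

The key point is the same as in the loop above: each tweet is vectorised individually, inside the loop, rather than passing the whole list to text_to_vector() once.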

I'd also recommend that you read the files like so, although your way works too:

# Note: readlines() keeps the trailing '\n' on each line; call .strip() if that matters
with open('prius.txt', 'r') as f1:
    a = f1.readlines()
with open('lexus.txt', 'r') as f2:
    b = f2.readlines()
