I am not very experienced with Python but am using it for a project. The project involves measuring the similarity of different texts: first processing (cleaning) the text, and then implementing cosine similarity, Jaccard similarity and tf-idf. I've seen a lot of useful information on Google and on Stack Overflow, but if there are any other existing links/references which could help me, that would be great.
I'm trying to work out the cosine similarity between each tweet in two different text files. For the cosine implementation I have used the structure shown in "How to calculate cosine similarity given 2 sentence strings? - Python".
The text files 'prius.txt' and 'lexus.txt' each contain 100 tweets. I have read each file into its own list, one tweet per line, and am trying to work out the cosine similarity between the tweets in the two files as pairs.
# Read each file into a list of tweets, one tweet per line
with open('prius.txt', 'r') as f1:
    a = [line.strip() for line in f1]

with open('lexus.txt', 'r') as f2:
    b = [line.strip() for line in f2]
E.g. the first tweet in 'prius.txt' will be compared with the first tweet in 'lexus.txt', and so on until the last (100th) tweet in 'prius.txt' is compared with the last tweet in 'lexus.txt'.
I am having trouble writing a for loop which will iterate over each line in the lists and print the cosine similarity. I understand I am nearly there but am having difficulty. Below is pseudocode of my attempt.
vector1 = text_to_vector(a)
vector2 = text_to_vector(b)

for file1 in a:
    for file2 in b:
        cosine = get_cosine(vector1, vector2)
        print 'Cosine:', cosine
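For reference, a minimal sketch of the pairwise comparison described above. The attempt passes the whole lists a and b to text_to_vector and then nests the loops, so every tweet would be paired with every other tweet (100 x 100) while the same two vectors are reused each time. The sketch instead walks both lists in step with zip and builds a vector per tweet. The text_to_vector and get_cosine functions below are stand-ins written along the lines of the linked question, not a copy of it. The pairwise_cosines helper and the sample tweets are hypothetical, included only so the example runs on its own.

```python
import math
import re
from collections import Counter

WORD = re.compile(r"\w+")


def text_to_vector(text):
    """Turn ONE piece of text into a bag-of-words count vector."""
    return Counter(WORD.findall(text))


def get_cosine(vec1, vec2):
    """Cosine similarity between two word-count vectors."""
    shared = set(vec1) & set(vec2)
    numerator = sum(vec1[w] * vec2[w] for w in shared)
    norm1 = math.sqrt(sum(c * c for c in vec1.values()))
    norm2 = math.sqrt(sum(c * c for c in vec2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0  # an empty tweet shares nothing with any other tweet
    return float(numerator) / (norm1 * norm2)


def pairwise_cosines(a, b):
    """Hypothetical helper: compare the i-th item of a with the i-th item of b."""
    # zip pairs the lists position by position and stops at the shorter one.
    return [get_cosine(text_to_vector(t1), text_to_vector(t2))
            for t1, t2 in zip(a, b)]


# Made-up stand-in data; in the real script a and b come from the two files.
a = ["I love my new prius", "hybrid cars save fuel"]
b = ["I love my new lexus", "the weather is nice today"]

for i, cosine in enumerate(pairwise_cosines(a, b), start=1):
    print("Tweet %d cosine: %.3f" % (i, cosine))
```

With the two 100-line lists this would print 100 values, one per line pair. The sketch does not lowercase or otherwise clean the text, so "Prius" and "prius" count as different words unless the tweets are normalized first.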
If anyone can help or advise me, that would be great.