I tokenized multiple text files and created a tf-idf matrix from that:

        Token 1   Token 2   Token 3
Doc 1   0.00..    0.0002    0.0003
Doc 2   0.00..    ...       ...
Doc 3   ...       ...       ...
...

How do I now formulate a query, say for token 1 and token 3?

How do I then rank them using cosine similarity?

Billal Begueradj
adw

1 Answer

If you are trying to rank the documents, I suggest storing the scores for each Doc in a tuple. I'm not certain of the maths behind cosine similarity, but assuming a function f(x, y) that returns the cosine similarity between x and y, we can apply it to Token 1 against Token 2 and Token 3, following your example:

list_with_scores = []
# Docs: the rows of the tf-idf matrix; f(x, y): the assumed similarity function
for i, doc in enumerate(Docs):
    score1_2 = f(doc[0], doc[1])
    score1_3 = f(doc[0], doc[2])
    # list.append takes a single argument, so pass one tuple
    list_with_scores.append((i, score1_2, score1_3))
# then sort by score1_2
sortedlist1 = sorted(list_with_scores, key=lambda x: x[1])
# similarly, sort by score1_3
sortedlist2 = sorted(list_with_scores, key=lambda x: x[2])

You can also keep the token scores in the tuple if required. The explicit assignments to score1_3 and score1_2 could be removed; they are there for readability, so it is probably better to leave them as is. For more information on the sorting part, see "Sort a list of tuples by 2nd item (integer value)".
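As for the maths: the cosine similarity of two vectors q and d is their dot product divided by the product of their lengths, q·d / (|q| |d|). One common way to use it for ranking is to express the query as a vector over the same tokens as the matrix columns and compare it against each document row. The sketch below assumes that approach; the tf-idf values, the binary query vector, and the names cosine_similarity, tfidf, and query are all made up for illustration and are not taken from the question.

```python
import math

def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # Convention chosen for this sketch: an all-zero vector matches nothing.
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical tf-idf rows: one row per document, columns are Token 1..3.
tfidf = [
    [0.0, 0.2, 0.3],  # Doc 1
    [0.5, 0.0, 0.5],  # Doc 2
    [0.4, 0.4, 0.0],  # Doc 3
]

# Query for Token 1 and Token 3: 1 where the token is wanted, 0 elsewhere.
# (Assumption: an unweighted query; weighting it by idf is another option.)
query = [1.0, 0.0, 1.0]

# Score every document against the query, then sort best match first.
scores = [(i, cosine_similarity(query, row)) for i, row in enumerate(tfidf)]
ranking = sorted(scores, key=lambda pair: pair[1], reverse=True)
ranked_docs = [i + 1 for i, _ in ranking]  # 1-based document numbers
```

With these made-up values, Doc 2 ranks first because its weight is spread evenly over exactly the two queried tokens, so its row points in the same direction as the query.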

Community
Daniel Lee