I followed Fred Foo's explanation in this Stack Overflow question: How to compute the similarity between two text documents?

I have run the following piece of code that he wrote:

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["I'd like an apple",
          "An apple a day keeps the doctor away",
          "Never compare an apple to an orange",
          "I prefer scikit-learn to Orange",
          "The scikit-learn docs are Orange and Blue"]
vect = TfidfVectorizer(min_df=1, stop_words="english")
tfidf = vect.fit_transform(corpus)
pairwise_similarity = tfidf * tfidf.T
print(pairwise_similarity.toarray())

And the result is:

[[1.         0.17668795 0.27056873 0.         0.        ]
 [0.17668795 1.         0.15439436 0.         0.        ]
 [0.27056873 0.15439436 1.         0.19635649 0.16815247]
 [0.         0.         0.19635649 1.         0.54499756]
 [0.         0.         0.16815247 0.54499756 1.        ]]

But what I noticed is that when I set corpus to be:

corpus = ["I'd like an apple",
          "An apple a day keeps the doctor away"]

and run the same code again, I get the matrix:

[[1.         0.19431434]
 [0.19431434 1.        ]]

Thus their similarity changes (in the first matrix, their similarity is 0.17668795). Why is that the case? I am really confused. Thank you in advance!

Petar
  • It calculates similarity using all the words in the corpus, so the result depends on the words in every sentence of the corpus. When the corpus has fewer words, the similarity can be different. – furas May 14 '21 at 16:18
  • If you add the same sentence to the corpus, the results also change, and if you add it twice they change again. It checks not only how similar two sentences are but also how they differ from the rest of the sentences. – furas May 14 '21 at 16:31
  • Oh, okay okay, that makes perfect sense! Thank you for your clear and straightforward explanation. Please put it as an answer and I will accept it! :) have a good one! – Petar May 15 '21 at 06:54

1 Answer

The Wikipedia article on tf-idf shows how it is calculated.

In the idf term, N is the number of documents in the corpus.
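
For reference, one common textbook form of these definitions is sketched below. The notation is an assumption rather than a quote from the article: $f_{t,d}$ is the raw count of term $t$ in document $d$, and $D$ is the corpus.

```latex
\mathrm{tf}(t,d) = \frac{f_{t,d}}{\sum_{t' \in d} f_{t',d}}
\qquad
\mathrm{idf}(t,D) = \log \frac{N}{\left|\{\, d \in D : t \in d \,\}\right|}
\qquad
\mathrm{tfidf}(t,d,D) = \mathrm{tf}(t,d) \cdot \mathrm{idf}(t,D)
```

Note that scikit-learn's TfidfVectorizer does not use exactly this form. With its default settings it takes the raw count as tf, uses the smoothed weight $\mathrm{idf}(t) = \ln\frac{1+N}{1+\mathrm{df}(t)} + 1$, and then scales each document vector to unit L2 length. In both forms N appears inside the idf term.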

So the similarity depends on the total number of documents/sentences in the corpus.

Adding more documents/sentences changes N and can change the document frequency of each term, so the idf weights, and therefore the similarities, change.

Adding the same document/sentence a few times changes the results as well.
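
To see the effect concretely, here is a minimal pure-Python sketch of the calculation. It assumes scikit-learn's default behaviour (smoothed idf, L2 normalisation, tokens of two or more word characters). The helper names and the tiny STOP_WORDS set are made up for illustration; the real English stop list in scikit-learn is much longer.

```python
import math
import re
from collections import Counter

# Assumption: a tiny stand-in for scikit-learn's much longer English stop list.
STOP_WORDS = {"an", "the", "to", "are", "and", "never"}

def tokenize(text):
    # Lowercase, keep tokens of 2+ word characters, drop stop words.
    return [t for t in re.findall(r"\b\w\w+\b", text.lower())
            if t not in STOP_WORDS]

def tfidf_vectors(corpus):
    docs = [Counter(tokenize(d)) for d in corpus]
    n = len(docs)  # N: number of documents in the corpus
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in docs for term in doc)
    # Smoothed idf, which depends on N and on df.
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1 for t in df}
    vectors = []
    for doc in docs:
        weights = {t: count * idf[t] for t, count in doc.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({t: w / norm for t, w in weights.items()})
    return vectors

def cosine(u, v):
    # The vectors are already unit length, so the dot product is the cosine.
    return sum(w * v.get(t, 0.0) for t, w in u.items())

corpus = ["I'd like an apple",
          "An apple a day keeps the doctor away",
          "Never compare an apple to an orange",
          "I prefer scikit-learn to Orange",
          "The scikit-learn docs are Orange and Blue"]

two = tfidf_vectors(corpus[:2])
five = tfidf_vectors(corpus)
dup = tfidf_vectors(corpus[:2] + [corpus[1]])  # second sentence added twice

print(cosine(two[0], two[1]))    # about 0.1943, as in the 2x2 matrix
print(cosine(five[0], five[1]))  # about 0.1767, as in the 5x5 matrix
print(cosine(dup[0], dup[1]))    # differs again once a duplicate is added
```

The first two sentences are the same in every run. Only N and the document frequencies change, which is enough to shift the idf weight of "apple" and of the other terms. With scikit-learn itself, the fitted weights can be inspected through the vectorizer's idf_ attribute.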

furas