
The spaCy similarity works strangely sometimes. If we compare completely equal texts, we get a score of 1.0, but if the texts are only almost equal we can get a score > 1.0. This behavior could harm our code. Why do we get this > 1.0 score, and can we predict it?

import spacy

nlp = spacy.load('en_core_web_md')

def calc_score(text_source, text_target):
    return nlp(text_source).similarity(nlp(text_target))

calc_score('software development', 'Software development')
# 1.0000000155153665
Leonid Ganeline
    That looks like it's due to floating point inaccuracies. You could just clip values to the desired range (e.g. maximum of 1) using something like `numpy.clip`. – xdurch0 Sep 06 '19 at 18:27
  • You don't have *completely equal texts*! The first is **s**oftware and the second **S**oftware. If you use the *same* texts you get a correct 1.0. – Aris F. Sep 07 '19 at 21:12
  • @Aris F. The question is not about a score of 1 for equal texts, but about a score > 1, which could be read as "better than equal", right? We have the full, complete, perfect score of 1 for "completely equal", and yet here is an even better score, > 1, better than perfect! – Leonid Ganeline Sep 09 '19 at 00:22

1 Answer


From https://spacy.io/usage/vectors-similarity:

Identical tokens are obviously 100% similar to each other (just not always exactly 1.0, because of vector math and floating point imprecisions).

Just clip the result with np.clip, as per https://stackoverflow.com/a/13232356/447599.
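
For example, a minimal sketch of the clipping applied to the question's calc_score (assuming the same en_core_web_md model is installed):

import numpy as np
import spacy

nlp = spacy.load('en_core_web_md')

def calc_score(text_source, text_target):
    # Raw cosine similarity can drift slightly outside [-1, 1] due to
    # floating point imprecision, so clip it back into that range.
    raw = nlp(text_source).similarity(nlp(text_target))
    return float(np.clip(raw, -1.0, 1.0))

calc_score('software development', 'Software development')
# 1.0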

Jules G.M.