
The spaCy similarity works strangely sometimes. If we compare completely equal texts, we get a score of 1.0, but if the texts are only almost equal we can get a score > 1.0. This behavior could harm our code. Why do we get this > 1.0 score, and can we predict it?

import spacy

nlp = spacy.load('en_core_web_md')

def calc_score(text_source, text_target):
    return nlp(text_source).similarity(nlp(text_target))

calc_score('software development', 'Software development')
# 1.0000000155153665
Leonid Ganeline
    That looks like it's due to floating point inaccuracies. You could just clip values to the desired range (e.g. maximum of 1) using something like `numpy.clip`. – xdurch0 Sep 06 '19 at 18:27
  • You don't have *completely equal texts*! The first is **s**oftware and the second **S**oftware. If you use the *same* texts you get a correct 1.0. – Aris F. Sep 07 '19 at 21:12
  • @Aris F. The question is not about a score of 1 for equal texts, but about a score > 1, which could be read as "better than equal", right? We have the full, complete, perfect score of 1 for "completely equal", and yet here is an even better score, > 1, better than perfect! – Leonid Ganeline Sep 09 '19 at 00:22

1 Answer


From https://spacy.io/usage/vectors-similarity:

Identical tokens are obviously 100% similar to each other (just not always exactly 1.0, because of vector math and floating point imprecisions).

Just clip the result with np.clip, as per https://stackoverflow.com/a/13232356/447599.
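
For example, a minimal sketch of the clipping applied to the question's calc_score (assuming the same en_core_web_md model is installed):

import numpy as np
import spacy

nlp = spacy.load('en_core_web_md')

def calc_score(text_source, text_target):
    # Raw cosine similarity can drift slightly outside [-1, 1] due to
    # floating point imprecision, so clip it back into that range.
    raw = nlp(text_source).similarity(nlp(text_target))
    return float(np.clip(raw, -1.0, 1.0))

calc_score('software development', 'Software development')
# 1.0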

Jules G.M.