15

I am trying to use the Jaro-Winkler similarity to check whether two strings are similar. I tried two libraries, jellyfish and pyjarowinkler, to compare the words Carol and elephant. The results do not agree:

import jellyfish

jellyfish.jaro_winkler('Carol','elephant') 

returns 0.4416666, while

from pyjarowinkler import distance

distance.get_jaro_distance('Carol','elephant')

returns 0.0 which makes more sense to me.

Is there a bug in one of the two libraries?

bad_coder
turtle_in_mind
  • The implementations seem to be incompatible. `jellyfish.jaro_winkler('test', 'rest')` and `distance.get_jaro_distance('test', 'rest')` produce different outputs. I would find some third library to see which implementation is correct. – Blender Jan 24 '18 at 18:20
  • Just posting this here in case somebody only reads the comment above. Please see my answer below. Jellyfish is correct. I linked the original paper about the Jaro-Winkler distance. – Bierbarbar Mar 12 '21 at 08:58

2 Answers

7

The Jellyfish implementation is correct.

Carol and elephant have no matching prefix, so in this case the Jaro-Winkler distance is equal to the Jaro distance. I calculated the Jaro distance by hand and got the same value that Jellyfish returns. There is an online calculator, but the online calculator is also wrong. Other implementations confirm my calculation, for example the python-Levenshtein package, which also implements the Jaro-Winkler distance, and an implementation on npm. If you would like to compute the score on your own, you can find the paper here
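For anyone who wants to repeat the hand calculation, here is a minimal pure-Python sketch of the textbook Jaro and Jaro-Winkler definitions. It is not the code of jellyfish or any other library. The function names `jaro` and `jaro_winkler` and the parameters `p` and `max_prefix` are my own choices for illustration, and I left out the 0.7 boost threshold that some implementations apply, which makes no difference here because there is no common prefix.

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity, case-sensitive, following the textbook definition."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0

    # Characters only count as matching if they are at most this far apart.
    window = max(max(len1, len2) // 2 - 1, 0)

    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0

    # Count matched characters that appear in a different order;
    # the number of transpositions is half of that count.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order // 2

    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3


def jaro_winkler(s1: str, s2: str, p: float = 0.1, max_prefix: int = 4) -> float:
    """Jaro similarity plus the Winkler bonus for a shared prefix (simplified)."""
    j = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return j + prefix * p * (1 - j)


print(round(jaro_winkler("Carol", "elephant"), 7))  # 0.4416667
```

The match window is 8 // 2 - 1 = 3. The only character that matches within that window is the l (index 4 in Carol, index 1 in elephant), because the a's are 4 positions apart. So there is 1 match and no transposition, and the Jaro value is (1/5 + 1/8 + 1/1) / 3 = 0.44166..., the same number Jellyfish returns.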

Bierbarbar
2

Perhaps worth noting that two different implementations in R seem to match the jellyfish version:

library(stringdist)
> 1 - stringdist("Elephant", "Carol", method = 'jw')
[1] 0.4416667

library(RecordLinkage)
> jarowinkler('Carol','elephant')
[1] 0.4416667
AidanGawronski