I am struggling with the definition of the common characters between two strings when applying the Jaro string similarity algorithm.

Say we have:

 s1 = 'profjohndoe'
 s2 = 'drjohndoe'

By the Jaro similarity definition, the match window is floor(11/2) - 1 = 4, so s1[i] = s2[j] is counted as common if abs(i-j) <= 4.

The mapping matrix (rows are s2, columns are s1) is then:

  p r o f j o h n d o e
d 0 0 0 0 0 0 0 0 0 0 0
r 0 1 0 0 0 0 0 0 0 0 0
j 0 0 0 0 1 0 0 0 0 0 0
o 0 0 1 0 0 1 0 0 0 0 0
h 0 0 0 0 0 0 1 0 0 0 0
n 0 0 0 0 0 0 0 1 0 0 0
d 0 0 0 0 0 0 0 0 1 0 0
o 0 0 0 0 0 1 0 0 0 1 0
e 0 0 0 0 0 0 0 0 0 0 1

Therefore:

char_ins1_canfound_ins2 would be 'rojohndoe' (in their presented order in s1);
char_ins2_canfound_ins1 would be 'rjohndoe' (in their presented order in s2).
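
For reference, a minimal Python sketch that reproduces the matrix and the two strings above. It assumes 0-based indices and the window rule as I stated it; the names `candidate`, `in_s1` and `in_s2` are my own and not part of any library.

```python
s1 = "profjohndoe"
s2 = "drjohndoe"

# Match window: floor(max length / 2) - 1, which is 4 for these strings.
window = max(len(s1), len(s2)) // 2 - 1

# candidate[j][i] is 1 when s2[j] == s1[i] and abs(i - j) <= window.
candidate = [
    [int(c2 == c1 and abs(i - j) <= window) for i, c1 in enumerate(s1)]
    for j, c2 in enumerate(s2)
]

# Characters of s1 that have at least one candidate partner in s2, in s1 order.
in_s1 = "".join(c for i, c in enumerate(s1) if any(row[i] for row in candidate))
# Characters of s2 that have at least one candidate partner in s1, in s2 order.
in_s2 = "".join(c for j, c in enumerate(s2) if any(candidate[j]))

for c2, row in zip(s2, candidate):
    print(c2, *row)
print(in_s1, in_s2)
```

This only marks every candidate pair; it does not pair characters one-to-one, which is why the two strings can end up with different lengths.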

Now I have a case where the two common-character strings have unequal lengths (9 versus 8). How should this be handled?

Applying the 'stringdist' function from the R package 'stringdist' gives the following result:
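
For comparison, here is a sketch of the one-to-one greedy matching that is commonly described for Jaro: each character of the first string claims the first unused equal character of the second string inside the window. This is my reading of the textbook description, not a claim about what any particular library does, and `jaro_matches` is a hypothetical helper name.

```python
def jaro_matches(a, b):
    """Return the matched characters of a and b, each in its own string order."""
    window = max(len(a), len(b)) // 2 - 1
    used = [False] * len(b)          # which positions of b are already paired
    matched_a = []
    for i, ch in enumerate(a):
        lo = max(0, i - window)
        hi = min(len(b), i + window + 1)
        for j in range(lo, hi):
            if not used[j] and b[j] == ch:
                used[j] = True       # pair a[i] with b[j] exactly once
                matched_a.append(ch)
                break
    matched_b = [b[j] for j in range(len(b)) if used[j]]
    return "".join(matched_a), "".join(matched_b)

common1, common2 = jaro_matches("profjohndoe", "drjohndoe")
print(common1, common2)
```

Under this scheme both results always have the same length, because every match consumes one character from each string. For the example it gives 'rojohnde' and 'rjohndoe'.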

> 1 - stringdist('profjohndoe','drjohndoe',method='jw')
[1] 0.7887205

which appears to be:

> 1/3*(8/9+8/11+(8-2)/8)
[1] 0.7887205

The outcome above indicates that stringdist counted a common string of length 8 with 2 transpositions. Following this:

if I trim char_ins1_canfound_ins2 to 'rojohnde', 6 characters are out of position (3 transpositions), which should yield 1/3*(8/9+8/11+(8-3)/8);
if I trim char_ins1_canfound_ins2 to 'rojhndoe', 2 characters are out of position (1 transposition), which should yield 1/3*(8/9+8/11+(8-1)/8).

Neither of these matches the stringdist result.

How does the R stringdist function deal with the above situation?
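
The arithmetic for these two candidate alignments can be checked with a short Python sketch. It assumes the usual convention that the number of transpositions is half the number of positions where the two common strings disagree; `jaro_from_common` is a hypothetical helper, not a stringdist function.

```python
def jaro_from_common(common1, common2, len1, len2):
    """Jaro score from two equal-length common-character strings."""
    m = len(common1)
    # Positions where the two common strings disagree.
    mismatched = sum(x != y for x, y in zip(common1, common2))
    # Transpositions are counted as half the mismatched positions.
    t = mismatched // 2
    score = (m / len1 + m / len2 + (m - t) / m) / 3
    return mismatched, t, score

results = {
    c: jaro_from_common(c, "rjohndoe", 11, 9)
    for c in ("rojohnde", "rojhndoe")
}
for c, (mismatched, t, score) in results.items():
    print(c, mismatched, t, round(score, 7))
```

The two candidates come out at roughly 0.747 and 0.830, while the value reported by stringdist, 0.7887205, corresponds to 2 transpositions and sits between them.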

Millions of thanks!

Steven Beaupré