I am struggling with the definition of the common characters between two strings when applying the Jaro string similarity algorithm.
Say we have
s1 = 'profjohndoe'
s2 = 'drjohndoe'
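For Jaro similarity, the matching window is floor(11/2) - 1 = 4, where 11 is the length of the longer string.
As defined by the algorithm, s1[i] = s2[j] is counted as common if abs(i-j) <= 4.
The mapping matrix (rows are the characters of s2, columns are the characters of s1) is then: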
p r o f j o h n d o e
d 0 0 0 0 0 0 0 0 0 0 0
r 0 1 0 0 0 0 0 0 0 0 0
j 0 0 0 0 1 0 0 0 0 0 0
o 0 0 1 0 0 1 0 0 0 0 0
h 0 0 0 0 0 0 1 0 0 0 0
n 0 0 0 0 0 0 0 1 0 0 0
d 0 0 0 0 0 0 0 0 1 0 0
o 0 0 0 0 0 1 0 0 0 1 0
e 0 0 0 0 0 0 0 0 0 0 1
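To show how I arrived at the matrix, here is a minimal Python sketch of the window rule as I understand it. The variable names are my own, and it only marks equal characters that lie within the window; it does not try to pair characters one-to-one.

```python
s1 = "profjohndoe"
s2 = "drjohndoe"

# Matching window: floor(max length / 2) - 1, which is 4 for lengths 11 and 9.
window = max(len(s1), len(s2)) // 2 - 1

# matrix[j][i] is 1 when s2[j] == s1[i] and the two positions are within the window.
matrix = [
    [1 if s2[j] == s1[i] and abs(i - j) <= window else 0 for i in range(len(s1))]
    for j in range(len(s2))
]

# Print in the same layout as above: s1 across the top, s2 down the side.
print("  " + " ".join(s1))
for ch, row in zip(s2, matrix):
    print(ch + " " + " ".join(str(v) for v in row))
```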
Therefore:
char_ins1_canfound_ins2 would be 'rojohndoe' (in the order the characters appear in s1);
char_ins2_canfound_ins1 would be 'rjohndoe' (in the order the characters appear in s2).
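The two strings can be reproduced with the sketch below. It rests on my own assumption that a character counts as common whenever any equal character exists within the window in the other string, so the same character of the other string may be reused. The helper name found_in_other is made up for illustration.

```python
s1 = "profjohndoe"
s2 = "drjohndoe"
window = max(len(s1), len(s2)) // 2 - 1  # 4

def found_in_other(a, b, window):
    """Characters of a that equal some character of b within the window, in a's order.

    Assumption: characters of b are NOT marked as used, so they can match repeatedly.
    """
    return "".join(
        ch
        for i, ch in enumerate(a)
        if any(ch == b[j] for j in range(max(0, i - window), min(len(b), i + window + 1)))
    )

char_ins1_canfound_ins2 = found_in_other(s1, s2, window)
char_ins2_canfound_ins1 = found_in_other(s2, s1, window)
print(char_ins1_canfound_ins2, len(char_ins1_canfound_ins2))
print(char_ins2_canfound_ins1, len(char_ins2_canfound_ins1))
```

This gives 9 characters for s1 and 8 for s2, which is the mismatch in lengths I am asking about.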
Now I have a case where the two common-character strings have unequal lengths (9 vs. 8). How should I deal with that?
Applying the stringdist function from the R package stringdist gives the following result:
> 1 - stringdist('profjohndoe','drjohndoe',method='jw')
[1] 0.7887205
which appears to be:
> 1/3*(8/9+8/11+(8-2)/8)
[1] 0.7887205
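For reference, this is the formula I am plugging the numbers into, written as a small Python helper. The function name and parameter names are mine; m is the number of common characters and t the number of transpositions.

```python
def jaro_from_counts(m, t, len_a, len_b):
    """Jaro similarity from m common characters and t transpositions."""
    if m == 0:
        return 0.0
    return (m / len_a + m / len_b + (m - t) / m) / 3

# m = 8 and t = 2 reproduce the value reported by stringdist.
score = jaro_from_counts(m=8, t=2, len_a=9, len_b=11)
print(round(score, 7))  # 0.7887205
```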
The above outcome indicates that stringdist counted a common string of length 8 with 2 transpositions. Following from that:
If I trim char_ins1_canfound_ins2 to 'rojohnde', it differs from 'rjohndoe' in 6 positions, i.e. 3 transpositions, which should yield 1/3*(8/9+8/11+(8-3)/8).
If I trim char_ins1_canfound_ins2 to 'rojhndoe', it differs from 'rjohndoe' in 2 positions, i.e. 1 transposition, which should yield 1/3*(8/9+8/11+(8-1)/8).
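The counts for the two trimmed candidates can be checked with the sketch below. It assumes the usual convention that the transposition count is half the number of positions where the two common-character strings differ; the helper name is made up.

```python
def transposition_count(a, b):
    """Return (mismatched positions, transpositions) for two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("common-character strings must have equal length")
    mismatches = sum(x != y for x, y in zip(a, b))
    # Assumed convention: one transposition per pair of mismatched positions.
    return mismatches, mismatches // 2

reference = "rjohndoe"  # char_ins2_canfound_ins1
for candidate in ("rojohnde", "rojhndoe"):
    print(candidate, transposition_count(candidate, reference))
```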
Neither candidate reproduces the (8-2)/8 term in the stringdist result. How does the R stringdist function deal with this situation?
Many thanks!