Anybody aware of a string similarity method that would give the correct results for the below? I'm dealing with alphanumeric IDs where:

  1. A change in the early part of the string matters more than one in the latter part. I guess I could use n-grams, although that might break down when one string has a prefix?
  2. Which character gets substituted matters: changing an "a" to a "b" is less of a change than changing it to a "c".

Levenshtein and Jaro-Winkler don't seem to be doing the right thing.

See example below.

import jellyfish
t1="100"
t21=["100a","a100"] # case 1. expecting: similar, not similar
t22=["101","105","200"] # case 2. expecting: similar, less similar, least similar

fun = jellyfish.levenshtein_distance
print([fun(t1, t) for t in t21]) # all the same
print([fun(t1, t) for t in t22]) # all the same

fun = jellyfish.jaro_winkler_similarity # named jaro_winkler in older jellyfish releases
print([fun(t1, t) for t in t21]) # all the same
print([fun(t1, t) for t in t22]) # all the same

For added fun, a scenario where the first string has a prefix which is essentially irrelevant to the string as an ID but messes up string similarity.

t1="pre-100"
t21=["100a","a100"] # expecting: similar, not similar
t22=["101","105","200"] # expecting: similar, less similar, least similar

fun = jellyfish.levenshtein_distance
print([fun(t1, t) for t in t21]) # picks the wrong one
print([fun(t1, t) for t in t22]) # all the same

fun = jellyfish.jaro_winkler_similarity # named jaro_winkler in older jellyfish releases
print([fun(t1, t) for t in t21]) # picks the wrong one
print([fun(t1, t) for t in t22]) # picks the right one
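
To make the two requirements concrete, here is a minimal sketch of a position-weighted edit distance. It is not a jellyfish function and is only one possible reading of the requirements. The names `strip_prefix`, `sub_cost`, `weighted_edit_distance` and the `decay` parameter are all hypothetical, and the prefix rule (drop a leading run of letters followed by a hyphen) is an assumption based on the "pre-" example.

```python
import re


def strip_prefix(s):
    # Hypothetical cleanup rule: drop a leading alphabetic tag such as "pre-".
    return re.sub(r"^[A-Za-z]+-", "", s)


def sub_cost(x, y):
    # Assumed cost: grows with the gap between character codes and stays
    # below 1.0, so "a" -> "b" is cheaper than "a" -> "c".
    d = abs(ord(x) - ord(y))
    return d / (1.0 + d)


def weighted_edit_distance(a, b, decay=0.5):
    # An edit at index i is scaled by decay**i, so early edits cost more.
    def w(i):
        return decay ** i

    # dp[i][j] = cheapest way to turn a[:i] into b[:j]
    dp = [[0.0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        dp[i][0] = dp[i - 1][0] + w(i - 1)
    for j in range(1, len(b) + 1):
        dp[0][j] = dp[0][j - 1] + w(j - 1)
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            dp[i][j] = min(
                dp[i - 1][j] + w(i - 1),  # delete a[i-1]
                dp[i][j - 1] + w(j - 1),  # insert b[j-1]
                # substitute (free when the characters are equal)
                dp[i - 1][j - 1]
                + w(min(i - 1, j - 1)) * sub_cost(a[i - 1], b[j - 1]),
            )
    return dp[len(a)][len(b)]


t1 = strip_prefix("pre-100")
print([weighted_edit_distance(t1, t) for t in ["100a", "a100"]])
print([weighted_edit_distance(t1, t) for t in ["101", "105", "200"]])
```

With these assumed costs, lower means more similar, and both lists come out in the expected order. The result depends on the choice of `decay` and on the shape of `sub_cost`, so both would need tuning against real IDs.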
citynorman
  • There may be some rule here, but why is `pre-` totally ignorable? If it's because of the hyphen, then first check for an exact match – I suppose that's remotely possible, although your examples don't have any – and if that fails, preprocess the input string, discarding such prefixes and early-rejecting those with alphabetics at the start. – Jongware Jan 12 '18 at 14:47
  • ... (afterthought) assuming there *are* any rules. It should not be extremely hard to write code that works for the input here, but it's a safe bet that you can come up with ten more on which it would *not* work. – Jongware Jan 12 '18 at 14:49
  • Ignorable in the sense that if I manually cleaned the dataset and merge on the ids (my ultimate goal) I would remove it. – citynorman Jan 12 '18 at 15:45
  • Another case where edit distance doesn't give the correct result: https://stackoverflow.com/questions/11980000/best-machine-learning-technique-for-matching-product-strings – citynorman Apr 29 '18 at 16:16
