
I use utl_match.jaro_winkler to compare company names. In most cases it works fine, but sometimes I get strange results.

This, for example, returns 0.62:

SELECT utl_match.jaro_winkler('ГОРОДСКАЯ КЛИНИЧЕСКАЯ БОЛЬНИЦА 18', 'ДИНА') FROM dual;

These names are completely different, both in length and in characters! How can the similarity be 62%?

Another example:

SELECT utl_match.jaro_winkler('ООО МЕГИ', 'МЕГИ') FROM dual;

This returns 0, even though the two strings are very similar!

It feels like I should use something more advanced than just upper() and utl_match.jaro_winkler(), but I have no idea what exactly.

What would you recommend? What are the best practices for comparing two strings? Where can I read about them?

Ruslan
  • Thanks, Руслан Х. String comparison is a rich topic. The approaches involve trade-offs, and different domains have different goals and weigh things differently. For example, some disciplines treat a transposition or inversion as a single operation, even though it can make two strings look very different. I would suggest working out what kind of comparison applies in your situation (phonetic, structural, etc.) and what performance you can tolerate, then evaluating candidates from there. This has some info: https://stackoverflow.com/questions/25540581/difference-between-jaro-winkler-and-levenshtein-distance – alexgibbs Jan 28 '19 at 20:59
  • Thank you, @alexgibbs! That's interesting. – Ruslan Jan 29 '19 at 07:10
  • I've decided to use utl_match.edit_distance_similarity, since it gives much better results. – Ruslan Jan 29 '19 at 10:26
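
For readers puzzled by the two results in the question, here is a minimal Python sketch of the textbook Jaro-Winkler algorithm. It is not Oracle's code, and it is only an assumption that utl_match.jaro_winkler follows the same steps, although the sketch gives values consistent with both numbers reported above. The function names and the 0.1 prefix scaling factor are illustrative choices. Two things stand out: a character counts as a match only if its position in the other string is within a window of max(len1, len2) // 2 - 1, and one of the three averaged terms is matches / len(shorter string), which reaches 1.0 as soon as every character of a short string is found somewhere in the long one.

```python
def jaro(s1: str, s2: str) -> float:
    """Textbook Jaro similarity in [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0

    # Two equal characters match only if their positions differ by <= window.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0

    # Matched characters that line up in a different order count as
    # half a transposition each (integer division, as in common versions).
    seq1 = [c for c, m in zip(s1, matched1) if m]
    seq2 = [c for c, m in zip(s2, matched2) if m]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2

    return (matches / len1
            + matches / len2
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s1: str, s2: str, scaling: float = 0.1) -> float:
    """Jaro similarity boosted for a common prefix of up to 4 characters."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return base + prefix * scaling * (1 - base)


if __name__ == "__main__":
    # All 4 letters of the short name occur within the window of the long one.
    print(round(jaro_winkler("ГОРОДСКАЯ КЛИНИЧЕСКАЯ БОЛЬНИЦА 18", "ДИНА"), 2))  # 0.62
    # Window is 8 // 2 - 1 = 3, but every shared letter is shifted by 4.
    print(jaro_winkler("ООО МЕГИ", "МЕГИ"))  # 0.0
```

In the first pair, the score is (4/33 + 4/4 + 3/4) / 3, so the short string alone contributes most of the 0.62. In the second pair, the "ООО " prefix shifts every shared letter by four positions, one more than the window of three allows, so no character matches at all. Under this reading, Jaro-Winkler is a poor fit for names of very different lengths or names with optional prefixes.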
