How to improve comparison quality when using utl_match.jaro_winkler?

Question

I use utl_match.jaro_winkler in order to compare company names. In most cases it works fine, but sometimes I get pretty weird results.

This, for example, returns 0.62:

SELECT utl_match.jaro_winkler('ГОРОДСКАЯ КЛИНИЧЕСКАЯ БОЛЬНИЦА 18', 'ДИНА') FROM dual;

These names are completely different, both in length and in characters! How can the similarity be 62%?

Another example:

SELECT utl_match.jaro_winkler('ООО МЕГИ', 'МЕГИ') FROM dual;

This returns 0, even though the two strings are very similar!

It feels like I should use something more complicated and advanced than just upper() and utl_match.jaro_winkler(). But I have no idea what exactly.

What would you recommend? What are the best practices for comparing two strings? Where can I read about it?

Thanks

Руслан Х.

String comparison is a rich topic. The approaches involve trade-offs, and different domains have different goals and weigh things differently. For example, in some disciplines, transpositions or inversions can be treated as single operations, even though they make strings look very different. I would suggest working out what kind of comparison applies in your situation (phonetic, structural, etc.) and what your performance tolerances are, then evaluating candidates from there. This has some info: https://stackoverflow.com/questions/25540581/difference-between-jaro-winkler-and-levenshtein-distance — alexgibbs, Jan 28 '19 at 20:59
I've decided to use utl_match.edit_distance_similarity since it gives much better results — Ruslan, Jan 29 '19 at 10:26
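
To check the switch described in the comment above, a query along these lines puts both scores side by side for the two pairs from the question. It is only a sketch: the column aliases (s1, s2, jw_score, ed_similarity) are made up for illustration, and the results will depend on your data. Note the different scales: utl_match.jaro_winkler returns a value between 0 and 1, while utl_match.edit_distance_similarity returns an integer between 0 and 100.

```sql
-- Compare both UTL_MATCH scores for the string pairs from the question.
-- Column aliases are illustrative only.
SELECT s1,
       s2,
       utl_match.jaro_winkler(s1, s2)             AS jw_score,      -- scale 0..1
       utl_match.edit_distance_similarity(s1, s2) AS ed_similarity  -- scale 0..100
FROM (
  SELECT 'ГОРОДСКАЯ КЛИНИЧЕСКАЯ БОЛЬНИЦА 18' AS s1, 'ДИНА' AS s2 FROM dual
  UNION ALL
  SELECT 'ООО МЕГИ', 'МЕГИ' FROM dual
);
```

Running the same query over a sample of real company names is probably the quickest way to see which function, and which threshold, fits the data.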

0 Answers