4

I'm trying to join a set of county names from one table with county names in another table. The issue here is that, the county names in both tables are not normalized. They are not same in count; also, they may not be appearing in similar pattern always. For instance, the county 'SAINT JOHNS' in "Table A" may be represented as 'ST JOHNS' in "Table B". We cannot predict a common pattern for them.

That means , we cannot use "equal to" (=) condition while joining. So, I'm trying to join them using the JARO_WINKLER_SIMILARITY function in oracle. My Left Outer Join condition would be like:

Table_A.State = Table_B.State 
AND UTL_MATCH.JARO_WINKLER_SIMILARITY(Table_A.County_Name,Table_B.County_Name)>=80

I've given the measure 80 after some testing of the results and it seemed to be optimal. Here, the issue is that I'm getting set of "false Positives" when joining. For instance, if there are some counties with similarity in names under the same state ("BARRY'and "BAY" for example), they will be matched if the measure is >=80. This creates inaccurate set of joined data. Can anyone please suggest some work around?

Thanks, DAV

APC
  • 144,005
  • 19
  • 170
  • 281
Dav KR
  • 51
  • 1
  • 4
  • Maybe you create a table of known false positives and use a `and (Table_A.State, Table_A.County_Name, Table_B.County_Name) not in (select state, a, b from known_false_positives)` check? The other possibility would be to replace JARO_WINKLER_SIMILARITY with your own implementation that does all the necessary magic... – Erich Kitzmueller Apr 28 '17 at 10:17
  • Thanks for comment. The issue is we need to go through each record to know the false positives. Just now I found out "MONTGOMERY" matches against "MONROE" in same state with a measure of 84. – Dav KR Apr 28 '17 at 10:24
  • Another way to reach you goal could be to use a normalizer function first (this would e.g. replace 'SAINT ' with 'ST ') and then use the JARO_WINKLER_SIMILARITY with a higher threshold. – Erich Kitzmueller Apr 28 '17 at 10:30
  • Yes.. I'm using set of normalizations. normalizing those names with 'SAINT' was one of them. But, still there're set of counties creating issue like "MONTGOMERY" matches against "MONROE". For such issues, its difficult to get a generalized normalization. – Dav KR Apr 28 '17 at 10:33
  • Can you get an official list of all counties in your country? If so, you might first want to match each county in Table_A and Table_B to the appropriate official name (hopefully that with the highest jaro_winkler_similarity is the right one) and than easily compare A and B. – Erich Kitzmueller Apr 28 '17 at 10:37
  • Table_A is the official list. Actually, we need to build a cross reference table to lookup the county names used by different systems. Table_B is one of those systems. There are other systems data from Table-C, Table_D as well. The real requirement is to join these tables against the official (Table_A) data so that we can get a cross reference table with county names from A, B, C and D. The data in tables C, D would also be having similar issues as they are also not in any standardized/normalized form. – Dav KR Apr 28 '17 at 10:46
  • 1
    What happens if you filter to retain only the highest-ranked match above 80 for each row in Table A? Are the results usable then? – William Robertson Apr 28 '17 at 10:58
  • I agree with @WilliamRobertson; if the answer is "no", i.e. a wrong match (often) scores better than the correct match, then the JARO_WINKLER_SIMILARITY is probably not a good choice for your requirement. – Erich Kitzmueller Apr 28 '17 at 11:02
  • @William Robertson, @ammoQ; Thanks for your suggestion. Can you plz help me to build a query that will lookup Table_A for each record in Table B/C/D, and match against the county name in A with highest ranked similarity that is >=80 like William said. I'm sorry; but, I'm not much expertised in SQL – Dav KR Apr 28 '17 at 11:12
  • 1
    Have you tried also adding in a comparison of the SOUNDEX value for each county. E.g. MONROE/MONTGOMERY match at 84 but the SOUNDEX is different. Essentially your problem is you want an accurate match using 'fuzzy' algorithms which are unlikely to always have that accuracy. – BriteSponge Apr 28 '17 at 11:36
  • @BriteSponge is right. Any one fuzzy algorithm will always produce false positives (and indeed false negatives). It's worth having more than one in play. SOUNDEX is creaky but it's an Oracle built-in, so easy. The web has implementations for Metaphone and Double Metaphone, which are evolutions of Soundex, and can be implemented in PL/SQL. The advantage of going down that route is we can tweak the algorithms to include bespoke rules. – APC Apr 28 '17 at 11:46

2 Answers2

3

Can you plz help me to build a query that will lookup Table_A for each record in Table B/C/D, and match against the county name in A with highest ranked similarity that is >=80

Oracle Setup:

CREATE TABLE official_words ( word ) AS
  SELECT 'SAINT JOHNS' FROM DUAL UNION ALL
  SELECT 'MONTGOMERY' FROM DUAL UNION ALL
  SELECT 'MONROE' FROM DUAL UNION ALL
  SELECT 'SAINT JAMES' FROM DUAL UNION ALL
  SELECT 'BOTANY BAY' FROM DUAL;

CREATE TABLE words_to_match ( word ) AS
  SELECT 'SAINT JOHN' FROM DUAL UNION ALL
  SELECT 'ST JAMES' FROM DUAL UNION ALL
  SELECT 'MONTGOMERY BAY' FROM DUAL UNION ALL
  SELECT 'MONROE ST' FROM DUAL;

Query:

SELECT *
FROM   (
  SELECT wtm.word,
         ow.word AS official_word,
         UTL_MATCH.JARO_WINKLER_SIMILARITY( wtm.word, ow.word ) AS similarity,
         ROW_NUMBER() OVER ( PARTITION BY wtm.word ORDER BY UTL_MATCH.JARO_WINKLER_SIMILARITY( wtm.word, ow.word ) DESC ) AS rn
  FROM   words_to_match wtm
         INNER JOIN
         official_words ow
         ON ( UTL_MATCH.JARO_WINKLER_SIMILARITY( wtm.word, ow.word )>=80 )
)
WHERE rn = 1;

Output:

WORD           OFFICIAL_WO SIMILARITY         RN
-------------- ----------- ---------- ----------
MONROE ST      MONROE              93          1
MONTGOMERY BAY MONTGOMERY          94          1
SAINT JOHN     SAINT JOHNS         98          1
ST JAMES       SAINT JAMES         80          1
MT0
  • 143,790
  • 11
  • 59
  • 117
1

Using some made up test data inline (you would use your own TABLE_A and TABLE_B in place of the first two with clauses, and begin at with matches as ...):

with table_a (state, county_name) as
     ( select 'A', 'ST JOHNS' from dual union all
       select 'A', 'BARRY' from dual union all
       select 'B', 'CHEESECAKE' from dual union all
       select 'B', 'WAFFLES' from dual union all
       select 'C', 'UMBRELLAS' from dual )
   , table_b (state, county_name) as
     ( select 'A', 'SAINT JOHNS' from dual union all
       select 'A', 'SAINT JOANS' from dual union all
       select 'A', 'BARRY' from dual union all
       select 'A', 'BARRIERS' from dual union all
       select 'A', 'BANANA' from dual union all
       select 'A', 'BANOFFEE' from dual union all
       select 'B', 'CHEESE' from dual union all
       select 'B', 'CHIPS' from dual union all
       select 'B', 'CHICKENS' from dual union all
       select 'B', 'WAFFLING' from dual union all
       select 'B', 'KITTENS' from dual union all
       select 'C', 'PUPPIES' from dual union all
       select 'C', 'UMBRIA' from dual union all
       select 'C', 'UMBRELLAS' from dual )
   , matches as
     ( select a.state, a.county_name, b.county_name as matched_name
            , utl_match.jaro_winkler_similarity(a.county_name,b.county_name) as score
       from   table_a a
              join table_b b on b.state = a.state  )
   , ranked_matches as
     ( select m.*
            , rank() over (partition by m.state, m.county_name order by m.score desc) as ranking
       from   matches m
       where  score > 50 )
select rm.state, rm.county_name, rm. matched_name, rm.score
from   ranked_matches rm
where  ranking = 1
order by 1,2;

Results:

STATE COUNTY_NAME MATCHED_NAME      SCORE
----- ----------- ------------ ----------
A     BARRY       BARRY               100
A     ST JOHNS    SAINT JOHNS          80
B     CHEESECAKE  CHEESE               92
B     WAFFLES     WAFFLING             86
C     UMBRELLAS   UMBRELLAS           100

The idea is matches computes all scores, ranked_matches assigns them a sequence within (state, county_name), and the final query picks all the top scorers (i.e. filters on ranking = 1).

You may still get some duplicates as there is nothing to stop two different fuzzy matches scoring the same.

William Robertson
  • 15,273
  • 4
  • 38
  • 44
  • Damn, MTO beat me to it. Mine attempts to avoid repeating the JWS function using an extra nesting level, but I have no idea whether that will make any difference to performance in practice. – William Robertson Apr 28 '17 at 12:06