I have dirty data from 2 different sources. I'm looking for some best practices around matching them. Here are some examples of the data:
Source1.Name Source2.Name
Adda Clevenger Jr Prep School Adda Clevenger Junior Preparatory School
Alice Fong Yu Alt School Alice Fong Yu Alternative School
Convent Of Sacred Heart Es Convent of Sacred Heart Elementary School
Rosa Parks Elementary School Rosa Parks Elementary School
A human can see that these 4 examples should match up with an ideal fuzzy match. I have at my disposal excellent software for traditional fuzzy matching which will catch typos and other small variations. But in this data set I have about a dozen rules governing abbreviations like 'Preparatory' -> 'Prep'. I would like to capture all of those rules in the query. (Then I'll deal with the more traditional fuzziness separately.)
Is there a well-known SQL pattern for handling this requirement? It could be as simple as learning the magic keyword which will unlock examples in my searches. It's a sort of 'translation table' or 'abbreviation table', but I just made those terms up. I haven't found the widely accepted term yet.
Conceptually my goal is to start from this naive query:
/* This succeeds for 1 record and fails for 3 in the sample data set above. */
SELECT * FROM ...
WHERE Source1.Name = Source2.Name
Then modify it into something that gets all of the desired matches shown above. I expect that I could brute-force it with some nested REPLACE functions:
/* This works for the 4 samples given */
SELECT * FROM ...
WHERE
REPLACE( REPLACE( REPLACE( Source1.Name, 'Preparatory', 'Prep' ), 'Alternative', 'Alt' ), 'Elementary School', 'Es' )
= REPLACE( REPLACE( REPLACE( Source2.Name, 'Preparatory', 'Prep' ), 'Alternative', 'Alt' ), 'Elementary School', 'Es' )
This does not feel elegant. It has increasing ugliness as I account for inconsistent abbreviations (e.g. 'International' is sometimes 'Intl' and sometimes 'Int''l'). And it's not particularly smooth for overlapping abbreviations (e.g. 'Elementary School' -> 'Es' but in other cases 'School' -> 'Sch').
How have others solved this?
Note: I'm using Oracle. I would likely use REGEXP_REPLACE rather than REPLACE. I would certainly use UPPER (or LOWER) to avoid case issues. But those details aren't core to the issue.