Using SQL for deduplication, SELECT similar fields

Question

I have an access orders database containing 500+ customer names in a field called "CustomerName"

The problem is that as orders get entered, our sales team sometimes types the names differently. ("ACME Inc" vs "ACME, Inc.") So now I have several customers who are actually the same company, but appear differently.

So far I've written a query that shows distinct CustomerName values, and now I need a query that can basically go through all of these and show me where the names are similar. Perhaps if the first ten letters of the name matches? Or ideally, if a certain percentage of the letters of the name match.

I'm at a loss on how to do this, any help is appreciated.

Thanks so much!

look up 'hamming distance' or 'levenshtein distance'. both are v.useful in figuring out how "different" two strings are. 'acme inc' and 'acme inc.' would have very short ham/lev distances. — Marc B, Jul 17 '13 at 18:18
Going forward, try normalizing your data. There's a reason people do that... — Johnny Bones, Jul 17 '13 at 18:25

j_mcnally · Answer 1 · 2013-07-17T18:25:26.003

1

you could split all the customer columns on " " and join them back with % in a like clause Which would help find ACME, Inc because it would match %ACME%Inc% along with ACME Inc

Its hard for me to test this out without a huge company db with dupes but maybe that idea will help.

edited Jul 17 '13 at 18:25

answered Jul 17 '13 at 18:18

j_mcnally

6,928
2
31
46

also strip out punctuation like . and , - then convert to all upper case - possible even strip out common abbreviations like Inc, LLC, Corp – Randy Jul 17 '13 at 18:53

score 1 · Answer 2 · edited May 23 '17 at 11:49

This depends on how the names vary. If its simply just punctuation, case etc. you can strip those and do a compare based on that. If its more complicated e.g. "Inc, ACME" vs "ACME Incorporated" you would have to write a function to compare the two strings to score/rank them.

Finding how similar two strings are

The above link is to another SO question that talks about the different algorithms generally used to do this.

Levenshtein Distance in Excel

The above link is to a SO question that has an implementation of one of the methods in VBA

score 1 · Answer 3 · answered Jul 17 '13 at 18:46

You only have 500 customer names. Take the unique names and put them in Excel. Sort them and assign the "real" customer name to them.

This should take less than an hour.

Then, upload this to a new table in your database. The new table would have the "real customer name" and the "alternative names". When accessing the table, do a join to get the real name.

This fixes the immediate problem.

Then, you can update the data in the database to use the "real" name.

And, you can change the input method to one of the following:

Have the sales people choose from a list of valid names. Or, go through a process to create a new customer.
Reject data being uploaded into the system if the customer names are not in the "real" list. Then manually add the name, if a real new customer, and reload the data.

You are dealing with 500 names, not 500,000 names. A manual process is quite feasible. Trying to get SQL to do match similar names is hard. You will spend more time doing that than actually fixing the problem.

score 0 · Answer 4 · answered Jul 17 '13 at 18:24

This syntax is for sql server. The logic will be the same for access. I'm not sure about the syntax. It's a self join.

select your fields
from yourtable t1 join yourtable t2
on left(t1.namefield, 10) = left(t1.namefield, 10)
and t1.namefield <> t2.namefield

That gives you records where the first 10 characters are the same. However, for what you are attempting, I don't think it's sufficiently thorough. I think you have to go through your entire db.

Also, you might want to consider fixing the future before you fix the past. In other words, do something to prevent recurrence so that you only have to do this job once.

Using SQL for deduplication, SELECT similar fields

4 Answers4