
I'm trying to normalize a table based on address and account name. The issue is that addresses may be truncated, abbreviated, or contain errors. I did a GROUP BY on address for the account, which got me closer, but I still have thousands of records with duplicates.

I want the addresses to be grouped and to have a unique account name for each group of addresses. I'm going to take the duplicates and add them as contacts for the account, using the account as my foreign key (sketched after the query below).

Here's what I have so far, but it isn't nearly close enough. Someone must have done this before. I very much appreciate any tips.

    -- Rows whose address appears more than once get a shared placeholder
    -- account name; rows with a unique address keep their clinic name.
    SELECT f.*,
           CASE
               WHEN k.Placeholder IS NULL THEN f.Clinic_Name
               ELSE k.Placeholder
           END AS AccountName
    FROM [FL_Data].[dbo].[26K] f
    LEFT JOIN
        ( SELECT fl.Address_1,
                 'Placeholder-' + CAST(MIN(fl.id) AS varchar(20)) AS Placeholder
          FROM [FL_Data].[dbo].[26K] fl
          GROUP BY fl.Address_1
          HAVING COUNT(*) > 1
        ) k
        ON f.Address_1 = k.Address_1
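
The follow-up step from above (moving the duplicate rows in as contacts under the surviving account) might look roughly like this sketch. The `Accounts` and `Contacts` tables and the `#Resolved` temp table are assumptions for illustration only, where `#Resolved` stands for the output of the query above materialized with its `id`, `AccountName`, `Clinic_Name`, and `Address_1` columns:

    -- Hypothetical schema; adjust names to the real tables.
    -- 1) One account per resolved name.
    INSERT INTO dbo.Accounts (AccountName)
    SELECT DISTINCT r.AccountName
    FROM #Resolved r;

    -- 2) Every source row becomes a contact under its account (AccountId is the FK).
    INSERT INTO dbo.Contacts (AccountId, ContactName, Address_1)
    SELECT a.AccountId, r.Clinic_Name, r.Address_1
    FROM #Resolved r
    JOIN dbo.Accounts a
        ON a.AccountName = r.AccountName;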

Here are some examples of the addresses:

3011 NW 63rd St
3011 NW 63rd Street
3013 Winghaven
3013 WINGHAVEN BLVD
301 Northlake Ave. Ste-101
301 Northlake Avenue Ste.101
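
The variations above (Street vs. St, a missing Blvd, Ave. vs. Avenue, Ste. vs. Ste-) suggest doing some light cleanup before grouping. A minimal sketch, assuming the default case-insensitive collation; this is nowhere near full address standardization, but it collapses the obvious suffix and punctuation variants:

    -- Normalize a few common suffixes and strip periods, then group on the result.
    SELECT ca.CleanAddress,
           COUNT(*) AS Cnt
    FROM [FL_Data].[dbo].[26K] f
    CROSS APPLY
        ( SELECT UPPER(REPLACE(REPLACE(REPLACE(REPLACE(f.Address_1,
                     ' Street', ' St'),
                     ' Avenue', ' Ave'),
                     ' Boulevard', ' Blvd'),
                     '.', '')) AS CleanAddress
        ) ca
    GROUP BY ca.CleanAddress
    HAVING COUNT(*) > 1;

Grouping on `CleanAddress` instead of `Address_1` in the placeholder query above would then merge rows like `3011 NW 63rd St` and `3011 NW 63rd Street`.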
  • Take a peek at this: http://stackoverflow.com/questions/41249742/address-standardization-within-a-database/41249971#41249971 – John Cappelletti Feb 02 '17 at 18:59
  • Can you maybe post some sample data, specifically ones you are having issues with? – Fuzzy Feb 02 '17 at 19:00
  • I think John has pointed me down the path I have to go. Standardizing the addresses seems inevitable. – user1040975 Feb 02 '17 at 19:15
  • I used to work as a direct mail programmer. We had lots of volume, so for large files we used `Code1 Plus` software to standardize addresses. They also had utilities to correct names based on address. You have no idea how many permutations there are of "R O Donnal", "Rose ODonnal", "R. O'Donnal", etc. They also had utilities for de-duplicating the file. After that, we would send the final list out for NCOA (National Change of Address) processing. Addresses are messy. There must be freeware out there now. Good luck finding a solution! – abraxascarab Feb 02 '17 at 19:28
  • I have no idea how many of the variances there are, and currently I have 26,000+ records to normalize. I'm thinking of writing an app to import the rest, as I have about 8k left. Thanks for the tip. I'll look and see if there is a free app. – user1040975 Feb 02 '17 at 20:11
  • Did you already try taking a substring of maybe 15 or 20 characters and then using the SOUNDEX function with a certain offset? This might help to group even if there are minor differences in the name (see the sketch after these comments). – Tyron78 Feb 03 '17 at 12:17
  • Yes, it would work well for some cases and would break others. – user1040975 Feb 06 '17 at 16:09
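
Tyron78's SOUNDEX suggestion from the comments, as a rough sketch. It only yields candidate groups for manual review; as the follow-up comment says, it helps some cases and breaks others (SOUNDEX is letter-oriented, so the leading house numbers weaken it):

    -- Candidate duplicate groups keyed on the SOUNDEX of a 15-character prefix.
    SELECT SOUNDEX(LEFT(f.Address_1, 15)) AS SoundexKey,
           MIN(f.Address_1)               AS SampleAddress,
           COUNT(*)                       AS Cnt
    FROM [FL_Data].[dbo].[26K] f
    GROUP BY SOUNDEX(LEFT(f.Address_1, 15))
    HAVING COUNT(*) > 1
    ORDER BY Cnt DESC;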

1 Answer


As already mentioned, you can use the Google Maps Geocoding API: https://maps.googleapis.com/maps/api/geocode/json?address=

For addresses that refer to the same place, the output will be identical (for example, the `place_id` or the geometry details). Based on that, you can find the duplicates.

Regarding access, you can write a simple method in C# that receives an address as input and sends it to the Google API. The method can be compiled as a CLR function; that way you can compare all of the rows. It will be a bit time-consuming, but if it's an async process it should be OK.
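
Once the geocoder's result is stored back on the table, finding the duplicates is a plain GROUP BY. A sketch, assuming a hypothetical `PlaceId` column that the CLR function has populated with the API's `place_id`:

    -- Rows that geocode to the same place_id are the same physical address.
    SELECT f.PlaceId,
           MIN(f.id) AS KeepRowId,   -- e.g. the row that becomes the account
           COUNT(*)  AS Cnt
    FROM [FL_Data].[dbo].[26K] f
    WHERE f.PlaceId IS NOT NULL
    GROUP BY f.PlaceId
    HAVING COUNT(*) > 1;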

– sqlandmore.com