5

We have a column for street addresses:

123 Maple Rd.
321 1st Ave.
etc...

Is there any way to match these addresses to a given input? The input would be a street address, but it might not be in the same format. For example:

123 Maple Road
321 1st Avenue

Our first thought is to strip the input of all street terms (rd, st, ave, blvd, etc).

Obviously that won't match reliably all the time. Are there any other ways to try to match street addresses in SQL server?

We can use user defined functions, stored procs and regular old t-sql. We cannot use clr.

Braiam
  • 1
  • 11
  • 47
  • 78
dtc
  • 10,136
  • 16
  • 78
  • 104
  • Sounds like a job for Full Text Searching. What version of SQL Server? – OMG Ponies Jan 19 '10 at 21:45
  • You might want to peruse the USPS's addressing standards doc ... might give you an idea of some of the possibilities. http://pe.usps.gov/text/pub28/welcome.htm – Seth Jan 19 '10 at 21:49

9 Answers9

5

Rather than stripping out the things that can be variable, try to convert them to a "canonical form" that can be compared.

For example, replace 'rd' or 'rd.' with 'road' and 'st' or 'st.' with 'street' before comparing.

Mark Byers
  • 811,555
  • 193
  • 1,581
  • 1,452
  • 2
    'st johns lane' ->'street johns lane' – Andrew Feb 17 '16 at 19:37
  • That's why I go the opposite direction and convert from long to short. So Saint Johns Lane -> St Johns Lane. The biggest problem there is when people use non-standard abbreviations, or a word gets chopped because of a max field length somewhere upstream (you get "Avenu" instead of "Avenue"). There are no silver bullets. – jep Aug 02 '16 at 20:42
5

You may want to consider using the Levenshtein Distance algorithm.

You can create it as a user-defined function in SQL Server, where it will return the number of operations that need to be performed on String_A so that it becomes String_B. You can then compare the result of the Levenshtein Distance function against some fixed threshold, or against some value derived from the length of the strings.

You would simply use it as follows:

... WHERE LEVENSHTEIN(address_in_db, address_to_search) < 5;

As Mark Byers suggested, converting variable terms into canonical form will help if you use Levenshtein Distance.

Using Full-Text Search may be another option, especially since Levenshtein would normally require a full table scan. This decision may depend on how frequently you intend to do these queries.

You may want to check out the following Levenshtein Distance implementation for SQL Server:

Note: You would need to implement a MIN3 function for the above implementation. You can use the following:

CREATE FUNCTION MIN3(@a int, @b int,  @c int)
RETURNS int
AS
BEGIN
    DECLARE @m INT
    SET @m = @a

    IF @b < @m SET @m = @b
    IF @c < @m SET @m = @c

    RETURN @m
END

You may also be interested in checking out the following articles:

Community
  • 1
  • 1
Daniel Vassallo
  • 337,827
  • 72
  • 505
  • 443
  • Great answer. Warning, though. The link to the TSQL implementation is broken. Sometimes it opens in a scam page. – cockypup Mar 18 '16 at 17:06
4

In order to do proper street address matching, you need to get your addresses into a standardized form. Have a look at the USPS postal standards here (I'm asssuming you're dealing with US addresses). It is by no means an easy process if you want to be able to deal with ALL types of US mail addresses. There is software available from companies like QAS and Satori Software that you can use to do the standardization for you. You'll need to export your addresses, run them through the software and then load the database with the updated addresses. There are also third party vendors that will perform the address standardization as well. It may be overkill for what you are trying to do but it's the best way to do it. if the addresses in your database are standardized you'll have a better chance of matching them (especially if you can standardize the input as well).

TLiebe
  • 7,913
  • 1
  • 23
  • 28
  • This is the way we went. After trying (no successfully) to match addresses we finally convinced people that it's best not to go down that road. Instead we are using some type of 3rd party software that standardizes addresses for us. – dtc Feb 11 '10 at 02:03
3

I think the first step for you is to better define how generous or not you're going to be regarding differing addresses. For example, which of these match and which don't:

123 Maple Street
123 Maple St
123 maple street
123 mpale street
123 maple
123. maple st
123 N maple street
123 maple ave
123 maple blvd

Are there both a Maple Street and a Maple Blvd in the same area? What about Oak Street vs Oak Blvd.

For example, where I live there many streets/roads/blvds/ave that are all named Owasso. I live on Owasso Street, which connects to North Owasso Blvd, which connects to South Owasso Blvd. However, there is only one Victoria Ave.

Given that reality, you must either have a database of all road names, and look for the closest road (and deal with the number seperately)

OR

Make an decision ahead of time what you'll insist on and what you won't.

David Oneill
  • 12,502
  • 16
  • 58
  • 70
3

Address matching and deduplication is a messy business. Other posters are correct when they say that the addresses need to be standardized first to the local postal standards authority (The USPS for example if it is a US addresses). Once the addresses are in standard format the rest is easy.

There are several third-party services which will flag duplicates in a list for you. Doing this solely with a MySQL subquery will not account for differences in address formats and standards. The USPS (for US address) has certain guidelines to make these standard, but only a handful of vendors are certified to perform such operations.

So, I would recommend the best answer for you is to export the table into a CSV file, for instance, and submit it to a capable list processor. One such is SmartyStreets' Bulk Address Validation Tool which will have it done for you in a few seconds to a few minutes automatically. It will flag duplicate rows with a new field called "Duplicate" and a value of Y in it.

Try standardizing and validating a couple of addresses here to get an idea for what the output will look like.

Full Disclosure: I work for SmartyStreets

Davin
  • 199
  • 6
  • 12
2

Stripping out data is a bad idea. Many towns will have dozens of variations of the same street - Oak Street, Oak Road, Oak Lane, Oak Circle, Oak Court, Oak Avenue, etc... As mentioned above converting to the canonical USPS abbreviation is a better approach.

Jason
  • 86,222
  • 15
  • 131
  • 146
1

Fuzzy Lookups and Groupings Provide Powerful Data Cleansing Capabilities

Paul Creasey
  • 28,321
  • 10
  • 54
  • 90
1

You could try SOUNDEX to see if that gets you close. http://msdn.microsoft.com/en-us/library/aa259235%28SQL.80%29.aspx

Greg
  • 16,540
  • 9
  • 51
  • 97
0

You may also checkout COMPGED function - https://www.sas.com/content/dam/SAS/support/en/sas-global-forum-proceedings/2018/2487-2018.pdf

  • 1
    As it’s currently written, your answer is unclear. Please [edit] to add additional details that will help others understand how this addresses the question asked. You can find more information on how to write good answers [in the help center](/help/how-to-answer). – Community Feb 11 '23 at 23:52