3

To check the customer address against to our SQL DB address list,to tell the address existed in our DB or not. The customer address comes in different format.

e.g:- 8455 W 23RD st OFFICE 4
     8455 WEST 23RD st OFC 4
     8455 WEST TWENTY-THIRD street OFC 4

If I do string compare,I will get bExist = False;

Note:- I am using C#, The server were the application run has no internet connection. Additional eg:-

APARTMENT   APT
APARTMENT   APARTMENT
BUILDING    BLDG
BUILDING    BUILDING
DEPARTMENT  DEPT
DEPARTMENT  DEPARTMENT
FLOOR   FL
FLOOR   FLOOR
HANGAR  HNGR
HANGAR  HANGAR
LOT LOT
OFFICE  OFC
OFFICE  OFFICE
PIER    PIER
PIER    PIER
ROOM    RM
ROOM    ROOM
SLIP    SLIP
SPACE   SPC
SPACE   SPACE
STOP    STOP
SUITE   STE
SUITE   SUITE
TRAILER TRLR
TRAILER TRAILER
UNIT    UNIT

N   NORTH
S   SOUTH
E   EAST
W   WEST
NE  NORTHEAST
SE  SOUTHEAST
NW  NORTHWEST
SW  SOUTHWEST
NORTH   NORTH
SOUTH   SOUTH
EAST    EAST
WEST    WEST
NORTHEAST   NORTHEAST
SOUTHEAST   SOUTHEAST
NORTHWEST   NORTHWEST
SOUTHWEST   SOUTHWEST
NORTH EAST  NORTHEAST
SOUTH EAST  SOUTHEAST
NORTH WEST  NORTHWEST
SOUTH WEST  SOUTHWEST
Keith Nicholas
  • 43,549
  • 15
  • 93
  • 156
Habesha
  • 465
  • 5
  • 9
  • Please provide some code! (and welcome to stackoverflow :) ) – Odys Jun 13 '13 at 21:59
  • You could use levenshtein distance to calculate similarity. Here's an implementation: http://stackoverflow.com/a/560733/284240 – Tim Schmelter Jun 13 '13 at 22:03
  • You have to do fuzzy matches. Use `Contains`, `StartsWith` and `EndsWith` for a basic system. [Levenshtein distance](https://en.wikipedia.org/wiki/Levenshtein_distance) for a fully featured system. – ChrisF Jun 13 '13 at 22:04
  • 1
    levenshtein won't find much similarity between W and WEST. – H H Jun 13 '13 at 22:04
  • @HenkHolterman: I think that this is one column "Address". So there might be similarity between `8455 W 23RD st OFFICE 4` and `8455 WEST 23RD st OFC 4` although not much. – Tim Schmelter Jun 13 '13 at 22:05
  • Ok, so you've got a set of substitution rules. Now you need a strategic approach. Already wrote something? – H H Jun 13 '13 at 23:14

3 Answers3

1

There are a number of options, if you have SSIS you could use their built in fuzzy matching which is pretty adequate.

I've used a tool from Ambient Concepts: Address Parser It's a collection of functions that will parse and standardize your addresses. They appear to offer a free-trial so you can test it out.

You could also do it from scratch, homogenizing data as best you can. We often separate street number and street name, since the vast majority of differences come from the pre/post directions ie: 12 north 5th st, 12 n 5th st, 12 5th st north, etc. As well as the numbers vs number-words.

Hart CO
  • 34,064
  • 6
  • 48
  • 63
  • I don't have time to code all cases.I like Address Parser.There is also free source code here:- http://usaddress.codeplex.com. Thank you! – Habesha Jun 14 '13 at 20:51
1

You can't even approach success in straight SQL. It's difficult even in a "normal" programming language. There are regional addressing styles to be considered, and so many address variants that people come up with, such as

  • 123 Main St, Apt 3A, ...
  • 123 Main St #3, ...
  • 123 Main St Box 3 ...

All of which reflect the same physical address.

And that doesn't even take into account simple data entry errors such as transposed characters.

Things quickly get very difficult. You'll spend an inordinate amount of time and energy dealing with whatever the next creative data entry mistake somebody makes.

What you want to do is to use the US Post Office's address standardization services prior to recording the address in your database. There are also quite a few 3rd party vendors that the USPS has certified for accuracy.

I worked for a startup for a while that used the Google Maps API to standardize addresses — another approach.

Nicholas Carey
  • 71,308
  • 16
  • 93
  • 135
0

I'd go with something like...

  public static string Substitute(string s)
            {
                var abbrevs = new Dictionary<string, string>();
                abbrevs.Add("OFC", "OFFICE");
                abbrevs.Add("ST", "STREET");
                abbrevs.Add("ST.", "STREET");
                if (abbrevs.ContainsKey(s)) return abbrevs[s];
                return SubstituteWordNumbersForNumerics(s);            
            }

            public static string ToNormalAddressFormat(string address)
            {
                return address.Split(' ').ToList().Select(Substitute).Aggregate((x, y) => x + " " + y);
            }

Then convert all addresses before comparing

Keith Nicholas
  • 43,549
  • 15
  • 93
  • 156
  • I think you will change all street/place names based on "Saint Something" into "Street Something" – H H Jun 13 '13 at 22:29
  • ie, when ContainsKey has "Saint" it won't match to "ST" or "ST." – Keith Nicholas Jun 13 '13 at 22:33
  • What I meant was that "5 Saint Nicolas" and "5 st Nicolas" might be parts of the same address. How does your code deal with that? – H H Jun 13 '13 at 22:34
  • it won't if that's a valid abbreviation for Saint. It assumes no abbreviation is used to mean different things – Keith Nicholas Jun 13 '13 at 22:39
  • I don't think real-world addresses will honor that assumption. See my line about sort/long forms. Converting `"Saint Jonh street"` to `"st John st"` will probably work better. – H H Jun 13 '13 at 22:42
  • It could be done with computational matching if there weren't that many overlaps, otherwise I'd start trying to build an address parser which understands the structure of addresses. However, thinking about it, If, just for matching, you converted all forms Saint to St, then for matching, it would probably work out fine. – Keith Nicholas Jun 13 '13 at 22:44