2

I have a list of 1000 addresses loaded into oracle table.

The full address is in single column CompleteAddress Varchar(1000)

Sample Data:

12003 Main St New York NY 00991
123 ANYWHERE BLVD ABINGDON MD 21009

I need to split all the addresses into Street No + Street Name, City, State and Zip (sometimes zip5+zip4)

There are no commas or slashes in the data. How do I split addresses? I work in C# if this matters. Is RegEx the appropriate approach?

So far I tried to use SubString, but I think thats not going to work well.

string zipcode = completeAddress.Substring(completeAddress.Length - 5, 5);
string mystate = completeAddress.Substring(completeAddress.Length - 8, 2);

Any ideas?

Jerry
  • 70,495
  • 13
  • 100
  • 144
CoolArchTek
  • 3,729
  • 12
  • 47
  • 76
  • 1
    This is a more in depth question than you think, especially considering irregularities addresses (you're right, substring wouldnt work well) and varying formats between countries. See this question which leverages the Google Maps API: http://stackoverflow.com/questions/518210/where-is-a-good-address-parser – tnw Dec 26 '13 at 15:21
  • 1
    This is going to be difficult as you don't have a delimiter. Is there a pattern to the addresses - are they all US format? Are they all in New York? – geedubb Dec 26 '13 at 15:21
  • 5
    You probably don't want to hear this... but with 1000 rows only it'd quicker and SAFER to just run thru your addresses and manually add commas (,) to delimit each address component. After that use a simple Split function – Vland Dec 26 '13 at 15:25
  • @VLand totally agree if this is a one time process. – geedubb Dec 26 '13 at 15:26
  • 1
    there is still a pattern to extract the zipcode and state code, but what is the logic that "Main St New" is not the street Name but "Main St" is and "York" is not the state name but "New York" is? – Victor Mukherjee Dec 26 '13 at 15:27
  • Addresses are all over US. Not just New York – CoolArchTek Dec 26 '13 at 15:28
  • @VLand Today I have 1000, the table gets loaded everyday. tomorrow maybe 10000. :) – CoolArchTek Dec 26 '13 at 15:30
  • @CoolArchTek Is there a way to add commas (or another unique delimiter) between the fields at the source of your data? There is no way to programmatically do this if your data is so free form – geedubb Dec 26 '13 at 15:32
  • @CoolArchTek forget my answer in that case, I hoped (for you) that the list was complete :P – Vland Dec 26 '13 at 17:48

3 Answers3

6

Addresses are complicated. Very complicated. They are highly irregular and subjective things. And logistics companies have spent billions over the course of decades trying to make sense of them.

Better to capitalize on what others have done than to try to re-invent it.

The data you have is actually pretty meaningful. It just doesn't "feel" very meaningful. Businesses like to have their address data split up into lots of little pieces, but why? What do all of those little pieces mean? Why do they need to be distinct from one another? The data you have is an "address". Keep it, but add to it. Make use of existing information to extrapolate more information.

Use a geocoding API (Google? Bing? Some other service? Prices, etc. will vary) to search on the data you have and bring back more strongly-typed data. Store that alongside what you have. For example, you have this:

12003 Main St New York NY 00991

So you make a request here:

http://maps.googleapis.com/maps/api/geocode/json?address=12003+Main+St+New+York+NY+00991&sensor=false

And you get back this:

{
   "results" : [
      {
         "address_components" : [
            {
               "long_name" : "D R Main Street",
               "short_name" : "D R Main Street",
               "types" : [ "point_of_interest", "establishment" ]
            },
            {
               "long_name" : "5",
               "short_name" : "5",
               "types" : [ "street_number" ]
            },
            {
               "long_name" : "West 31st Street",
               "short_name" : "W 31st St",
               "types" : [ "route" ]
            },
            {
               "long_name" : "Midtown",
               "short_name" : "Midtown",
               "types" : [ "neighborhood", "political" ]
            },
            {
               "long_name" : "Manhattan",
               "short_name" : "Manhattan",
               "types" : [ "sublocality", "political" ]
            },
            {
               "long_name" : "New York",
               "short_name" : "New York",
               "types" : [ "locality", "political" ]
            },
            {
               "long_name" : "New York",
               "short_name" : "New York",
               "types" : [ "administrative_area_level_2", "political" ]
            },
            {
               "long_name" : "New York",
               "short_name" : "NY",
               "types" : [ "administrative_area_level_1", "political" ]
            },
            {
               "long_name" : "United States",
               "short_name" : "US",
               "types" : [ "country", "political" ]
            },
            {
               "long_name" : "10001",
               "short_name" : "10001",
               "types" : [ "postal_code" ]
            },
            {
               "long_name" : "4414",
               "short_name" : "4414",
               "types" : []
            }
         ],
         "formatted_address" : "D R Main Street, 5 West 31st Street, New York, NY 10001, USA",
         "geometry" : {
            "location" : {
               "lat" : 40.7468529,
               "lng" : -73.9865046
            },
            "location_type" : "APPROXIMATE",
            "viewport" : {
               "northeast" : {
                  "lat" : 40.7482018802915,
                  "lng" : -73.98515561970851
               },
               "southwest" : {
                  "lat" : 40.7455039197085,
                  "lng" : -73.98785358029151
               }
            }
         },
         "partial_match" : true,
         "types" : [ "point_of_interest", "establishment" ]
      }
   ],
   "status" : "OK"
}

Now that is some meaningful-looking data. Maybe not the "units" of data that somebody in your company thought addresses were made of, but meaningful and useful. For any given address in your data, you can automate this.

Let users enter their address the way they know it. Store that subjective address as the user-entered version. Geocode it to get more structured data to store alongside it.

David
  • 208,112
  • 36
  • 198
  • 279
  • 2
    The Google Geocoding API has the following limits in place: 2,500 requests per 24 hour period. Google Maps API for Business customers have higher limits: 100,000 requests per 24 hour period. More details at https://developers.google.com/maps/documentation/geocoding/ – cilerler Dec 26 '13 at 15:43
3

Given the following address:

123 Street Woodbury TN 37190

After you get the zipcode using your substring code, either with 4 digits or 5

do a request to google maps geocoding api as such using only the zipcode

[http://maps.googleapis.com/maps/api/geocode/json?address=37190&sensor=true][1]

Google Geocoding API will return, the State and the city, compare & match it to your address string and remove it:

You're now left with:

123 Street

You split by space, you get the street number and the street name

Here you go !

using Newtonsoft.Json;
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;

namespace ConsoleApplication1
{
    class Program
    {
        public class GoogleGeoCodeResponse
        {

            public string status { get; set; }
            public results[] results { get; set; }

        }

        public class results
        {
            public string formatted_address { get; set; }
            public geometry geometry { get; set; }
            public string[] types { get; set; }
            public address_component[] address_components { get; set; }
        }

        public class geometry
        {
            public string location_type { get; set; }
            public location location { get; set; }
        }

        public class location
        {
            public string lat { get; set; }
            public string lng { get; set; }
        }

        public class address_component
        {
            public string long_name { get; set; }
            public string short_name { get; set; }
            public string[] types { get; set; }
        }

        private class Address {
            public string StreetNumber { get; set; }
            public string StreeName { get; set; }
            public string City { get; set; }
            public string State { get; set; }
            public string ZipCode { get; set; }
        }



        private static Address ParseAddress(string addressStr) {
            var address = new Address();
            address.ZipCode = addressStr.Split(' ').Last();
            var googleStr = "http://maps.google.com/maps/api/geocode/json?sensor=false&address=" + address.ZipCode;
            var result = new System.Net.WebClient().DownloadString(googleStr);
            var resObj = JsonConvert.DeserializeObject<GoogleGeoCodeResponse>(result);

            address.City = resObj.results[0].address_components[1].long_name;
            address.State = resObj.results[0].address_components[2].short_name;
            addressStr = addressStr.Replace(" " + address.City, "").Replace(" " + address.State, "").Replace(" " + address.ZipCode, "");

            address.StreetNumber = addressStr.Split(' ')[0];
            address.StreeName = addressStr.Split(' ')[1];

            return address;
        }

        static void Main(string[] args)
        {

           var address = ParseAddress("123 Street Woodbury TN 37190");

        }
    }
}
Rami Sakr
  • 372
  • 1
  • 14
1

You said you had a list of 1000 addresses... but then you specified that the table grows every day so you can't just manually add commas to each address.

If you can't geocode them I think I would create other support tables/lists (US cities, US States) and try to isolate the components starting from the right.

  1. get the ZIP, use regex or similar
  2. get the state, match with State table
  3. get the city looking in Cities table, if not there "put on Hold and check manually"
  4. get the address, knowing that the first numeric part is a street number

I'd check the items that I can't find in my support tables and review them manually

Vland
  • 4,151
  • 2
  • 32
  • 43