I'm trying to identify and extract any address (not limited to US addresses, as with SmartyStreets) from a long string of text, using PHP on my XAMPP setup.

I've read several topics and libraries on how to do this, which revolve around using NLP, Google's Geocoding API, and regex. These three links look like plausible starting points: Link 1, Link 2, Link 3/GitHub Library (seems promising).

However, I don't know whether these links will actually help with the implementation. Can anyone help me with it?

Martijn Pieters
Vivian

1 Answer

That is the holy grail of address parsing, for sure. There are a few things to consider when attacking this project. First, each country can have its own addressing format. As nice as it would be, there is no single standard format.

Here are some good compilations of address formats, but even these don't always agree:

Address formats by Informatica

Address formats by Universal Postal Union

Address formats by a guy who has spent a lot of time thinking about this kind of stuff

Step 1 - Once you have become familiar with all the possible address formats for each country, you can group the formats that are similar and create a regex for each group.
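As a rough illustration of Step 1, the sketch below keeps one regex per format group. It is written in Python for brevity; the same patterns should port to PHP's `preg_match_all`. The group names, the country lists, and the patterns themselves are assumptions made for this example, not a vetted classification.

```python
import re

# Hypothetical format groups: countries that put the house number first
# versus countries that put it after the street name.
FORMAT_GROUPS = {
    "number_first": {
        "countries": {"US", "CA", "GB"},
        # e.g. "123 Main St"
        "street_line": re.compile(r"\b\d+\s+[A-Za-z][A-Za-z .'-]*"),
    },
    "street_first": {
        "countries": {"DE", "NL", "AT"},
        # e.g. "Hauptstraße 12"
        "street_line": re.compile(r"\b[^\W\d_][^\d,\n]*?\s\d+[a-zA-Z]?\b"),
    },
}


def group_for(country_code):
    """Return the name of the format group for a country code, or None."""
    for name, group in FORMAT_GROUPS.items():
        if country_code in group["countries"]:
            return name
    return None
```

Keeping the patterns in a table like this means supporting a new country is usually a matter of adding it to an existing group rather than writing a new regex.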

Step 2 - This is critical. Do everything you can to determine the country that the address might pertain to. This will let you know which regex to utilize. If you can't do this, you may end up with many different address candidates.
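One simple way to approach Step 2 is to score the text against a handful of per-country cues, such as country names and postal code shapes. The sketch below (Python, easy to port to PHP) does that; the cue table is a small assumption made for illustration and would need to be far larger in practice.

```python
import re

# Illustrative cues only: country names plus the typical shape of a
# postal code in context. A real table would cover many more countries.
COUNTRY_CUES = {
    "US": [r"\bUSA\b", r"\bUnited States\b",
           r"\b[A-Z]{2}\s+\d{5}(?:-\d{4})?\b"],
    "DE": [r"\bGermany\b", r"\bDeutschland\b",
           r"\b\d{5}\s+[A-ZÄÖÜ][a-zäöüß]+"],
    "GB": [r"\bUnited Kingdom\b",
           r"\b[A-Z]{1,2}\d[A-Z\d]?\s*\d[A-Z]{2}\b"],
}


def guess_countries(text):
    """Return country codes ordered by how many cues matched (best first)."""
    scores = {}
    for country, patterns in COUNTRY_CUES.items():
        hits = sum(len(re.findall(p, text)) for p in patterns)
        if hits:
            scores[country] = hits
    return sorted(scores, key=scores.get, reverse=True)
```

An empty result means no cue fired, which is the case where you would have to try every regex group and accept multiple candidates.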

Step 3 - Using your regex, scan through the source text to determine potential horizons: the start and end points of an address. In the USA, addresses typically begin with a house number and end with a ZIP Code (5 digits, 9 digits with ZIP+4, or 11 digits when the delivery point is included). In Germany, addresses typically begin with a street name and end with a postal code and city.
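For the US case in Step 3, a minimal sketch of horizon detection might look like the following (Python; the pattern carries over to PHP's `preg_match_all`). The length limits and the single-line restriction are assumptions chosen to keep the example small, and the pattern will produce false positives on text such as "5 items shipped to 90210".

```python
import re

# A US candidate opens with a house number and closes with a ZIP or ZIP+4.
US_CANDIDATE = re.compile(
    r"\b\d{1,6}\s+"           # house number opens the candidate
    r"[^\n]{5,80}?"           # street, city, state: kept loose on purpose
    r"\b\d{5}(?:-\d{4})?\b"   # ZIP or ZIP+4 closes it
)


def find_us_candidates(text):
    """Return (start, end, candidate) for each address-like span in text."""
    return [(m.start(), m.end(), m.group())
            for m in US_CANDIDATE.finditer(text)]
```

For example, calling `find_us_candidates` on "Please send it to 1600 Pennsylvania Ave NW, Washington, DC 20500 by Friday." yields a single span covering everything from the house number through the ZIP Code.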

Step 4 - Now scan through that address candidate to determine the various components of the address, based on your understanding of the formatting pattern for that country. Find the following components:

  • primary number
  • street pre-directional (helps to have an index of all the possible values)
  • street name (helps to have an index of all the possible values)
  • street suffix (helps to have an index of all the possible values)
  • street post-directional (helps to have an index of all the possible values)
  • secondary number designator (helps to have an index of all the possible values)
  • secondary number
  • city (helps to have an index of all the possible values)
  • state (helps to have an index of all the possible values)
  • postal code

(there are a lot more, but that's a good start)
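The component breakdown in Step 4 can be sketched for one US layout as follows (Python, portable to PHP). This assumes the candidate has the shape "street line, city, ST ZIP"; the lookup sets are deliberately tiny stand-ins for the indexes of possible values mentioned in the list, and all names are hypothetical.

```python
import re

# Tiny stand-ins for full indexes of possible values.
DIRECTIONALS = {"N", "S", "E", "W", "NE", "NW", "SE", "SW"}
SUFFIXES = {"ST", "AVE", "BLVD", "RD", "DR", "LN", "CT", "WAY"}
UNIT_DESIGNATORS = {"APT", "STE", "UNIT", "#"}


def parse_us_address(candidate):
    """Split 'street line, city, ST ZIP' into components, or return None."""
    parts = [p.strip() for p in candidate.split(",")]
    if len(parts) != 3:
        return None
    street_line, city, state_zip = parts
    m = re.fullmatch(r"([A-Za-z]{2})\s+(\d{5}(?:-\d{4})?)", state_zip)
    tokens = street_line.split()
    if not m or len(tokens) < 2 or not tokens[0].isdigit():
        return None
    result = {"primary_number": tokens.pop(0), "city": city,
              "state": m.group(1).upper(), "postal_code": m.group(2)}
    # Secondary unit: a designator followed by its number ends the street part.
    for i, tok in enumerate(tokens):
        if tok.upper().rstrip(".") in UNIT_DESIGNATORS and i + 1 < len(tokens):
            result["secondary_designator"] = tok
            result["secondary_number"] = tokens[i + 1]
            tokens = tokens[:i]
            break
    # Peel directionals and suffix off the ends; what remains is the name.
    if len(tokens) > 1 and tokens[0].upper() in DIRECTIONALS:
        result["pre_directional"] = tokens.pop(0)
    if len(tokens) > 1 and tokens[-1].upper() in DIRECTIONALS:
        result["post_directional"] = tokens.pop()
    if len(tokens) > 1 and tokens[-1].upper().rstrip(".") in SUFFIXES:
        result["street_suffix"] = tokens.pop()
    result["street_name"] = " ".join(tokens)
    return result
```

Peeling known tokens off both ends and treating the remainder as the street name avoids needing an index of every street name, at the cost of misreading streets whose names are themselves directionals or suffixes.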

Step 5 - If you only want to determine a string that looks like an address, you're done. Feed this string into a geocoding tool and get the lat/lon that corresponds to it. Google Maps or OpenStreetMap should be able to do the trick for you.
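As a sketch of the geocoding hand-off, the snippet below builds a search request for Nominatim, OpenStreetMap's geocoder, and reads the coordinates out of a decoded response. It is Python, and it deliberately stops short of making the HTTP call; the helper names are hypothetical. Note that the public Nominatim instance has a usage policy that asks for an identifying User-Agent and a low request rate.

```python
from urllib.parse import urlencode

NOMINATIM_SEARCH = "https://nominatim.openstreetmap.org/search"


def build_geocode_url(address, limit=1):
    """Build a Nominatim free-form search URL for an address string."""
    query = urlencode({"q": address, "format": "json", "limit": limit})
    return f"{NOMINATIM_SEARCH}?{query}"


def first_lat_lon(results):
    """Pull (lat, lon) as floats from a decoded Nominatim JSON response.

    `results` is the parsed JSON list; returns None when it is empty.
    """
    if not results:
        return None
    top = results[0]
    # Nominatim returns coordinates as strings.
    return float(top["lat"]), float(top["lon"])
```

Fetch the URL with whatever HTTP client you prefer, decode the JSON body, and pass the resulting list to `first_lat_lon`.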

If you want to know whether an address is actually valid (that is, whether it matches a known entry in an authoritative dataset, such as the local post office's), then you'll need an address validation tool, which you can find with a simple Google search:

Google Search: "address validation"

Full disclosure: I spend a lot of time thinking about this very topic, trying to find different ways to solve it, and explaining it to a lot of people. I work with international addresses all day long at SmartyStreets.

Jeffrey