10

I'm using Java 6 and looking for an automated way to parse addresses. I'm not concerned with whether the addresses actually exist. The best thing I have found is JGeocoder (v0.4.1), but JGeocoder is unable to parse addresses like this:

16th Street Theater, Berwyn Cultural Center,  6420 16th St.

Does anyone know of a free Java address parser that is up to the challenge? By "parse" I mean the ability to distinguish street, city, state, postal code, and potentially the venue name (the above venue name is "16th Street Theater, Berwyn Cultural Center").

Matt
Dave
    Good luck. This is a well-known, _extremely_ difficult problem, mostly because of the infinite variety in address formatting. Having done a lot of this type of work back in the '80s, I can guarantee that no perfect (or even 99% perfect) solution exists. You need different parsing rules for different countries, and even for regions within a country, plus a large dictionary of exceptions. If you're limited to US addresses, the US Postal Service website may be of help. – Jim Garrison Apr 13 '12 at 19:35

3 Answers

9

Update: This topic is more exhaustively covered in this StackOverflow question.


I work for SmartyStreets where we parse and process addresses, and we have an answer. This is what we call "SLAP" or Single-Line Address Parsing (or Processing). The formal term is Named Entity Recognition (NER).

I'm not an expert on Java libraries, but I do know that in-house implementations tend not to live up to expectations. Here are some common reasons the people I've helped have run into difficulty:

  • Google / Yahoo! / Bing Maps web services do not allow automated queries and do not verify accuracy of the parsed address.

  • In-house code can also only make a best guess without any knowledge of existing addresses (a database) or other official sources. I know you want a library that can do this in-house, but at best you can make a guess...

  • By the way, regular expressions are not the answer. The best regex I've seen for parsing addresses was dynamically generated by hundreds of lines of code spread over several classes. It was a mess, and it was only correct for the types of addresses you'd expect, not for all the valid (US) formats that actually exist.
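To make the regex point concrete, here is a minimal sketch (not from the original answer) of a single hand-written pattern. The class name `NaiveAddressParser`, the pattern, and the "Berwyn, IL 60402" suffix in the first sample are assumptions made up for illustration. It accepts exactly one layout, `<number> <street>, <city>, <ST> <zip>`, and gives up on anything else, including the venue-style string from the question. It sticks to Java 6 syntax.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NaiveAddressParser {
    // Hypothetical pattern for ONE layout only: "<number> <street>, <city>, <ST> <zip>".
    private static final Pattern SIMPLE_US = Pattern.compile(
        "^\\s*(\\d+)\\s+([^,]+?)\\s*,\\s*([^,]+?)\\s*,\\s*([A-Za-z]{2})\\s+(\\d{5}(?:-\\d{4})?)\\s*$");

    /**
     * Returns {number, street, city, state, zip}, or null when the input
     * does not fit the single layout this pattern understands.
     */
    public static String[] parse(String input) {
        if (input == null) {
            return null;
        }
        Matcher m = SIMPLE_US.matcher(input);
        if (!m.matches()) {
            return null;
        }
        return new String[] {
            m.group(1), m.group(2), m.group(3), m.group(4).toUpperCase(), m.group(5)
        };
    }

    public static void main(String[] args) {
        String[] inputs = {
            // Illustrative input; city, state and ZIP were added for this example.
            "6420 16th St., Berwyn, IL 60402",
            // The string from the question: venue names first, no city/state/ZIP.
            "16th Street Theater, Berwyn Cultural Center,  6420 16th St."
        };
        for (int i = 0; i < inputs.length; i++) {
            String[] parts = parse(inputs[i]);
            System.out.println(inputs[i] + " -> "
                + (parts == null ? "no match" : Arrays.toString(parts)));
        }
    }
}
```

The first input splits into five fields; the second returns `null`, because the string starts with "16th" rather than a house number followed by a space. Every additional layout (venue prefixes, suites, PO boxes, missing ZIP codes) needs another pattern or another branch, which is how such code grows into the mess described above.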

This is an incredibly complex task... unless you have the right tools. One of our services is called LiveAddress API, and it's similar to Google Maps in that it parses addresses and geocodes them, but goes a step further by being CASS-Certified and returning only valid addresses, almost no matter the input format.

I encourage you to do some research of your own, but this is probably the most effective and reliable method.

Martijn Pieters
Matt
  • As I feared, this service isn't free. I'm not marking this as correct only b/c I did specify I was looking for something free. However, maybe you'll get some good promotion as people find this answer through Google. – Dave Apr 14 '12 at 18:04
  • Actually it is free; it only costs money if you choose a higher query limit than the default 250/mo. But of course, you should find something to meet your needs. Let us know what you decide if you do find something else! – Matt Apr 14 '12 at 19:11
3

https://code.google.com/p/usaddressparser/ parses a US address string and splits it into fields (number, street, suite, city, zip, etc.). It is available as a Java jar with sources.

0

If web services are allowed, you could try Google Maps.
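As a rough sketch of what that could look like (not part of the original answer), the snippet below only builds a request URL for the Google Geocoding web service's JSON endpoint; it does not send the request or parse the response. The class name `GeocodeUrlBuilder`, the sample address, and the `YOUR_API_KEY` placeholder are assumptions for illustration, and you should check the service's current terms and quotas before automating queries.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class GeocodeUrlBuilder {
    // JSON endpoint of the Google Geocoding web service.
    private static final String ENDPOINT =
        "https://maps.googleapis.com/maps/api/geocode/json";

    /** Builds the request URL; apiKey is a placeholder you must supply yourself. */
    public static String build(String address, String apiKey) {
        try {
            return ENDPOINT
                + "?address=" + URLEncoder.encode(address, "UTF-8")
                + "&key=" + URLEncoder.encode(apiKey, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is required to be available on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Illustrative address based on the street from the question.
        System.out.println(build("6420 16th St., Berwyn, IL", "YOUR_API_KEY"));
    }
}
```

The response would still need to be fetched and its address components read out with a JSON library of your choice.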

JohanB