2

Looking for a quick and dirty way to parse Australian street addresses into their parts:
3A/45 Jindabyne Rd, Oakleigh, VIC 3166

should split into:
"3A", 45, "Jindabyne Rd", "Oakleigh", "VIC", 3166

Suburb names can have multiple words, as can street names.


See: Parse A Steet Address into components

Has to be in Java, cannot make http requests (e.g. to web APIs).


EDIT: Assume that format specified is always followed. I have no issue with spitting incorrectly formatted strings back at the user with a message telling them to follow the format (which I've described above).

bguiz
  • Do you expect to see: "Unit 3A, 45 ...", "Flat 3A, 45 ...", "Apartment 3A, 45 ...", and the simple "3A, 45 ..." – Mike Mar 01 '10 at 12:59

7 Answers

8

Honestly, you're setting yourself a rather Sisyphean challenge here, and I'm not sure if it's worthwhile. Unless your data comes from a known source, with a very well specified format, you're going to get data that's completely useless. If you're dealing with free text, people screw up their addresses in ways you wouldn't believe.

Do you really want to try (yourself) to parse every possible combination of Richmond, Victoria, 3121 and Richmond 3121 VIC and Richmond VIC, 3121 etc? And that's just suburb granularity!

Addresses are even worse. Sure, most people would put 7/21 Smith St for a unit, or 29-33 Jones St for a location spanning multiple street numbers, but people aren't consistent. Is 1-5 Brown St unit 1 at number 5, or a location spanning #1 to #5 on that street? Is 7A a separate subdivided street address, or Unit A at #7?

Address matching is not a simple problem and if your data set is end-user-entered free text, I seriously wouldn't bother unless you have a trivial amount of data or don't care about accuracy that much (or, alternatively, have a lot of time for manual cleanups). If not, hand it off to a piece of software that does this work for you.

Australia Post have something called the Postal Address File (PAF) which contains every valid delivery location in Australia. There are a number of software libraries which will do the parsing + matching for you, and either give you a definitive answer (including all the individual address components, as you're after) or provide a list of potential matches for you to choose from if the address is non-existent or ambiguous. One example I'm aware of is QAS Batch (not affiliated with them in any way, evaluated their software in the past but didn't end up using it) but that's just one example; there's a list of others accessible through the PAF website.

Cannot recommend strongly enough that you don't waste your time on this unless it's at a trivial scale.

If it is, hey, yeah, regex.

Cowan
  • @Cowan, thanks for a well reasoned answer. However you may assume that the input string will conform to strict format. E.g. it will always be `Richmond, VIC 3121`, not any of the other formats. – bguiz Mar 02 '10 at 09:37
  • @Cowan is 100% correct. I do projects where address validation is required and we ALWAYS use an address cleaning service for all the reasons detailed here. FWIW, I have discovered a RELATIVELY inexpensive AU address option at http://www.addressify.com.au/. I do not work for them, but have used them. Much cheaper than the big players. – NullPumpkinException May 07 '17 at 23:41
  • I disagree with this. There are simple techniques that handle 99% of addresses, and the rest you can fix manually. There is also G-NAF, which is free and gets you on the right path; the difference between that and PAF is minimal for most applications. Get it from data.gov.au for free. – Dawesi Apr 06 '21 at 05:45
3

Given your reply to my other answer, this should do for the strictly-formatted case you specify:

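    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    String sample = "3A/45 Jindabyne Rd, Oakleigh, VIC 3166";
    // Groups: 2 = unit (optional), 3 = street number, 4 = street name,
    //         5 = suburb, 6 = state, 7 = postcode
    Pattern pattern = Pattern.compile("(([^/ ]+)/)?([^ ]+) ([^,]+), ([^,]+), ([^ ]+) (\\d+)");
    Matcher m = pattern.matcher(sample);
    // matches() requires the whole string to follow the format; find() would
    // also accept input with extra text before or after the address
    if (m.matches()) {
        System.out.println("Unit: " + m.group(2));
        System.out.println("Number: " + m.group(3));
        System.out.println("Street: " + m.group(4));
        System.out.println("Suburb: " + m.group(5));
        System.out.println("State: " + m.group(6));
        System.out.println("Postcode: " + m.group(7));
    } else {
        throw new IllegalArgumentException("Address does not match the expected format: " + sample);
    }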

This works if you remove the '3A/' (in which case m.group(2) will be null), if the street number is '45A' or '45-47', if we add a space to the road ('Jindabyne East Rd') or to the suburb ('Oakleigh South').
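To make those variations concrete, here is a small sketch that runs the same pattern over a few inputs. The class name `AddressVariants` and the `parse` helper are made up for illustration, and it uses `matches()` so that the whole string has to follow the format.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of any library.
class AddressVariants {
    // Same pattern as in the snippet above.
    private static final Pattern ADDRESS = Pattern.compile(
            "(([^/ ]+)/)?([^ ]+) ([^,]+), ([^,]+), ([^ ]+) (\\d+)");

    /** Returns {unit, number, street, suburb, state, postcode}, or null if the input does not match. */
    static String[] parse(String address) {
        Matcher m = ADDRESS.matcher(address);
        if (!m.matches()) {
            return null;
        }
        // The unit (group 2) is null when the "3A/" prefix is absent.
        return new String[] {
            m.group(2), m.group(3), m.group(4), m.group(5), m.group(6), m.group(7)
        };
    }

    public static void main(String[] args) {
        String[] samples = {
            "3A/45 Jindabyne Rd, Oakleigh, VIC 3166",
            "45 Jindabyne Rd, Oakleigh, VIC 3166",
            "45A Jindabyne Rd, Oakleigh, VIC 3166",
            "45-47 Jindabyne East Rd, Oakleigh South, VIC 3166"
        };
        for (String sample : samples) {
            System.out.println(sample + " -> " + Arrays.toString(parse(sample)));
        }
    }
}
```

Returning `null` for a non-matching string is just one option; throwing an exception, as in the snippet above, works equally well if you want to bounce bad input back to the user.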

Just to explain that regex further, if you're not familiar with regular expressions:

(([^/ ]+)/)? is the equivalent of just ([^/ ]+/)? -- that is, 'anything not including a forward slash or a space, followed by a slash'. The question mark makes it optional (so the whole clause can be missing), and the extra parentheses in the final version are to create a smaller inner group, without the slash, for later extraction.

([^ ]+) is 'capture anything that's not a space (which is followed by a space)' -- this is the street number.

([^,]+), is 'capture anything that's not a comma (which is followed by comma and space)' -- this is the street name. Anything is valid in the street name as long as it's not a comma.

([^,]+), is the same again, in this case to capture the suburb.

([^ ]+) captures the next non-space string (state abbreviation) and skips the space after it.

(\\d+) rounds off by capturing any number of digits (the postcode).

Hope that's helpful.

Cowan
2

Hm, probably quite difficult because the format is not well defined.

A regex would certainly work as a quick-and-dirty solution. The problem is that it will probably fail (produce incorrect results) in special cases.

Best bet is probably to hack up a small regex, then run that over a realistic dataset (ideally everything you have in production), and check if it gives good results. May be a lot of manual work, but probably the best you can do...

Edit: BTW, to use regexes in Java, use the methods from package java.util.regex. Just thought I'd mention it...
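A minimal sketch of that "run it over your data and look at the failures" idea, using `java.util.regex`. The class name, the `findFailures` helper and the candidate pattern are all assumptions made up for this example; substitute your own regex and your own data source.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper class for checking a candidate regex against real data.
class RegexDatasetCheck {
    /** Returns every input that the pattern does not match in full. */
    static List<String> findFailures(Pattern pattern, List<String> addresses) {
        List<String> failures = new ArrayList<>();
        for (String address : addresses) {
            if (!pattern.matcher(address).matches()) {
                failures.add(address);
            }
        }
        return failures;
    }

    public static void main(String[] args) {
        // Assumed candidate pattern: "<number> <street>, <suburb>, <STATE> <4-digit postcode>"
        Pattern candidate = Pattern.compile("\\d+ [^,]+, [^,]+, [A-Z]{2,3} \\d{4}");
        List<String> sample = Arrays.asList(
                "45 Jindabyne Rd, Oakleigh, VIC 3166",
                "Richmond 3121 VIC");
        // Print the inputs that need a closer look (or a better regex).
        for (String failure : findFailures(candidate, sample)) {
            System.out.println("No match: " + failure);
        }
    }
}
```

Note that this only catches strings the regex rejects. Strings that match but are split into the wrong parts still need a manual spot check.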

sleske
2

If anyone is interested, I wrote the following regex to parse Australian addresses.

r"(?i)(\b(PO BOX|post box)[,\s|.\s|,.|\s]*)?(\b(\d+))(\b(?:(?!\s{2,}).){1,60})\b(New South Wales|Victoria|Queensland|Western Australia|South Australia|Tasmania|VIC|NSW|ACT|QLD|NT|SA|TAS|WA).?[,\s|.\s|,.|\s]*(\b\d{4}).?[,\s|.\s|,.|\s]*(\b(Australia|Au))?"

And this one parses New Zealand addresses.

r"(?i)(\b(PO BOX|post box)[,\s|.\s|,.|\s]*)?(\b(\d+))(\b(?:(?!\s{2,}).){1,60})\b(Northland|Auckland|Waikato|Bay of Plenty|Gisborne|Hawke's Bay|Taranaki|Manawatu-Whanganui|Wellington|Tasman|Nelson|Marlborough|West Coast|Canterbury|Otago|Southland).?[,\s|.\s|,.|\s]*(\b\d{4}).?[,\s|.\s|,.|\s]*(\b(New zealand|Newzealand|Nz))?"
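
The patterns above are written as Python raw strings, while the question asks for Java. As a rough sketch, the Australian one could be used with `java.util.regex` like this; the class and method names are made up for illustration, and the only change to the pattern is doubling the backslashes for a Java string literal. Note that this pattern captures the street name and suburb together in a single group, so it does not split them apart.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical wrapper class; only the pattern itself comes from the answer above.
class AuAddressRegexSketch {
    private static final Pattern AU_ADDRESS = Pattern.compile(
            "(?i)(\\b(PO BOX|post box)[,\\s|.\\s|,.|\\s]*)?(\\b(\\d+))"
            + "(\\b(?:(?!\\s{2,}).){1,60})"
            + "\\b(New South Wales|Victoria|Queensland|Western Australia|South Australia"
            + "|Tasmania|VIC|NSW|ACT|QLD|NT|SA|TAS|WA).?[,\\s|.\\s|,.|\\s]*"
            + "(\\b\\d{4}).?[,\\s|.\\s|,.|\\s]*(\\b(Australia|Au))?");

    /** Returns {number, street and suburb text, state, postcode}, or null if nothing matches. */
    static String[] parse(String address) {
        Matcher m = AU_ADDRESS.matcher(address);
        if (!m.find()) {
            return null;
        }
        // Group 4 = number, 5 = everything between number and state, 6 = state, 7 = postcode.
        return new String[] { m.group(4), m.group(5).trim(), m.group(6), m.group(7) };
    }

    public static void main(String[] args) {
        String[] parts = parse("45 Jindabyne Rd, Oakleigh, VIC 3166");
        if (parts != null) {
            System.out.println("Number: " + parts[0]);
            System.out.println("Street and suburb: " + parts[1]);
            System.out.println("State: " + parts[2]);
            System.out.println("Postcode: " + parts[3]);
        }
    }
}
```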
Gihan Gamage
1

I have created a regex that extracts the address components (unit number, street number, street name including the suburb, state and postcode). It works on Australian addresses, but it can easily be customized for other countries; the only part that needs updating is the state list. https://regex101.com/library/5bj4wi

mrghofrani
Aran Dekar
0

You could use `String.split`, first with `,`, then with `.` or `/`.
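
A rough sketch of the split-based idea, assuming the strict format from the question (`3A/45 Jindabyne Rd, Oakleigh, VIC 3166`). One assumption to flag: after splitting on commas, this version splits on whitespace and on `/`, since that is what the sample format needs. The class and method names are made up for illustration.

```java
// Hypothetical helper class showing a String.split-only approach.
class SplitAddressParser {
    /** Returns {unit, number, street, suburb, state, postcode}; unit is null when absent. */
    static String[] parse(String address) {
        // "3A/45 Jindabyne Rd" | "Oakleigh" | "VIC 3166"
        String[] parts = address.split(",\\s*");
        if (parts.length != 3) {
            throw new IllegalArgumentException("Expected '<number> <street>, <suburb>, <state> <postcode>'");
        }
        // Limit 2 keeps multi-word street names in one piece.
        String[] numberAndStreet = parts[0].trim().split("\\s+", 2);
        String[] stateAndPostcode = parts[2].trim().split("\\s+");
        if (numberAndStreet.length != 2 || stateAndPostcode.length != 2) {
            throw new IllegalArgumentException("Expected '<number> <street>, <suburb>, <state> <postcode>'");
        }
        // "3A/45" -> unit "3A", number "45"; plain "45" has no unit.
        String[] unitAndNumber = numberAndStreet[0].split("/", 2);
        String unit = unitAndNumber.length == 2 ? unitAndNumber[0] : null;
        String number = unitAndNumber[unitAndNumber.length - 1];
        return new String[] {
            unit, number, numberAndStreet[1], parts[1].trim(),
            stateAndPostcode[0], stateAndPostcode[1]
        };
    }

    public static void main(String[] args) {
        for (String field : parse("3A/45 Jindabyne Rd, Oakleigh, VIC 3166")) {
            System.out.println(field);
        }
    }
}
```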

Valentin Rocher
0

For a commercial solution, you could give address-parser.com a try.

Mike Warner