1

My task is to convert the non-atomic (Australian)street addresses into atomic which means current street data is stored as street number and street names together. Samples are:

24 George street        -----------> 24         |   George street    
55 park rd              -----------> 55         |   pard rd  
102a gordon road        -----------> 102a       |   gordon road
unit 5/46 addison ave   -----------> unit 5/46  |   addison ave 
flat 2-9/87 north avenue-----------> flat 2-9/87|   north avenue
suit 5 lvl2/55 prince hwy-------> suit 5 lvl2/55|   prince hwy
shop 5 Big Shopping Centre  ------> Rejected
Suit 2 Level 100          -------> Rejected

added data(the way the program should work):

Darling street ------------------> Rejected
City road -----------------------> rejected

the suggested code processed result:

Darling street ------------>   Darling     |    Street
City road   --------------->   City        |     road

actually in this case the code should not process the address and throw an exception.

What is the best way of splitting the addresses?

animuson
  • 53,861
  • 28
  • 137
  • 147
Dilshad Abduwali
  • 1,388
  • 7
  • 26
  • 47
  • why should this rejected: 'shop 5 Big Shopping Centre' ? – CSᵠ May 12 '13 at 15:47
  • @kaᵠ- because it is not a valid Australian address which post office ville return/reject the delivery, according to the specs of the task – Dilshad Abduwali May 12 '13 at 16:03
  • I got that, but programs don't actually *understand* data, you need to teach them somehow. What would make that be rejected? – CSᵠ May 12 '13 at 16:13
  • This is a fuzzy data matching requirement, and not simply solved with SQL on its own. We've had a modicum of success using a 3rd party tool like QAS (http://www.qas-experian.com.au/), I'm sure there are other similar tools out there. – Jeffrey Kemp May 13 '13 at 06:41
  • How do you handle ambiguities such as:24 George Pt (meaning George Point) when the city is Pt. Macquarie. So an address string such as: 24 George Pt Macquarie NSW? Is it George Point or is is Port Macquarie? – Jeffrey May 13 '13 at 17:50

2 Answers2

3

I assume you have already seen the answers to this question along the same lines.

@kaᵠ pointed out already that the program doesn't know anything about the data, thus it has no context. This will always be the case. So, with that in mind, the first thing you need to determine is what level of accuracy do you need? If you need 70% accuracy then you can do that with simple REGEX. (Is regex EVER really simple)?

If you need certainty that the addresses you extracted from the input are actually real and valid, you need a list or table to compare against. That data would come from a source like Australia Post (or USPS in the United States).

So, use your regex to extract "guesses" and then verify those against a master list and the ones that match are good. Without the master list, you can't be sure that you got it right or that you got it wrong.

I have actually been working on this exact same issue at SmartyStreets (except I only deal with US addreses) and have come up with a number of different solutions - different ways to determine the beginning and end of the address string, as well as how to deal with false positives, or primary numbers that look just like a postal code. You can go pure REGEX or you can also use tables containing the postal codes, states, and also street names. This enables you to get very close to being able to extract the atomic data with high accuracy.

Community
  • 1
  • 1
Jeffrey
  • 502
  • 2
  • 10
2
select
   addr,
   regexp_substr(addr, '^(.*?)\s\D+$', 1, 1, '', 1) street_number,
   regexp_substr(addr, '^.*?\s+(\D*?)\s*$', 1, 1, '', 1) street_name
from t1   
where -- don't show rejected
   regexp_like(addr, '\d.*\s(street|road|rd|ave|avenue|hwy)\s*$', 'i')  

fiddle

Egor Skriptunoff
  • 23,359
  • 2
  • 34
  • 64
  • it works well, but what if there is no number at all in the street data. I is now taking the street name as street number in case there is no street number was recorded, is there anyway of preventing it to process if there is no number? – Dilshad Abduwali May 12 '13 at 14:55
  • @DilshatAbduwalli - Please add the address you are talking about to your question and give desired output. – Egor Skriptunoff May 12 '13 at 14:58
  • @Skritunoff - The question was edited according to the comment above – Dilshad Abduwali May 12 '13 at 15:05
  • @Skritunoff- can you please also tell me what has been changed in the fixed version as it is really difficult for a regexp beginner to distinguish the difference, please – Dilshad Abduwali May 12 '13 at 15:20
  • @EgorSkriptunoff - suggestion - last line might be changed to `regexp_like(LOWER(addr), '\d.*\s(street|road|rd|ave|avenue|hwy)\s*$', 'i')` on the off-chance that someone might use UPPER or Mixed case when entering an address. YMMV. – Bob Jarvis - Слава Україні May 12 '13 at 15:20
  • @DilshatAbduwalli - Click on the time following the word `edited` under my answer to see the difference. – Egor Skriptunoff May 12 '13 at 15:34
  • @EgorSkriptunoff - my apologies - I didn't take the time to look at all the args. :-} Thanks. – Bob Jarvis - Слава Україні May 12 '13 at 15:34
  • @EgorSkritunoff- if you still see this, please can you give me some a few mor advice. How should the regexp be changed if PO Box 123 is regarded as valid address and do not need to be splited and it goes to direct streetname as whole. Also in Australia there are addresses which indicates the crossing corners like Cnr alice street & George road. Any clue please? – Dilshad Abduwali May 13 '13 at 08:47
  • @DilshatAbduwalli - use `where regexp_like(addr, 'PO\s*Box\s*\d+', 'i')` to select PO Boxes. About corners I didn't understand. There are four houses at every crossroad. Which one of them is meant by `cnr street1 & street2`? – Egor Skriptunoff May 13 '13 at 14:16