3

Hey, I was wondering how I can find a Street Address in a string in Python/Ruby?

Perhaps by a regex?

Also, it's gonna be in the following format (US)

420 Fanboy Lane, Cupertino CA

Thanks!

Matt
Souleiman
  • 13
    Obligatory: http://xkcd.com/208/ – Mark Byers Dec 28 '10 at 00:50
  • You'd have to place some reasonable limits on what a "street address" is. How many numbers can it have? Does it have to have a proper ending (e.g. Rd, St, Ct)? How many words can it have before the ending (e.g. is 1337 Old Stack Overflow Questions Lane too long?) – Rafe Kettler Dec 28 '10 at 00:54
  • Haha, funny. Kinda what I want to do. – Souleiman Dec 28 '10 at 00:54
  • 2
    That would be quite the regular expression. Street addresses can be in _many_ formats. It's about as free-form as a text field gets. There are systems out there that come close at recognizing addresses in text (GMail, iPhone), but false negatives are common and false positives are downright amusing. So acceptable fault tolerance is going to be a big thing here. – David Dec 28 '10 at 00:54
  • @Rafe Kettler Umm, I guess the limit would be 3 words long (Old Cutler Lane) and it can have up to 4 digits (4280 Elizabeth Street) – Souleiman Dec 28 '10 at 00:55
  • darn you guys are beating me to the comments. – Souleiman Dec 28 '10 at 00:56
  • @Soule I'd recommend you make it up to 6 numbers. In the US, many addresses are in the tens of thousands (at least in certain areas). Also, what if someone lives in an apartment? – Rafe Kettler Dec 28 '10 at 00:56
  • It's only gonna be in that particular format stated above ;) Thanks for the help and sorry for tri-post – Souleiman Dec 28 '10 at 00:56
  • My website gets lots of emails from various local organizations about events they're hosting. What I want to do is extract the address and basic event info from those emails (my Gmail account) and put them on a website. I already have a system in place, but it requires colon-delimited data, which is inefficient, and the organizations are too lazy to follow the format. The addresses are just basic addresses, so they can follow an address format. Thanks for your help. – Souleiman Dec 28 '10 at 01:00
  • 2
    We should petition the US Postal Service to replace street addresses with GUIDs. (Though I imagine 911 phone calls will become difficult...) – David Dec 28 '10 at 01:01

6 Answers

5

Maybe you want to have a look at pypostal, the official Python bindings to libpostal.

Using the examples from Mike Bethany, I made this little example:

from postal.parser import parse_address

addresses = [
    "420 Fanboy Lane, Cupertino CA 12345",
    "1829 William Tell Oveture, by Gioachino Rossini 88421",
    "114801 Western East Avenue Apt. B32, Funky Township CA 12345",
    "1 Infinite Loop, Cupertino CA 12345-1234",
    "420 time!",
]

# Print the labeled components libpostal finds in each string
for address in addresses:
    print(parse_address(address))
    print("*" * 60)
>     [(u'420', u'house_number'), (u'fanboy lane', u'road'), (u'cupertino', u'city'), (u'ca', u'state'), (u'12345', u'postcode')]
>     ************************************************************
>     [(u'1829', u'house_number'), (u'william tell', u'road'), (u'oveture by gioachino', u'house'), (u'rossini', u'road'), (u'88421', u'postcode')]
>     ************************************************************
>     [(u'114801', u'house_number'), (u'western east avenue apt.', u'road'), (u'b32', u'postcode'), (u'funky', u'road'), (u'township', u'city'), (u'ca', u'state'), (u'12345', u'postcode')]
>     ************************************************************
>     [(u'1', u'house_number'), (u'infinite loop', u'road'), (u'cupertino', u'city'), (u'ca', u'state'), (u'12345-1234', u'postcode')]
>     ************************************************************
>     [(u'420', u'house_number'), (u'time !', u'house')]
>     ************************************************************
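
To decide whether a string "is an address", the labeled pairs can be checked for the components you care about. The following is only a sketch: `components_to_dict` and `looks_like_street_address` are hypothetical helper names, the required label set is an assumption, and the sample data is copied from the output above so the snippet runs without libpostal installed.

```python
from collections import defaultdict


def components_to_dict(parsed):
    """Group (value, label) pairs, shaped like parse_address output, by label."""
    grouped = defaultdict(list)
    for value, label in parsed:
        grouped[label].append(value)
    return dict(grouped)


def looks_like_street_address(parsed):
    # Hypothetical heuristic: require these four labels to be present.
    required = {"house_number", "road", "city", "state"}
    labels = {label for _, label in parsed}
    return required <= labels


# Sample pairs copied from the output shown above (not a live libpostal call).
good = [("420", "house_number"), ("fanboy lane", "road"),
        ("cupertino", "city"), ("ca", "state"), ("12345", "postcode")]
bad = [("420", "house_number"), ("time !", "house")]

print(components_to_dict(good))
print(looks_like_street_address(good))
print(looks_like_street_address(bad))
```

Which labels to require depends on how strict you need to be; dropping "state" or adding "postcode" changes what counts as a hit.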
2

Using your example this is what I came up with in Ruby (I edited it to include ZIP code and an optional +4 ZIP):

regex = Regexp.new(/^[0-9]* (.*), (.*) [a-zA-Z]{2} [0-9]{5}(-[0-9]{4})?$/)
addresses = ["420 Fanboy Lane, Cupertino CA 12345"]
addresses << "1829 William Tell Oveture, by Gioachino Rossini 88421"
addresses << "114801 Western East Avenue Apt. B32, Funky Township CA 12345"
addresses << "1 Infinite Loop, Cupertino CA 12345-1234"
addresses << "420 time!"

addresses.each do |address|
  print address
  if address.match(regex)
    puts " is an address"
  else
    puts " is not an address"
  end
end

# Outputs:
> 420 Fanboy Lane, Cupertino CA 12345 is an address  
> 1829 William Tell Oveture, by Gioachino Rossini 88421 is not an address  
> 114801 Western East Avenue Apt. B32, Funky Township CA 12345 is an address  
> 1 Infinite Loop, Cupertino CA 12345-1234 is an address  
> 420 time! is not an address  
SLaks
  • That code doesn't work for me... File "mini.py", line 1 regex = Regexp.new(/^[0-9]* (.*), (.*) [a-zA-Z]{2}$/) ^ SyntaxError: invalid syntax But thanks! – Souleiman Dec 28 '10 at 13:20
  • That's because it's Ruby, not Python. Sorry, thought that was obvious. –  Dec 28 '10 at 16:05
  • Oh yeah - I'm pretty stupid. Sorry! (I was like, what is this code, so weird.) Off to install Ruby. DreamHost supports it, right? Python and Ruby are similar, right? – Souleiman Dec 28 '10 at 20:27
  • Fixed! I changed your regex a bit: myregex=Regexp.new(/[0-9]{1,4} (.*), (.*) [a-zA-Z]{2} [0-9]{5}/) It includes the ZIP! It gives me the whole address from a list of random stuff. Thanks for your code, it allowed me to write my own and fix my issue! Would you mind if I posted my own answer with my version, or is that too arrogant? I can just accept yours instead. Thanks! – Souleiman Dec 28 '10 at 22:10
1

Here's what I used:

(\d{1,10}( \w+){1,10}( ( \w+){1,10})?( \w+){1,10}[,.](( \w+){1,10}(,)? [A-Z]{2}( [0-9]{5})?)?) 

It's not perfect and doesn't match edge cases but it works for most regularly typed addresses and partial addresses.

It finds addresses in text such as

Hi! I'm at 12567 Some St. Fairfax, VA. Come get me!

some text 12567 Some St. is my home

something else 123 My Street Drive, Fairfax VA 22033

Hope this helps someone
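
For reference, a small Python sketch that runs the pattern above against the three sample sentences. The pattern is copied as posted, minus the trailing space; `first_address` is a hypothetical helper name, and returning group 1 (the outermost group) is an assumption about how the match is meant to be read.

```python
import re

# Pattern from this answer, split across two raw strings for readability.
PATTERN = re.compile(
    r"(\d{1,10}( \w+){1,10}( ( \w+){1,10})?( \w+){1,10}[,.]"
    r"(( \w+){1,10}(,)? [A-Z]{2}( [0-9]{5})?)?)"
)


def first_address(text):
    """Return the first address-like span in text, or None if there is none."""
    match = PATTERN.search(text)
    return match.group(1) if match else None


samples = [
    "Hi! I'm at 12567 Some St. Fairfax, VA. Come get me!",
    "some text 12567 Some St. is my home",
    "something else 123 My Street Drive, Fairfax VA 22033",
]

for sample in samples:
    print(first_address(sample))
```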

Community
Matt Sich
0
\d{1,4}( \w+){1,3},( \w+){1,3} [A-Z]{2}

Not fully tested, but should work. Just use it with your favorite function from re (e.g. `re.findall`). Assumptions:

  1. A house number can be between 1 and 4 digits long
  2. 1-3 words follow a house number, and they're all separated by spaces
  3. City name is 1-3 words (needs to match Cupertino, Los Angeles, and San Luis Obispo)
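
A minimal sketch of using this pattern with `re.findall`. One change from the pattern as written, which is my assumption rather than part of the answer: the groups are made non-capturing with `(?: ...)`, because `re.findall` returns only the captured groups when a pattern contains any, not the whole match. `find_addresses` is a hypothetical helper name.

```python
import re

# Same pattern, with non-capturing groups so findall yields full matches.
ADDRESS_RE = re.compile(r"\d{1,4}(?: \w+){1,3},(?: \w+){1,3} [A-Z]{2}")


def find_addresses(text):
    """Return every substring of text that matches the address pattern."""
    return ADDRESS_RE.findall(text)


print(find_addresses("Meet me at 420 Fanboy Lane, Cupertino CA tonight"))
```

Alternatively, keep the original groups and use `re.finditer` with `match.group(0)` to get the full match.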
Rafe Kettler
  • +1 for a good answer even though I think mine is better ;). Mostly because it's more flexible. Of course that means I might have more false positives too. I also went with lower and upper case because I assumed input could be incorrectly entered. –  Dec 28 '10 at 01:17
  • @Mike `\w` matches anything that looks like a word, so it'll match capitalized words and lowercase ones. – Rafe Kettler Dec 28 '10 at 01:18
  • Yeah, but you only match up to 3 words, mine'll match anything. That could be the problem with mine too though since it will by necessity match more. Oh, and I meant lower/upper for the state. Sorry for the poor communication. –  Dec 28 '10 at 01:22
  • 1
    Thanks! But: import re pat = re.compile('\d{1,4}( \w+){1,3},( \w+){1,3} [A-Z]{2}') print pat.findall("420 Fanboy Lane, Cupertino CA") results in [(' Lane', ' Cupertino')], is that what it's supposed to do? Thanks – Souleiman Dec 28 '10 at 12:40
  • @Soule try prepending an `r` to your regex string, otherwise Python will treat the escapes weirdly. You might also check out the docs for re – Rafe Kettler Dec 28 '10 at 15:36
0

Okay, based on the very helpful responses from Mike Bethany and Rafe Kettler (thanks!), I got this regex, which works for Python and Ruby: /[0-9]{1,4} (.*), (.*) [a-zA-Z]{2} [0-9]{5}/

Ruby Code - Results in 12 Argonaut Lane, Lexington MA 02478

myregex=Regexp.new(/[0-9]{1,4} (.*), (.*) [a-zA-Z]{2} [0-9]{5}(-[0-9]{4})?/)

print "We're Having a pizza party at 12 Argonaut Lane, Lexington MA 02478 Come join the party!".match(myregex)

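Python Code - doesn't work quite the same, but this is the base code.

import re
# Python has no /.../ regex literals, so the surrounding slashes are dropped
myregex = re.compile(r'[0-9]{1,4} (.*), (.*) [a-zA-Z]{2} [0-9]{5}(-[0-9]{4})?')
match = myregex.search("We're Having a pizza party at 12 Argonaut Lane, Lexington MA 02478 Come join the party!")
# group(0) is the whole match; findall would return only the captured groups
print(match.group(0) if match else None)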
Souleiman
  • Tack on `(-[0-9]{4})?` at the end and you'll get optional +4 ZIP's too. –  Dec 31 '10 at 18:35
  • Ahah! So that's how you make something optional - put it in parentheses and put a ? after it? Thanks so much! – Souleiman Jan 01 '11 at 12:47
0

As stated, addresses are very free-form. Rather than the regex approach, how about a service that provides accurate, standardized address data? I work for SmartyStreets, where we provide an API that does this very thing. One simple GET request and you've got your address parsed for you. Try this Python sample out (you'll need to start a trial):

https://github.com/smartystreets/smartystreets-python-sdk/blob/master/examples/us_street_single_address_example.py

Martijn Pieters
Michael Whatcott
  • That seems like a cool service. The regexes worked fine for my project at the time. I'm rewriting the service, and according to some newly acquired knowledge, the Google Maps API does this and more for free. :P However, the caps are a downside. – Souleiman Nov 23 '11 at 19:15
  • Good news, the results are now returned in Title Case (not ALL CAPS). I'll be updating that sample soon. – Michael Whatcott Nov 25 '11 at 05:07
  • BTW, be aware that the Google Maps API cannot guarantee that a given address is currently real and deliverable--the USPS is the authority here. – Michael Whatcott Nov 25 '11 at 05:14
  • Ahah, it seems we have a minor misunderstanding. I meant that the caps (limits, no more than 2k queries I think) of the Google Maps API are a downside of it. I only need to place the address marker on a map, so if the sender sends me a bogus address it's his issue. I'll try to keep your site in mind if I need verification some day! ;) – Souleiman Nov 25 '11 at 12:43
  • Tested your service with my European address and it says it's in America lol – Paulo Botelho Mar 08 '17 at 22:10
  • That's curious, what's the address? @PauloBotelho – Michael Whatcott Mar 09 '17 at 03:20
  • Never mind, it was my bad: by default you only make US searches 0.o – Paulo Botelho Mar 10 '17 at 12:46
  • Correct. We have a separate international API: https://smartystreets.com/docs/cloud/international-street-api – Michael Whatcott Mar 10 '17 at 16:00