16

I have this text:

'''Hi, Mr. Sam D. Richards lives here, 44 West 22nd Street, New York, NY 12345. Can you contact him now? If you need any help, call me on 12345678'''

How can the address part be extracted from the above text using NLTK? I have tried the Stanford NER Tagger, which gives me only New York as a location. How can I solve this?

Akshat Zala
ngrj
  • Most people would give regular [expressions](https://docs.python.org/2/howto/regex.html) a try. Besides that, a short search on SO will give you plenty of [inspiration](http://stackoverflow.com/questions/14087116/extract-address-from-string). – patrick Jun 10 '16 at 21:22
  • Thanks! That gave me something to start with. – ngrj Jun 13 '16 at 10:56
  • Accept the answer please – Alex Jun 26 '16 at 11:29
  • patrick, that one's in php – tim-phillips Jun 13 '17 at 20:38
  • Here's a pretty solid [Python, NLTK write-up](https://medium.com/@acrosson/extracting-names-emails-and-phone-numbers-5d576354baa). I'll type it into an answer here with the summary after I implement it myself. – tim-phillips Jun 13 '17 at 21:30

4 Answers

15

Definitely regular expressions :)

Something like

import re

txt = ...
regexp = "[0-9]{1,3} .+, .+, [A-Z]{2} [0-9]{5}"
address = re.findall(regexp, txt)

# address = ['44 West 22nd Street, New York, NY 12345']

Explanation:

[0-9]{1,3}: 1 to 3 digits, the street number

(space): a space between the number and the street name

.+: the street name, one or more of any character

, (comma and space): the separator before the city

.+: the city, one or more of any character

, (comma and space): the separator before the state

[A-Z]{2}: exactly 2 uppercase letters from A to Z, the state abbreviation

(space): a space between the state and the ZIP code

[0-9]{5}: 5 digits, the ZIP code

re.findall(expr, string) returns a list of all the non-overlapping matches found.
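
If you also want the individual parts, here is a minimal sketch of the same idea with named groups. The group names and the `find_addresses` helper are my own illustration, not part of the answer above, and I swapped `.+` for `[^,]+` so that the street and city cannot run past a comma. It assumes the same "number street, city, ST 12345" layout and will miss anything else.

```python
import re

text = ("Hi, Mr. Sam D. Richards lives here, 44 West 22nd Street, New York, "
        "NY 12345. Can you contact him now? If you need any help, call me on 12345678")

# Hypothetical pattern: same shape as the regex above, but with named groups
# and [^,]+ instead of .+ for the street and city.
ADDRESS_RE = re.compile(
    r"\b(?P<number>[0-9]{1,3}) (?P<street>[^,]+), (?P<city>[^,]+), "
    r"(?P<state>[A-Z]{2}) (?P<zip>[0-9]{5})\b"
)

def find_addresses(s):
    """Return one dict of address parts per match found in s."""
    return [m.groupdict() for m in ADDRESS_RE.finditer(s)]

parts = find_addresses(text)
print(parts)
```

Each match comes back as a dict, so `parts[0]["city"]` gives you the city without any further string splitting.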

Alex
6

Pyap works best, not just for this particular example but also for other addresses contained in text.

import pyap

text = ...
addresses = pyap.parse(text, country='US')  # list of the addresses found in text
Bhio
  • For those finding this: as of mid-2022 this package hasn't been updated in 2 years. It's a regex-based approach and has the corresponding limitations. – Kyle Jul 15 '22 at 13:03
  • That said, if you just want the regex logic, here's a link to the US address regex: https://github.com/vladimarius/pyap/blob/master/pyap/source_US/data.py – Kyle Jul 15 '22 at 13:14
3

Check out libpostal, a library dedicated to parsing and normalizing addresses.

It cannot extract addresses from raw text, but it may help in related tasks.

jujule
  • Libpostal is used for normalising strings that have already been identified as addresses, which is a completely different task. – Boris Jul 16 '20 at 10:05
  • Yeah, libpostal is not really a solution for OP's question. It takes human-formatted addresses and makes them more machine-readable. For extraction, check out LexNLP. It's not well documented, but with a few dozen lines of code it does a damn good job. Where something like libpostal could help is in finding and correcting mistakes or adding missing data like postal codes. For this, though, it's easy enough to use Google's Address Validation API, which works extremely well. Where libpostal shines, though, is in its license: Google doesn't let you store returned data, for example. – Joel Mellon Feb 16 '23 at 19:36
2

For US address extraction from bulk text:

For US addresses in large blocks of text I have had pretty good, though not perfect, luck with the regex below. It won't work on many unusual addresses, and it only captures the first 5 digits of the ZIP code.

Explanation:

  • ([0-9]{1,5}) - a string of 1-5 digits to start off
  • (.{5,75}) - any character, 5 to 75 times. I looked at the addresses I was interested in, and the vast majority were over 5 and under 60 characters for address line 1, address line 2, and city.
  • (big list of American states and abbreviations) - this matches the state. It assumes state names are in Title Case and abbreviations are uppercase.
  • .{1,2} - any 1 or 2 characters, designed to accommodate the usual separators between the state and the ZIP: a comma followed by a space, or just a space
  • ([0-9]{5}) - captures the first 5 digits of the ZIP.

import re

text = "is an individual maintaining a residence at 175 Fox Meadow, Orchard Park, NY 14127. 2. other,"

# (?<!Baja ) is a negative lookbehind, so "Baja California" is not taken for the state
address_regex = r"([0-9]{1,5})(.{5,75})((?:Ala(?:(?:bam|sk)a)|American Samoa|Arizona|Arkansas|(?:(?<!Baja )California)|Colorado|Connecticut|Delaware|District of Columbia|Florida|Georgia|Guam|Hawaii|Idaho|Illinois|Indiana|Iowa|Kansas|Kentucky|Louisiana|Maine|Maryland|Massachusetts|Michigan|Minnesota|Miss(?:(?:issipp|our)i)|Montana|Nebraska|Nevada|New (?:Hampshire|Jersey|Mexico|York)|North (?:(?:Carolin|Dakot)a)|Ohio|Oklahoma|Oregon|Pennsylvania|Puerto Rico|Rhode Island|South (?:(?:Carolin|Dakot)a)|Tennessee|Texas|Utah|Vermont|Virgin(?:ia| Island(s?))|Washington|West Virginia|Wisconsin|Wyoming|A[KLRSZ]|C[AOT]|D[CE]|FL|G[AU]|HI|I[ADLN]|K[SY]|LA|M[ADEINOST]|N[CDEHJMVY]|O[HKR]|P[AR]|RI|S[CD]|T[NX]|UT|V[AIT]|W[AIVY])).{1,2}([0-9]{5})"

addresses = re.findall(address_regex, text)

addresses is then: [('175', ' Fox Meadow, Orchard Park, ', 'NY', '', '14127')]

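You can combine these and collapse the extra whitespace like so:

for address in addresses:
    out_address = " ".join(address)              # join the captured groups
    out_address = " ".join(out_address.split())  # collapse runs of whitespace
    print(out_address)  # 175 Fox Meadow, Orchard Park, NY 14127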
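
If the 5-digit limit on the ZIP bothers you, here is a rough sketch of a variant that also accepts ZIP+4. To keep it short I am assuming two-letter state abbreviations only (no full state names), and the `extract_us_addresses` helper name is just for illustration. It also makes the middle group lazy and adds word boundaries, which is a design choice of this sketch rather than something the regex above does.

```python
import re

# Assumption: two-letter postal abbreviations only, to keep the sketch short.
STATE_ABBR = (
    "A[KLRSZ]|C[AOT]|D[CE]|FL|G[AU]|HI|I[ADLN]|K[SY]|LA|M[ADEINOST]|"
    "N[CDEHJMVY]|O[HKR]|P[AR]|RI|S[CD]|T[NX]|UT|V[AIT]|W[AIVY]"
)

# number, 5-75 characters of street/city, state, separator, ZIP or ZIP+4
ADDRESS_RE = re.compile(
    r"\b([0-9]{1,5})\b(.{5,75}?)\b(" + STATE_ABBR + r")\b"
    r"[,\s]{1,2}([0-9]{5}(?:-[0-9]{4})?)\b"
)

def extract_us_addresses(text):
    """Return each matched address as one string with whitespace collapsed."""
    results = []
    for number, middle, state, zip_code in ADDRESS_RE.findall(text):
        joined = " ".join([number, middle, state, zip_code])
        results.append(" ".join(joined.split()))
    return results

sample = ("is an individual maintaining a residence at "
          "175 Fox Meadow, Orchard Park, NY 14127-1234. 2. other,")
print(extract_us_addresses(sample))
```

Like the original, this still relies on `.` not matching newlines, so an address split across lines will be missed unless you normalize the whitespace first.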

To then break this into a proper line 1, line 2, etc., I suggest using an address validation API such as Google's or Lob's. These can take a string and break it into parts. There are also Python solutions for this, such as usaddress.

Kyle