How can I extract address from raw text using NLTK in python?

Question

I have this text:

'''Hi, Mr. Sam D. Richards lives here, 44 West 22nd Street, New York, NY 12345. Can you contact him now? If you need any help, call me on 12345678'''

How can the address part be extracted from the text above using NLTK? I have tried the Stanford NER Tagger, which tags only New York as a location. How can I solve this?

Most people would give regular [expressions](https://docs.python.org/2/howto/regex.html) a try. Besides that, a short search on SO will give you plenty of [inspiration](http://stackoverflow.com/questions/14087116/extract-address-from-string). — patrick, Jun 10 '16 at 21:22
here's a pretty solid [python, nltk write up](https://medium.com/@acrosson/extracting-names-emails-and-phone-numbers-5d576354baa). i'll type it into an answer here with the summary after i implement it myself. — tim-phillips, Jun 13 '17 at 21:30

Alex · Accepted Answer · 2016-06-13T08:27:14.863

15

Definitely regular expressions :)

Something like

import re

txt = ...
regexp = r"[0-9]{1,3} .+, .+, [A-Z]{2} [0-9]{5}"
address = re.findall(regexp, txt)

# address = ['44 West 22nd Street, New York, NY 12345']

Explanation:

[0-9]{1,3}: 1 to 3 digits, the house number

(space): a space between the number and the street name

.+: street name, any character, one or more times

, (comma, space): a comma and a space before the city

.+: city, any character, one or more times

, (comma, space): a comma and a space before the state

[A-Z]{2}: exactly 2 uppercase letters from A to Z, the state

[0-9]{5}: 5 digits, the ZIP code

re.findall(expr, string) returns a list of all the matches found.
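
One caveat: .+ is greedy and can run across commas, so when a text contains several addresses the pattern above may swallow them into a single match. Below is a minimal sketch of a variant, assuming the same "number street, city, ST 12345" layout. The named groups, the [^,]+ fields and the name ADDRESS_RE are illustrative choices, not part of the answer above.

```python
import re

# Hypothetical variant of the pattern above: named groups, and [^,]+ instead
# of .+ so that a single field cannot run across a comma.
ADDRESS_RE = re.compile(
    r"(?P<number>[0-9]{1,3}) (?P<street>[^,]+), (?P<city>[^,]+), "
    r"(?P<state>[A-Z]{2}) (?P<zip>[0-9]{5})"
)

txt = (
    "Hi, Mr. Sam D. Richards lives here, 44 West 22nd Street, New York, "
    "NY 12345. Can you contact him now? If you need any help, "
    "call me on 12345678"
)

# One dict of address fields per match found in the text.
parts = [m.groupdict() for m in ADDRESS_RE.finditer(txt)]
print(parts)
```

Each match comes back as a dict of fields rather than one string, which may be easier to post-process.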

edited Jun 13 '16 at 08:27

answered Jun 13 '16 at 08:21


Deep, clear explanation. Where can I learn regular expressions in detail? – Ragu Natarajan Dec 24 '18 at 11:02
Is there any way to detect the address from text like this using Node.js, not python @Alex – Lakshmi Oct 07 '21 at 10:53
@Lakshmi it's the very same approach, just copy over the RegExp – Alex Oct 09 '21 at 17:10
i always find regex101.com very helpful – Joey Baruch Jan 06 '22 at 20:36

score 6 · Answer 2 · answered Oct 11 '18 at 05:47


Pyap works best not just for this particular example but also for other addresses contained in texts.

import pyap

text = ...
addresses = pyap.parse(text, country='US')


Bhio


For those finding this, as of mid-2022 this package hasn't been updated in 2 years. It's a regex-based approach and has the corresponding limitations. – Kyle Jul 15 '22 at 13:03
That said, if you just want the regex logic, here's a link to the US address regex logic: https://github.com/vladimarius/pyap/blob/master/pyap/source_US/data.py – Kyle Jul 15 '22 at 13:14

jujule · Answer 3 · 2020-05-28T15:40:02.860

3

Check out libpostal, a library dedicated to address parsing and normalization.

It cannot extract an address from raw text, but it may help in related tasks.

edited May 28 '20 at 15:40

answered Dec 14 '18 at 00:51


Libpostal is used for normalising strings that have already been identified as addresses, which is a completely different task. – Boris Jul 16 '20 at 10:05

Yeah, libpostal is not really a solution for OP's question. It takes human-formatted addresses and makes them more "machine" readable. For extraction, check out LexNLP. It's not well documented, but with a few dozen lines of code it does a damn good job. Where something like libpostal could help is in finding and correcting mistakes or adding missing data like postal codes. For this, though, it's easy enough to use Google's Address Validation API, which works extremely well. Where libpostal shines, though, is in its license. Google doesn't let you store returned data, for example. – Joel Mellon Feb 16 '23 at 19:36

score 2 · Answer 4 · answered Jul 15 '22 at 15:33

For US address extraction from bulk text:

For US addresses in bulk text I have had pretty good, though not perfect, luck with the regex below. It won't work on many unusual addresses, and it captures only the first 5 digits of the ZIP.

Explanation:

([0-9]{1,5}) - a string of 1-5 digits to start off.
(.{5,75}) - any character, 5-75 times. I looked at the addresses I was interested in, and the vast majority were over 5 and under 60 characters for address line 1, address line 2 and city.
(BIG LIST OF AMERICAN STATES AND ABBREVIATIONS) - matches the state. Assumes state names are in Title Case.
.{1,2} - designed to accommodate either a comma plus whitespace or just whitespace between the state and the ZIP.
([0-9]{5}) - captures the first 5 digits of the ZIP.


import re

text = "is an individual maintaining a residence at 175 Fox Meadow, Orchard Park, NY 14127. 2. other,"

address_regex = r"([0-9]{1,5})(.{5,75})((?:Ala(?:(?:bam|sk)a)|American Samoa|Arizona|Arkansas|(?:^(?!Baja )California)|Colorado|Connecticut|Delaware|District of Columbia|Florida|Georgia|Guam|Hawaii|Idaho|Illinois|Indiana|Iowa|Kansas|Kentucky|Louisiana|Maine|Maryland|Massachusetts|Michigan|Minnesota|Miss(?:(?:issipp|our)i)|Montana|Nebraska|Nevada|New (?:Hampshire|Jersey|Mexico|York)|North (?:(?:Carolin|Dakot)a)|Ohio|Oklahoma|Oregon|Pennsylvania|Puerto Rico|Rhode Island|South (?:(?:Carolin|Dakot)a)|Tennessee|Texas|Utah|Vermont|Virgin(?:ia| Island(s?))|Washington|West Virginia|Wisconsin|Wyoming|A[KLRSZ]|C[AOT]|D[CE]|FL|G[AU]|HI|I[ADLN]|K[SY]|LA|M[ADEINOST]|N[CDEHJMVY]|O[HKR]|P[AR]|RI|S[CD]|T[NX]|UT|V[AIT]|W[AIVY])).{1,2}([0-9]{5})"

addresses = re.findall(address_regex, text)

addresses is then: [('175', ' Fox Meadow, Orchard Park, ', 'NY', '', '14127')]

You can combine these and remove spaces like so:

for address in addresses:
    out_address = " ".join(address)
    out_address = " ".join(out_address.split())
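
Note that the loop above overwrites out_address on each pass, so to keep every result it may help to wrap the steps in a function. Below is a minimal sketch, assuming a shortened state list for readability. STATE, ADDRESS_RE and extract_addresses are hypothetical names; swap in the full pattern from above for real use.

```python
import re

# Assumption: a shortened, illustrative state alternation. The full list
# from the answer above would go here in practice.
STATE = r"(?:New York|NY|New Jersey|NJ|California|CA)"

ADDRESS_RE = re.compile(
    r"([0-9]{1,5})(.{5,75})(" + STATE + r").{1,2}([0-9]{5})"
)


def extract_addresses(text):
    """Return every matched address as one whitespace-normalized string."""
    results = []
    for groups in ADDRESS_RE.findall(text):
        joined = " ".join(groups)                  # glue the captured pieces
        results.append(" ".join(joined.split()))   # collapse repeated spaces
    return results


text = (
    "is an individual maintaining a residence at 175 Fox Meadow, "
    "Orchard Park, NY 14127. 2. other,"
)
cleaned = extract_addresses(text)
print(cleaned)
```

Making the state group non-capturing also avoids the empty string that the optional (s?) group adds to each tuple in the full pattern.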

To then break this into a proper line 1, line 2, etc., I suggest using an address validation API such as Google's or Lob's. These take a string and break it into parts. There are also some Python libraries for this, such as usaddress.
