Using natural language processing to extract an address from a tweet

Question

I'm building a Twitter bot that will listen for tweets like the following:

Hey @twitterbot, I'm looking for restaurants around 123 Main Street, New York

or, another example:

@twitterbot, what's near Yonge & Dundas, Toronto? I'm hungry!

It'll then reply with the kind of data you'd expect these questions to return. I've got most of the problem solved, but I'm stuck on something that shouldn't be so hard: extracting the address from the tweet.

I'll be forwarding the address to a geocoding service to get lat/lng, so I don't need to format or prepare the address in any way; I just need to isolate it from unrelated text like "I'm looking for restaurants around" or "I'm hungry!".

Are there any NLP tools that will perform this address-identification within a block of text? Any suggestions for another way to go about it? Because Google's geocoder handles such a wide array of address formats (even a point of interest like 'The eaton centre, Toronto' counts as an address), I can't use regex to pluck the address out.

Phrased another way, I just want to remove any text that is not part of an address.

The addresses I'm looking for need to work for US/Canada.

There are some similar questions on Stack Overflow, but none that I could find tackle this exact problem. Because Google's geocoder is so forgiving, the solution doesn't have to be perfect; it just needs to get rid of enough of the fuzz so that Google knows what I'm trying to say.

I'm very new to NLP so I'd appreciate any guidance on the subject.

score 6 · Accepted Answer · edited Jul 27 '20 at 10:29

The question "How to parse freeform street/postal address out of text, and into components" asks "Is there a way to isolate an address from the text around it and break it into pieces?", which is essentially the same question as yours (except that you only need to isolate the address from the rest of the text, not break it into pieces).

SmartyStreets also has a nice demo at https://smartystreets.com/demo?mode=extract , but unfortunately it is not a free solution.

Another quick thought: since Twitter posts are limited to 140 characters and tend to contain few words (your two examples have 12 and 9 words, respectively), you could conceivably just brute-force it. For example, to get the location in "@twitterbot, what's near Yonge & Dundas, Toronto? I'm hungry!", you could send all of the following to the Google geocoder:

what's near Yonge & Dundas, Toronto? I'm hungry!

what's near Yonge & Dundas, Toronto? I'm

what's near Yonge & Dundas, Toronto?

what's near Yonge & Dundas,

etc. for all possible substrings composed of complete words.
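The enumeration described above could be sketched roughly as follows. This is only an illustration: the function name `word_substrings` is made up here, dropping `@mention` tokens is an assumption based on the examples above (which omit "@twitterbot,"), and the step that actually sends each candidate to a geocoder is left out.

```python
def word_substrings(tweet):
    """Yield every contiguous run of whole words in the tweet, longest first."""
    # Assumption: @mentions such as "@twitterbot," are never part of an address.
    words = [w for w in tweet.split() if not w.startswith("@")]
    n = len(words)
    for length in range(n, 0, -1):          # longest candidates first
        for start in range(n - length + 1):
            yield " ".join(words[start:start + length])


tweet = "@twitterbot, what's near Yonge & Dundas, Toronto? I'm hungry!"
candidates = list(word_substrings(tweet))
print(len(candidates))   # 8 words left after dropping the mention -> 36 candidates
print(candidates[0])     # the full text without the mention
```

A tweet with n words yields n(n+1)/2 candidates, and each candidate would be one geocoder request, so request quotas are worth keeping in mind.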

Thanks Gabriel! You've given me a few ideas. Hadn't thought of a brute-force approach but given how limited tweets are in length it's totally feasible! — Joshua Comeau, Jul 13 '15 at 15:41

score 3 · Answer 2 · Ervin Ruci · 2015-12-12T18:08:40.350

Here you go: http://geocoder.ca/?locate=Hey+%40twitterbot%2C+I%27m+looking+for+restaurants+around+123+Main+Street%2C+New+York&geoit=xml&parse=1

<geodata>
  <latt>40.5119365</latt>
  <longt>-74.2493562</longt>
  <AreaCode>347,718</AreaCode>
  <TimeZone>America/New_York</TimeZone>
  <standard>
    <stnumber>123</stnumber>
    <staddress>Main ST</staddress>
    <city>STATEN ISLAND</city>
    <prov>NY</prov>
    <postal>11385</postal>
    <confidence>0.9</confidence>
  </standard>
</geodata>

or http://geocoder.ca/?locate=Hey+%40twitterbot%2C+I%27m+looking+for+restaurants+around+123+Main+Street%2C+New+York
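For anyone calling this from code, here is a rough Python sketch of building that URL and reading the XML shown above. The query parameters (`locate`, `geoit`, `parse`) and the element names are copied from this answer, not from any API documentation; the helper names `build_url` and `parse_geodata` are made up for illustration, and the sample response is parsed from a string rather than fetched over the network.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET


def build_url(text):
    # Parameter names are taken from the example URL in this answer.
    query = urlencode({"locate": text, "geoit": "xml", "parse": "1"})
    return "http://geocoder.ca/?" + query


def parse_geodata(xml_text):
    # Assumes the response has the same shape as the sample in this answer.
    root = ET.fromstring(xml_text)
    std = root.find("standard")
    parts = (std.findtext(tag) for tag in ("stnumber", "staddress", "city", "prov"))
    return {
        "lat": float(root.findtext("latt")),
        "lng": float(root.findtext("longt")),
        "address": " ".join(p.strip() for p in parts),
        "confidence": float(std.findtext("confidence")),
    }


SAMPLE = """<geodata>
  <latt>40.5119365</latt>
  <longt>-74.2493562</longt>
  <standard>
    <stnumber>123</stnumber>
    <staddress>Main ST</staddress>
    <city>STATEN ISLAND</city>
    <prov>NY</prov>
    <postal>11385</postal>
    <confidence>0.9</confidence>
  </standard>
</geodata>"""

tweet = "Hey @twitterbot, I'm looking for restaurants around 123 Main Street, New York"
url = build_url(tweet)
result = parse_geodata(SAMPLE)
print(url)
print(result)
```

The `confidence` value gives the caller something to threshold on before trusting the extracted address.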

edited Dec 12 '15 at 18:08

answered Dec 12 '15 at 17:55


Thanks for posting this. It's a great tool in theory, especially for a free/very cheap tool. Unfortunately it breaks quite easily. It does have a confidence score at least. It's common for a phone number to be near an address on a webpage for example, and this API almost always uses a segment of the phone number as the street number, for example this text I copied from a contact info card on Yelp: http://geocoder.ca/?locate=Business%20website%20http://www.joespizzanyc.com%20Phone%20number%20(212)%20366-1182%20Get%20Directions%207%20Carmine%20St%20New%20York,%20NY%2010014&geoit=xml&parse=1 – Joel Mellon Feb 15 '23 at 18:24
I got: 40.729519 -74.005138 212,917,646 America/New_York 1182 Carmine St New York NY 10014 0.4 from https://geocoder.ca/?locate=Business%20website%20http://www.joespizzanyc.com%20Phone%20number%20(212)%20366-1182%20Get%20Directions%207%20Carmine%20St%20New%20York,%20NY%2010014&geoit=xml&parse=1 – Ervin Ruci Apr 21 '23 at 01:35
Yeah, since it’s non-deterministic, results may vary™ – Joel Mellon Apr 22 '23 at 04:39
