For what it's worth, I'm having pretty good success on Japanese addresses using CaboCha, a Japanese dependency structure analyzer:
$ curl http://ir.yahoo.co.jp/jp/company/profile.html | cabocha -f 3 -n 2 | grep -B 5 -A 15 B-LOCATION
<tok id="89" feature="名詞,サ変接続,*,*,*,*,*" ne="I-ARTIFACT">=</tok>
</chunk>
<chunk id="6" link="-1" rel="D" score="0.000000" head="107" func="107">
<tok id="90" feature="名詞,数,*,*,*,*,*" ne="I-ARTIFACT">18</tok>
<tok id="91" feature="名詞,サ変接続,*,*,*,*,*" ne="I-ARTIFACT">"></tok>
<tok id="92" feature="名詞,固有名詞,地域,一般,*,*,東京,トウキョウ,トーキョー" ne="B-LOCATION">東京</tok>
<tok id="93" feature="名詞,接尾,地域,*,*,*,都,ト,ト" ne="I-LOCATION">都</tok>
<tok id="94" feature="名詞,固有名詞,地域,一般,*,*,港,ミナト,ミナト" ne="I-LOCATION">港</tok>
<tok id="95" feature="名詞,接尾,地域,*,*,*,区,ク,ク" ne="I-LOCATION">区</tok>
<tok id="96" feature="名詞,固有名詞,地域,一般,*,*,赤坂,アカサカ,アカサカ" ne="I-LOCATION">赤坂</tok>
<tok id="97" feature="名詞,数,*,*,*,*,*" ne="B-ARTIFACT">9</tok>
<tok id="98" feature="名詞,サ変接続,*,*,*,*,*" ne="I-ARTIFACT">-</tok>
<tok id="99" feature="名詞,数,*,*,*,*,*" ne="I-ARTIFACT">7</tok>
<tok id="100" feature="名詞,サ変接続,*,*,*,*,*" ne="I-ARTIFACT">-</tok>
<tok id="101" feature="名詞,数,*,*,*,*,*" ne="I-ARTIFACT">1</tok>
<tok id="102" feature="名詞,一般,*,*,*,*,*" ne="I-ARTIFACT">ミッドタウン・タワー</tok>
<tok id="103" feature="名詞,サ変接続,*,*,*,*,*" ne="I-ARTIFACT"></</tok>
<tok id="104" feature="名詞,一般,*,*,*,*,*" ne="I-ARTIFACT">a</tok>
<tok id="105" feature="名詞,サ変接続,*,*,*,*,*" ne="I-ARTIFACT">></</tok>
<tok id="106" feature="名詞,一般,*,*,*,*,*" ne="I-ARTIFACT">li</tok>
<tok id="107" feature="名詞,サ変接続,*,*,*,*,*" ne="I-ARTIFACT">></tok>
This pretty handily identifies the (start of the) address "東京都港区赤坂9-7-1 ミッドタウン・タワー" in the page. With a bit of post-processing logic one can use this to extract (Japanese) addresses, which are usually pretty uniform:
from subprocess import Popen, PIPE
import xml.etree.ElementTree as etree


def extract_addresses_from_plaintext(text):
    """Returns a list of what appear to be addresses in Japanese plaintext."""
    return [''.join(address)
            for address in _extract_possible_addresses(_analyze_with_cabocha(text))]


def _analyze_with_cabocha(text, ne='2'):
    # -f 3 selects XML output; -n 2 enables named-entity tags.
    p = Popen(['cabocha', '-f', '3', '-n', ne], stdin=PIPE, stdout=PIPE)
    # communicate() writes stdin, closes it, and drains stdout in one go,
    # which avoids the deadlock risk of writing to p.stdin by hand.
    result = p.communicate(text.encode('utf-8'))[0]
    # CaboCha emits one <sentence> element per sentence; wrap them in a
    # single root so the output parses as one XML document.
    return b'<sentences>' + result + b'</sentences>'


def _extract_possible_addresses(cabocha_xml):
    sentences = etree.fromstring(cabocha_xml)
    addresses = []
    for sentence in sentences:
        address = []
        for chunk in sentence:
            for tok in chunk:
                features = _get_cabocha_features(tok)
                if u'空白' in features:  # skip whitespace tokens
                    continue
                if (
                    tok.get('ne') in ('B-LOCATION', 'I-LOCATION') or
                    # Once inside an address, also accept block/lot numbers,
                    # hyphens, symbols, and building names:
                    (address and (
                        {u'数', u'サ変接続'} & features or
                        {u'記号', u'一般'} <= features or
                        {u'名詞', u'一般'} <= features
                    ))
                ):
                    address.append(tok.text)
                elif address:
                    addresses.append(address)
                    address = []
        if address:
            addresses.append(address)
    return addresses


def _get_cabocha_features(tok):
    # The feature string is comma-separated; '*' marks empty slots.
    return set(item for item in tok.get('feature').split(u',') if item != u'*')
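To sanity-check the token walk without a CaboCha installation, the same logic can be run against a canned fragment of `-f 3` output (the ids, scores, and feature strings below are illustrative, modeled on the dump above):

```python
import xml.etree.ElementTree as etree

# Canned CaboCha -f 3 style output (illustrative attributes only).
SAMPLE = u'''<sentences><sentence>
<chunk id="0" link="-1" rel="D" score="0.0" head="0" func="0">
<tok id="0" feature="名詞,固有名詞,地域,一般,*,*,東京" ne="B-LOCATION">東京</tok>
<tok id="1" feature="名詞,接尾,地域,*,*,*,都" ne="I-LOCATION">都</tok>
<tok id="2" feature="名詞,固有名詞,地域,一般,*,*,港" ne="I-LOCATION">港</tok>
<tok id="3" feature="名詞,接尾,地域,*,*,*,区" ne="I-LOCATION">区</tok>
<tok id="4" feature="名詞,数,*,*,*,*,*" ne="B-ARTIFACT">9</tok>
<tok id="5" feature="名詞,サ変接続,*,*,*,*,*" ne="I-ARTIFACT">-</tok>
<tok id="6" feature="名詞,数,*,*,*,*,*" ne="I-ARTIFACT">7</tok>
<tok id="7" feature="助詞,格助詞,一般,*,*,*,に" ne="O">に</tok>
</chunk>
</sentence></sentences>'''

def _features(tok):
    return set(f for f in tok.get('feature').split(u',') if f != u'*')

addresses, address = [], []
for sentence in etree.fromstring(SAMPLE):
    for chunk in sentence:
        for tok in chunk:
            features = _features(tok)
            if u'空白' in features:
                continue
            # Same acceptance rule as _extract_possible_addresses above.
            if (tok.get('ne') in (u'B-LOCATION', u'I-LOCATION') or
                    (address and ({u'数', u'サ変接続'} & features or
                                  {u'記号', u'一般'} <= features or
                                  {u'名詞', u'一般'} <= features))):
                address.append(tok.text)
            elif address:
                addresses.append(address)
                address = []
    if address:
        addresses.append(address)
        address = []

print([u''.join(a) for a in addresses])  # ['東京都港区9-7']
```

Note how the trailing particle に (a 助詞) breaks the run: it is neither a LOCATION token nor one of the accepted continuation categories, so the accumulated address gets flushed there.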
Coupling this with some BeautifulSoup pre-processing to strip the HTML first gives pretty decent results. Next I'll try applying similar techniques with English NER models trained on a gazetteer or some other good source, and see how that goes.
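BeautifulSoup's `get_text()` handles the HTML-stripping step well; for completeness, here's a minimal stdlib-only stand-in (Python 3, `html.parser`) that strips tags and drops `<script>`/`<style>` contents before the text is fed to `extract_addresses_from_plaintext`:

```python
from html.parser import HTMLParser

class _TextExtractor(HTMLParser):
    """Collects text content, skipping anything inside <script>/<style>."""
    def __init__(self):
        super().__init__()
        self._skip = 0   # nesting depth inside script/style elements
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in ('script', 'style'):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ('script', 'style') and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)

def html_to_text(html):
    parser = _TextExtractor()
    parser.feed(html)
    return ''.join(parser.parts)

print(html_to_text(u'<p>東京都<script>var x = 1;</script>港区</p>'))  # 東京都港区
```

This is only a sketch; for real pages BeautifulSoup is more forgiving of broken markup.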