I'm trying to extract (postal) addresses out of arbitrary, unstructured text (mostly websites). My idea is to approach this using a (semi) supervised machine learning algorithm. I have a pretty large corpus of addresses I can use to train the algorithm. Once trained, I'd like to feed in arbitrary chunks of text and get anything that resembles an address within that text back out. I imagine the match would be based partly on a structural similarity and partly based on matching keywords (city names and such).

What I'm not quite sure about is to what extent this is already covered by existing libraries, and how much I'd have to build myself. Would I have to break down a piece of text using natural language processing and then use a text similarity analysis? Or is there a simple technique or library that can handle this largely on its own?

I'm playing around with NLTK and scikit-learn in Python at the moment for this. I'm sure I can figure out a solution once I know the right keywords and techniques to look for, but I'm new to this field and would like a high level overview of how this problem can be approached best.

deceze

2 Answers

Given the large amount of variation and noise you will find in your data, I doubt you will be given any easy solution here. Parsing postal addresses from free text is a difficult research question on its own; training a classifier on the results of a parsing-agent (human or machine-based) adds several levels of complexity.

If you are dealing with USA addresses, the answer to this previous question gives an overview of the most common parsing methods.

emiguevara
  • Thank you for the response, I'll certainly look into this other answer in detail. In my case I'm mostly interested in *Japanese* addresses, which do tend to be pretty regular in many cases, give or take a bit of whitespace, and as I said I have a large corpus for training purposes available... – deceze May 22 '14 at 20:07
  • I see... still, I think that you will have to "massage" your examples quite a lot before training a learner. You need to extract the addresses and segment them into coherent parts (that is, basically, parse them into street, number, city, etc.). With such a database you will be able to extract features for each instance in the corpus and try a learning algorithm on them. – emiguevara May 22 '14 at 20:34
  • I see. Through what I was reading I got the idea that I'd have to use a part-of-speech classifier which was trained with a gazetteer or database of tagged place names, and then I'd basically be looking for sentences with a lot of place entity mentions at once. Does that sound like a useful approach if I were to use this at all? – deceze May 23 '14 at 05:16
  • Yes, that sounds like a very sensible approach for some parts of it (like you say, place names, streets, etc. could be parsed with a gazetteer). Street numbers and postal codes would need another gazetteer or a regex-based recognizer. People and company names are different still... I don't get your comment about lots of entity mentions in a sentence, though... First of all, addresses can span beyond sentence boundaries. Second, you don't just need a lot of entities to make an address: you need a defined (and possibly variable) set of them, in certain orders or combinations. – emiguevara May 23 '14 at 07:46
  • Yes, I guess that would be post-processing. I mean, to zero in on something that looks like an address I'd first look for "clumps" of place-entity-tagged parts of the text (as opposed to, say, single mentions of a city somewhere) and then try to post-process those to figure out what exactly is an address and what isn't (a rough sketch of this idea follows below). I guess I have to play around with this more; I just needed to see if I'm on the right track at all. Thanks. :) – deceze May 23 '14 at 07:52
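
To make the approach discussed in these comments concrete, here is a minimal illustration of the gazetteer-plus-regex idea for flagging "clumps" of place names. The gazetteer entries, the postal-code pattern and the looks_like_address helper are hypothetical placeholders for illustration, not part of NLTK or any other library:

import re

# Hypothetical gazetteer of place names; a real one would be loaded from a
# database or corpus of known prefectures, cities, wards, etc.
GAZETTEER = {u'東京都', u'港区', u'赤坂'}

# Hypothetical pattern for Japanese-style postal codes (e.g. 107-6290).
POSTAL_CODE = re.compile(r'\d{3}-\d{4}')


def looks_like_address(tokens, min_places=2):
    """Crude check: does this window of tokens contain a 'clump' of places?"""
    places = sum(1 for tok in tokens if tok in GAZETTEER)
    has_postal_code = any(POSTAL_CODE.search(tok) for tok in tokens)
    return places >= min_places or (places > 0 and has_postal_code)

Running a check like this over a sliding window of tokens would only flag candidate spans; segmenting them into street, number, city and so on would still need the kind of parsing described above.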

For what it's worth, I'm having pretty good success with Japanese addresses using the CaboCha Japanese Dependency Structure Analyser:

$ curl http://ir.yahoo.co.jp/jp/company/profile.html | cabocha -f 3 -n 2 | grep -B 5 -A 15 B-LOCATION
  <tok id="89" feature="名詞,サ変接続,*,*,*,*,*" ne="I-ARTIFACT">=</tok>
 </chunk>
 <chunk id="6" link="-1" rel="D" score="0.000000" head="107" func="107">
  <tok id="90" feature="名詞,数,*,*,*,*,*" ne="I-ARTIFACT">18</tok>
  <tok id="91" feature="名詞,サ変接続,*,*,*,*,*" ne="I-ARTIFACT">&quot;&gt;</tok>
  <tok id="92" feature="名詞,固有名詞,地域,一般,*,*,東京,トウキョウ,トーキョー" ne="B-LOCATION">東京</tok>
  <tok id="93" feature="名詞,接尾,地域,*,*,*,都,ト,ト" ne="I-LOCATION">都</tok>
  <tok id="94" feature="名詞,固有名詞,地域,一般,*,*,港,ミナト,ミナト" ne="I-LOCATION">港</tok>
  <tok id="95" feature="名詞,接尾,地域,*,*,*,区,ク,ク" ne="I-LOCATION">区</tok>
  <tok id="96" feature="名詞,固有名詞,地域,一般,*,*,赤坂,アカサカ,アカサカ" ne="I-LOCATION">赤坂</tok>
  <tok id="97" feature="名詞,数,*,*,*,*,*" ne="B-ARTIFACT">9</tok>
  <tok id="98" feature="名詞,サ変接続,*,*,*,*,*" ne="I-ARTIFACT">-</tok>
  <tok id="99" feature="名詞,数,*,*,*,*,*" ne="I-ARTIFACT">7</tok>
  <tok id="100" feature="名詞,サ変接続,*,*,*,*,*" ne="I-ARTIFACT">-</tok>
  <tok id="101" feature="名詞,数,*,*,*,*,*" ne="I-ARTIFACT">1</tok>
  <tok id="102" feature="名詞,一般,*,*,*,*,*" ne="I-ARTIFACT">ミッドタウン・タワー</tok>
  <tok id="103" feature="名詞,サ変接続,*,*,*,*,*" ne="I-ARTIFACT">&lt;/</tok>
  <tok id="104" feature="名詞,一般,*,*,*,*,*" ne="I-ARTIFACT">a</tok>
  <tok id="105" feature="名詞,サ変接続,*,*,*,*,*" ne="I-ARTIFACT">&gt;&lt;/</tok>
  <tok id="106" feature="名詞,一般,*,*,*,*,*" ne="I-ARTIFACT">li</tok>
  <tok id="107" feature="名詞,サ変接続,*,*,*,*,*" ne="I-ARTIFACT">&gt;</tok>

This pretty handily identifies the (start of the) address "東京都港区赤坂9-7-1 ミッドタウン・タワー" in the page. With a bit of post-processing logic one can use this to extract (Japanese) addresses, which are usually pretty uniform:

from subprocess import Popen, PIPE
import xml.etree.cElementTree as etree


def extract_addresses_from_plaintext(text):
    """Returns an array of what appear to be addresses from Japanese plaintext."""
    return [''.join(address) for address in _extract_possible_addresses(_analyze_with_cabocha(text))]


def _analyze_with_cabocha(text, ne='2'):
    # Run CaboCha with XML output (-f 3) and named-entity tagging (-n), then
    # wrap the output in a root element so it parses as one XML document.
    p = Popen(['cabocha', '-f', '3', '-n', ne], stdin=PIPE, stdout=PIPE)
    p.stdin.write(text.encode('utf-8'))
    result = p.communicate()[0]
    return '<sentences>%s</sentences>' % result


def _extract_possible_addresses(cabocha_xml):
    sentences = etree.fromstring(cabocha_xml)
    addresses = []
    for sentence in sentences:
        address = []
        for chunk in sentence:
            for tok in chunk:
                features = _get_cabocha_features(tok)
                # Skip whitespace tokens.
                if u'空白' in features:
                    continue
                # Start collecting on a LOCATION-tagged token; once inside an
                # address, also keep numbers, connective/symbol tokens and
                # general nouns (block numbers, building names and the like).
                if (
                        tok.get('ne') in ['B-LOCATION', 'I-LOCATION'] or
                        (address and (
                            {u'数', u'サ変接続'} & features or
                            {u'記号', u'一般'} <= features or
                            {u'名詞', u'一般'} <= features
                        ))
                ):
                    address.append(tok.text)
                elif address:
                    # The run of address-like tokens ended; save it and reset.
                    addresses.append(address)
                    address = []
        if address:
            addresses.append(address)
    return addresses


def _get_cabocha_features(tok):
    # Collect the token's morphological features, dropping '*' placeholders.
    return set(item for item in tok.get('feature').split(u',') if item != u'*')

Coupling this with some BeautifulSoup pre-processing to strip the HTML first gives pretty decent results. I'll try applying similar techniques to English-language classifiers trained on a gazetteer or other good source and see how that goes.
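
For reference, that pre-processing step can be as simple as stripping the markup with BeautifulSoup and handing the plain text to extract_addresses_from_plaintext above. This is only a sketch: it assumes the requests and BeautifulSoup 4 packages are installed and reuses the Yahoo! Japan page from the command-line example:

import requests
from bs4 import BeautifulSoup

# Same page as in the command-line example above.
html = requests.get('http://ir.yahoo.co.jp/jp/company/profile.html').text

# Strip the markup and collapse the page into plain text before running CaboCha.
text = BeautifulSoup(html, 'html.parser').get_text(separator=u'\n')

for address in extract_addresses_from_plaintext(text):
    print(address)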

deceze