I have a problem that I need help fixing. I am trying to create a script that crawls websites for mailing addresses, mostly German ones, but I am unsure how to write it. I have already written one that extracts email addresses from those websites, but postal addresses are puzzling because there isn't a fixed format. Here are a few German addresses as examples of the data I would like to extract.

Ilona Mustermann
Hauptstr. 76
27852 Musterheim


Andreas Mustermann
Schwarzwaldhochstraße 1
27812 Musterhausen


D. Mustermann
Kaiser-Wilhelm-Str.3
27852 Mustach

Those are just a few examples of what I am looking to extract from the websites. Is this possible to do with PHP?

Edit:

This is what I have so far

function extract_address($str) {
    $str = strip_tags($str);
    $Name = null;
    $zcC = null;
    $Street = null;

    foreach(preg_split('/([^A-Za-z0-9üß\-\@\.\(\) .])+/', $str) as $token) {
        if(preg_match('/([A-Za-z\.])+ ([A-Za-z\.])+/', $token)){
            $Name = $token;
        }

        if(preg_match('/ /', $token)){
            $Street = $token;
        }

        if(preg_match('/[0-9]{5} [A-Za-zü]+/', $token)){
            $zcC = $token;
        }

        if(isset($Name) && isset($zcC) && isset($Street)){
            echo($Name."<br />".$Street."<br />".$zcC."<br /><br />");
            $Name = null;
            $Street = null;
            $zcC = null;
        }
    }
}

It works for retrieving $Name (e.g. Ilona Mustermann) and the zip code/city (27852 Musterheim), but I am unsure of a regex that will always retrieve the street.


This is what I have come up with so far. It seems to work about 60% of the time for streets, while zip code/city and name work 100% of the time. Occasionally it fails when it tries to extract the street. Any idea why?

function extract_address($str) {
    $str = strip_tags($str);
    $Name = null;
    $zcC = null;
    $Street = null;

    foreach(preg_split('/([^A-Za-z0-9üß\-\@\.\(\)\& .])+/', $str) as $token) {
        if(preg_match('/([A-Za-z\&.])+ ([A-Za-z.])+/', $token) && !preg_match('/([A-Za-zß])+ ([0-9])+/', $token)){
            //echo("N:$token<br />");
            $Name = $token;
        }

        if(preg_match('/(\.)+/', $token) || preg_match('/(ß)+/', $token) || preg_match('/([A-Za-zß\.])+ ([0-9])+/', $token)){
            $Street = $token;
        }

        if(preg_match('/([0-9]){5} [A-Za-züß]+/', $token)){
            $zcC = $token;
        }

        /*echo("<br />
            N:$Name
            <br />
            S:$Street
            <br />
            Z:$zcC
            <br />
            ");*/

        if(isset($Name) && isset($zcC) && isset($Street)){
            echo($Name."<br />".$Street."<br />".$zcC."<br /><br />");
            $Name = null;
            $Street = null;
            $zcC = null;
        }
    }
}
ChrisF
Richard
  • Not if you want a reliable result every time. – Anigel May 15 '13 at 08:14
  • The format is pretty much "first name, last name, newline, street, newline, zip code, city", so you shouldn't have too many problems matching that with a regex. Also, check whether the HTML is semantic enough to use a DOM parser. – Gordon May 15 '13 at 08:23
  • I am new to using regex (because it's deprecated). I heard there are better alternatives, but I couldn't find one. How would I use regex efficiently to accomplish this goal? – Richard May 15 '13 at 08:47

3 Answers

1

Of course it is possible; you need to use the preg_match() function. It is all about writing a good regex pattern.

For example, to get the postal code and city:

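<?php
$str = "YOUR ADDRESS STRING HERE";
// Capture a run of digits (the postal code), a space, and a word (the city).
// $matches[1] holds the digits, $matches[2] the word.
preg_match('/([0-9]+) ([A-Za-z]+)/', $str, $matches);
print_r($matches);

?>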

The following regex matches the addresses you've given; you also need to add your native characters to it.

 [A-Za-züß.]+ [A-Za-z.üß]+\s[A-Za-z. 0-9ß-]+\s[0-9]+ [A-Za-züß.]+
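As a rough sketch of how a pattern like this could be applied to a whole block of text, the function below uses preg_match_all() with named groups and the `u` modifier so that umlauts and ß are handled as letters. The function name find_address_blocks() and the exact character classes are assumptions made for illustration, and the sketch only works when name, street and postal code/city sit on three consecutive lines, which many real pages will not do.

```php
<?php
// Sketch: pull "name / street / postal code + city" blocks out of plain text.
// The pattern is an illustrative assumption, not a full grammar for German addresses.
function find_address_blocks(string $text): array
{
    // Line 1: at least two words made of letters, dots, "&" or hyphens (the name).
    // Line 2: street text that ends in a house number, optionally with a letter suffix.
    // Line 3: exactly five digits, one space, then the city name.
    $pattern = '/^[ \t]*(?<name>[\p{L}.&\-]+(?:[ ][\p{L}.&\-]+)+)[ \t]*\R'
             . '[ \t]*(?<street>[\p{L}.\- ]+?[ ]?\d+[ ]?[a-zA-Z]?)[ \t]*\R'
             . '[ \t]*(?<zip>\d{5})[ ](?<city>[\p{L}.\- ]+?)[ \t]*(?=\R|\z)/mu';

    preg_match_all($pattern, $text, $matches, PREG_SET_ORDER);

    $result = [];
    foreach ($matches as $m) {
        $result[] = [
            'name'   => $m['name'],
            'street' => $m['street'],
            'zip'    => $m['zip'],
            'city'   => $m['city'],
        ];
    }
    return $result;
}

// Usage with two of the sample addresses from the question.
$sample = "Ilona Mustermann\nHauptstr. 76\n27852 Musterheim\n\n\n"
        . "D. Mustermann\nKaiser-Wilhelm-Str.3\n27852 Mustach\n";
print_r(find_address_blocks($sample));
```

Note that strip_tags() does not turn `<br>` into a newline, so line breaks would have to be restored before stripping tags for a line-based pattern like this to have any chance on real pages.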
Robert
  • What if my address string is an entire website's contents (file_get_contents)? Will this still work? Also, do I just replace '/([0-9]+) ([A-Za-z]+)/' with [A-Za-züß.]+ [A-Za-z.üß]+\s[A-Za-z. 0-9ß-]+\s[0-9]+ [A-Za-züß.]+? – Richard May 15 '13 at 08:43
  • So I guess there is no good way of accomplishing this task...? – Richard May 15 '13 at 16:25
  • It does not matter what it is. file_get_contents() returns a string and preg_match() works on strings; that is what matters. To fetch the website I'd suggest using curl instead of file_get_contents(). – Robert May 15 '13 at 20:58
  • Not sure if you have seen the edit to my original question, but I have it somewhat working. I just cannot get the streets correctly... – Richard May 15 '13 at 21:04
  • So give the content of a site where you have these streets and we can correct the regex. – Robert May 16 '13 at 05:51
  • www.kodeo.de http://www.robert-schuman-realschule.com www.kultur-ganz-oben.de http://downtown-achern.de There are about 3000 or so URLs like these; the address is located on each site's Impressum page. – Richard May 16 '13 at 10:10
1

It's impossible to get a reliable answer with regex with such a complicated string. That's the only correct answer to this question.

Vlad
0

Vlad Bondarenko is right.

In CS speak: Postal addresses do not form a regular language.

Extracting information is an active research topic. Regular expressions are not completely bogus, but will have a higher failure rate than approaches that use dictionaries ("gazetteers") or more advanced machine learning algorithms.
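To illustrate the dictionary idea in its simplest form, the sketch below uses a regex only to propose "five digits plus following text" candidates and then keeps a candidate only if the postal code and city appear together in a lookup table. The function name find_known_places() and the table contents are hypothetical placeholders taken from the question's sample addresses; a real gazetteer would be loaded from a postal-code dataset.

```php
<?php
// Hypothetical mini-gazetteer: postal code => known place names.
// These entries are placeholders based on the sample addresses in the question.
$gazetteer = [
    '27852' => ['Musterheim', 'Mustach'],
    '27812' => ['Musterhausen'],
];

// Sketch: propose candidates with a loose regex, then confirm them against the gazetteer.
function find_known_places(string $text, array $gazetteer): array
{
    // Candidate: five digits, spaces, then text that starts with a letter.
    preg_match_all('/\b(\d{5})[ ]+([\p{L}][\p{L}.\- ]*)/u', $text, $candidates, PREG_SET_ORDER);

    $hits = [];
    foreach ($candidates as $m) {
        $zip  = $m[1];
        $rest = $m[2];
        foreach ($gazetteer[$zip] ?? [] as $city) {
            // Accept only if the text after the postal code starts with a known city.
            if (strncmp($rest, $city, strlen($city)) === 0) {
                $hits[] = ['zip' => $zip, 'city' => $city];
                break;
            }
        }
    }
    return $hits;
}

$page = "Impressum\nFirma X, Hauptstr. 76, 27852 Musterheim\nTel. 04221 99999 Fax\n10115 Nirgendwo";
print_r(find_known_places($page, $gazetteer));
```

The lookup rejects digit runs that merely look like a postal code, such as phone number fragments, which a regex alone cannot tell apart. A confirmed postal code/city pair could then serve as an anchor for examining the text just before it for the street and name.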

A nice Stack Overflow Q&A is "How to parse freeform street/postal address out of text, and into components".

mvw