I have a problem that I need help fixing. I am trying to create a script that crawls websites for mailing addresses. Mostly German addresses, but I am unsure of how to create said script, I have created one already that extracts email addresses from said websites. But the address one is puzzling because there isn't a real format.. Here is a couple German addresses for examples on a way to possibly extract this data.
Ilona Mustermann
Hauptstr. 76
27852 Musterheim
Andreas Mustermann
Schwarzwaldhochstraße 1
27812 Musterhausen
D. Mustermann
Kaiser-Wilhelm-Str.3
27852 Mustach
Those are just a few examples of what I am looking to extract from the websites. Is this possible to do with PHP?
Edit:
This is what I have so far
function extract_address($str) {
$str = strip_tags($str);
$Name = null;
$zcC = null;
$Street = null;
foreach(preg_split('/([^A-Za-z0-9üß\-\@\.\(\) .])+/', $str) as $token) {
if(preg_match('/([A-Za-z\.])+ ([A-Za-z\.])+/', $token)){
$Name = $token;
}
if(preg_match('/ /', $token)){
$Street = $token;
}
if(preg_match('/[0-9]{5} [A-Za-zü]+/', $token)){
$zcC = $token;
}
if(isset($Name) && isset($zcC) && isset($Street)){
echo($Name."<br />".$Street."<br />".$zcC."<br /><br />");
$Name = null;
$Street = null;
$zcC = null;
}
}
}
It works to retrieve $Name(IE: Ilona Mustermann and City/zipcode(27852 Musterheim) but unsure of a regex to always retrieve streets?
Well this is what I have came up with so far, and it seems to be working about 60% of the time on streets, zip/city work 100% and so does name. But when it tries to extract the street occasionally it fails.. Any idea why?
function extract_address($str) {
$str = strip_tags($str);
$Name = null;
$zcC = null;
$Street = null;
foreach(preg_split('/([^A-Za-z0-9üß\-\@\.\(\)\& .])+/', $str) as $token) {
if(preg_match('/([A-Za-z\&.])+ ([A-Za-z.])+/', $token) && !preg_match('/([A-Za-zß])+ ([0-9])+/', $token)){
//echo("N:$token<br />");
$Name = $token;
}
if(preg_match('/(\.)+/', $token) || preg_match('/(ß)+/', $token) || preg_match('/([A-Za-zß\.])+ ([0-9])+/', $token)){
$Street = $token;
}
if(preg_match('/([0-9]){5} [A-Za-züß]+/', $token)){
$zcC = $token;
}
/*echo("<br />
N:$Name
<br />
S:$Street
<br />
Z:$zcC
<br />
");*/
if(isset($Name) && isset($zcC) && isset($Street)){
echo($Name."<br />".$Street."<br />".$zcC."<br /><br />");
$Name = null;
$Street = null;
$zcC = null;
}
}
}