I'm trying to get the contact information from http://www.internic.net/registrars/registrar-967.html using PHP. I was able to get the e-mail address from the href links like this:

$contactStr = "http://www.internic.net/registrars/registrar-967.html";
$contact_string = file_get_contents($contactStr);
preg_match_all('/<a href="(.*)">(.*)<\/a>/i', $contact_string, $contactInfo);
$email = str_replace("mailto:", "", $contactInfo[1][6]);

However, I'm having a hard time getting the address and the phone number, since they aren't wrapped in an HTML element such as <p> that I could target. I just need "1800 SW First Ave., Suite 440 Portland OR 97201 United States" and "310-467-2549" from this page. How can I do this with preg_match_all or some other approach? Thanks!

Andy Lester
Leah
  • [This](http://stackoverflow.com/questions/26947/how-to-implement-a-web-scraper-in-php) might be helpful. – David Jan 15 '13 at 01:45
  • Have a look at [DOMDocument](http://www.php.net/domdocument). – Ja͢ck Jan 15 '13 at 01:47
  • It is obligatory to also reference http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 surely the best answer of all time. – Devin Ceartas Jan 15 '13 at 02:36
  • **Don't use regular expressions to parse HTML**. You cannot reliably parse HTML with regular expressions. As soon as the HTML changes from your expectations, your code will be broken. See http://htmlparsing.com/php.html for examples of how to properly parse HTML with PHP modules. – Andy Lester Jan 15 '13 at 06:16

1 Answer

Instead of using regex, try DOMDocument, as others have suggested in the comments.
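As a small illustration of the idea, the e-mail lookup from the question can be done with DOMDocument and DOMXPath rather than a regex. This is only a sketch: the `$html` string below is made-up sample markup, not the real page, and the real page's structure may differ.

```php
<?php
// Hypothetical sample markup standing in for the downloaded page.
$html = '<table><tr><td width="420">Example Registrar, Inc.<br>'
      . '1 Example St.<br>'
      . '<a href="mailto:info@example.com">info@example.com</a></td></tr></table>';

$dom = new DOMDocument();
libxml_use_internal_errors(true);   // collect parse warnings instead of printing them
$dom->loadHTML($html);
libxml_clear_errors();

// Select every link whose href starts with "mailto:" and strip that prefix.
$xpath = new DOMXPath($dom);
$emails = array();
foreach ($xpath->query('//a[starts-with(@href, "mailto:")]') as $a) {
    $emails[] = substr($a->getAttribute('href'), strlen('mailto:'));
}

print_r($emails);
```

Unlike `$contactInfo[1][6]` in the question, this does not depend on the e-mail link being the seventh link on the page.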

Here is an example (a bit hacky, though); I hope it helps:

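function get_register_by_id($id){
    $site = file_get_contents('http://www.internic.net/registrars/registrar-'.$id.'.html');
    if($site === false){
        return array(); // download failed
    }
    $dom = new DOMDocument();
    // Collect libxml warnings about malformed HTML instead of silencing them with @
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($site);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    $result = array();
    foreach($dom->getElementsByTagName('td') as $td) {
        // Only look at cells with width="420"
        if($td->getAttribute('width')=='420'){
            // Rebuild the cell's inner markup; saveXML() serializes <br> as <br/>
            $innerHTML= '';
            foreach ($td->childNodes as $child) {
                $innerHTML .= trim($child->ownerDocument->saveXML($child));
            }
            // Split on line breaks, then trim each piece and strip any remaining tags
            $fixed = array_map('strip_tags', array_map('trim', explode("<br/>",trim($innerHTML))));
            foreach($fixed as $val){
                if($val === ''){continue;} // skip blank lines

                // Remove any '! ' sequences from the text
                $result[] = str_replace(array('! '),'',$val);
            }
        }
    }
    return $result;
}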


print_r(get_register_by_id(965));
/*Array
(
    [0] => Domain Central Australia Pty Ltd.
    [1] => Level 27
    [2] => 101 Collins Street
    [3] => Melbourne Victoria 3000
    [4] => Australia
    [5] => +64 300 4192
    [6] => robert.rolls@domaincentral.com.au
)*/
print_r(get_register_by_id(966));
/*
Array
(
    [0] => Web Business, LLC
    [1] => PO Box 1417
    [2] => Golden CO 80402
    [3] => United States
    [4] => +1.303.524.3469
    [5] => support@webbusiness.biz
)*/

print_r(get_register_by_id(967));
/*
Array
(
    [0] => #1 Host Australia, Inc.
    [1] => 1800 SW First Ave., Suite 440
    [2] => Portland OR 97201
    [3] => United States
    [4] => 310-467-2549
    [5] => registry-operations@moniker.com
)*/
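
To get just the address and phone number the question asks for, the returned array can be split by position. The sketch below assumes the layout seen in the outputs above (name first, e-mail last, phone second to last, address lines in between); `split_contact` is a hypothetical helper, and `$rows` is copied from the output for registrar 967 so the example runs without a network request.

```php
<?php
// Sample rows copied from the output shown above for registrar 967.
$rows = array(
    '#1 Host Australia, Inc.',
    '1800 SW First Ave., Suite 440',
    'Portland OR 97201',
    'United States',
    '310-467-2549',
    'registry-operations@moniker.com',
);

// Hypothetical helper. Assumes: name first, e-mail last, phone second to last,
// and everything in between is the address.
function split_contact(array $rows)
{
    $email = array_pop($rows);   // last element
    $phone = array_pop($rows);   // now-last element
    $name  = array_shift($rows); // first element
    return array(
        'name'    => $name,
        'address' => implode(' ', $rows), // remaining lines joined with spaces
        'phone'   => $phone,
        'email'   => $email,
    );
}

$contact = split_contact($rows);
echo $contact['address'], "\n";
echo $contact['phone'], "\n";
```

If a registrar page lists extra lines (a fax number, for example), the positional assumption breaks, so it is worth checking the result before relying on it.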
Lawrence Cherone