I am writing a scraper and I have the following code:

        //Open link prepended with domain
        $link='http://www.domain.de/'.$link;
        $data=@file_get_contents($link);
        $regex='#<span id="bandinfo">(.+?)<br><img src=".*?"  title=".*?" alt=".*?" >&nbsp;(.+?)&nbsp;(.+?)<br>(.+?)<br><a href=".*?">Mail-Formular</a>&nbsp;<img onmouseover=".*?" onmouseout=".*?" onclick=".*?" style=".*?" src=".*?" alt=".*?">&nbsp;<br><a href="tracklink.php.*?>(.+?)</a></span>#';
        preg_match_all($regex,$data,$match2);
        foreach($match2[1] as $info) echo $info."<br/>";

As you can see, the regex contains several capture groups. However, when I echo the result at the bottom, I only ever get the first captured group.

I thought the array would contain all of the captured groups. I need to save them in variables, but I do not know how to access them.

GEOCHET
Dennis Hackethal
  • [Obligatory link](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) - seriously though, trying to parse real-world HTML with what is definitely an over-simplified regex will not result in good things. [DOM](http://php.net/manual/en/book.dom.php)/[XPath](http://www.php.net/manual/en/class.domxpath.php) all the way... – DaveRandom Jun 09 '12 at 21:47
  • `var_dump($match2);` echoes what..? – Jeroen Jun 09 '12 at 21:47
  • With this particular regex, since you are matching an entire `<span>` which crucially has an `id` attribute, you would want `preg_match()` instead of `preg_match_all()` (although DOM is the *right* way, see above). It will work, but the array structure will be much more complex than it needs to be for what should result in only a single match (unless there are multiple spans with the same id, which would make the HTML invalid). – DaveRandom Jun 09 '12 at 21:51

2 Answers

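$match2[1] contains only the matches for the first capture group. The other groups are in $match2[2], $match2[3] and so on ($match2[0] holds the full match), so loop over $match2 itself:

foreach($match2 as $group) foreach($group as $info) echo $info."<br/>";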
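
To save each captured group in its own variable, one option is the PREG_SET_ORDER flag, which groups the result by match rather than by capture group. The following is only a minimal sketch: the markup in $data and the shortened pattern are hypothetical stand-ins for the real page and regex, and the variable names $name, $genre and $city are made up for illustration.

```php
<?php
// Hypothetical markup standing in for the fetched page.
$data = '<span class="bandinfo">Some Band<br>Rock&nbsp;Berlin<br></span>'
      . '<span class="bandinfo">Other Band<br>Pop&nbsp;Hamburg<br></span>';

// Simplified, hypothetical pattern with three capture groups.
$regex = '#<span class="bandinfo">(.+?)<br>(.+?)&nbsp;(.+?)<br></span>#';

// PREG_SET_ORDER returns one sub-array per match instead of one per group.
preg_match_all($regex, $data, $matches, PREG_SET_ORDER);

$bands = array();
foreach ($matches as $m) {
    // $m[0] is the full match; $m[1], $m[2] and $m[3] are the capture groups.
    list(, $name, $genre, $city) = $m;
    $bands[] = compact('name', 'genre', 'city');
}

foreach ($bands as $band) {
    echo $band['name'] . ' / ' . $band['genre'] . ' / ' . $band['city'] . "\n";
}
```

With the default PREG_PATTERN_ORDER the same values are available as $match2[1][0], $match2[2][0] and so on; PREG_SET_ORDER just makes the per-match loop easier to read.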
j0k
You should not use regex to parse HTML. Here's a simple function I've put together that uses DOMDocument, plus curl for fetching because it is faster.

Example scrape:

Looking for all a links that have an onmouseout attribute with the value return nd();:

<?php 
$link = 'http://www.bandliste.de/Bandliste/';
$data=curl_get($link, $link);
$info = DOMParse($data,'a','onmouseout','return nd();');
print_r($info);
/*
Array
(
    [0] => Array
        (
            [tag] => a
            [onmouseout] => return nd();
            [text] => Martin und Kiehm
        )

    [1] => Array
        (
            [tag] => a
            [onmouseout] => return nd();
            [text] => Blues For Three
        )

    [2] => Array
        (
            [tag] => a
            [onmouseout] => return nd();
            [text] => Phrase Applauders
        )
 ...

 ...
*/
?>

Or a second example, looking for a div with a class attribute of bandinfo:

<?php
$link = 'Bands/Falling_For_Beautiful/14469/';
$link='http://www.bandliste.de/'.$link;
$data=curl_get($link, $link);
$info = DOMParse($data,'div','class','bandinfo');
/*
Array
(
[0] => Array
(
[tag] => div
[class] => bandinfo
[text] => What? We are Falling For Beautiful and we make music. And basically  thats it. Sound? Rock. Indie. Alternative. Pop. Who? Adrianne (Vocals/Guitar) Nina (Guitar/Special Effects) Bianca (Bass) Marisa (Drums) When? Some of us started having a band in 2003  we played tons of gigs, covered tons of songs, started writing our own songs. In 2008 we decided to forget about that and founded FFB. So we started to write songs and arranged them. We made them sound simple and catchy focusing on lyrics. Our songs are about life.  Booking: Bianca Untertrifallerhttp://www.fallingforbeautiful.com
)

)
*/
?>

Or an image whose filename is embedded in the JavaScript of an onclick attribute:

First get all img tags that have an onclick:

<?php
$img = DOMParse($data,'img','onclick');
//Then find the image we are looking for
function parse_img($array){
    foreach($array as $value){
        if(strstr($value['onclick'],"Band Foto")){
            //Dot escaped so it only matches a literal "window.open"
            if(preg_match('#window\.open\(\'(.*?)\', \'Band Foto\'#',$value['onclick'],$match)){
                return $match[1];
            }
        }
    }
    return null; //No matching image found
}
//echo parse_img($img); //bandfoto-14469.jpg
?>

The actual dom function:

<?php
function DOMParse($source,$tags,$attribute=null,$attributeValue=null){
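    //Only send the header if output has not started yet
    if(!headers_sent()){header('Content-Type: text/html; charset=utf-8');}
    $return = array();
    $dom = new DOMDocument("1.0","UTF-8");
    //Set before loading; it has no effect afterwards
    $dom->preserveWhiteSpace = false;
    @$dom->loadHTML($source);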

    foreach($dom->getElementsByTagName($tags) as $ret) {
        //No attribute to look for so return only nodeValue
        if($attribute==null){
            if(trim($ret->nodeValue)==''){continue;}
            $return[] = array('tag'=>$tags,'text'=>preg_replace('/\s+/', ' ',$ret->nodeValue));
        }else{
            //Attribute given: look for e.g. src, href, class etc.
            if(trim($ret->nodeValue)=='' && $ret->getAttribute($attribute)==''){continue;}

            //If we are looking for a specific attribute value, keep only exact matches
            if($attributeValue!=null){
                if($ret->getAttribute($attribute)==$attributeValue){
                    $return[] = array('tag'=>$tags,$attribute=>$ret->getAttribute($attribute),'text'=>preg_replace('/\s+/', ' ',$ret->nodeValue));
                }
            }else{
                $return[] = array('tag'=>$tags,$attribute=>$ret->getAttribute($attribute),'text'=>preg_replace('/\s+/', ' ',$ret->nodeValue));
            }

        }
    }
    return $return;
}
?>
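
For comparison, the DOM/XPath route suggested in the comments can target the span from the question directly. This is a rough sketch rather than a drop-in replacement: the markup in $html is a hypothetical, much shorter stand-in for the real page, and it assumes the wanted values are plain text nodes directly inside the span.

```php
<?php
// Hypothetical markup standing in for a fetched page.
$html = '<html><body><span id="bandinfo">Some Band<br>Rock, Berlin</span></body></html>';

// Collect libxml parse warnings internally instead of suppressing them with @.
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

// Select every text node that is a direct child of the span with id "bandinfo".
$xpath = new DOMXPath($dom);
$nodes = $xpath->query('//span[@id="bandinfo"]/text()');

$parts = array();
foreach ($nodes as $node) {
    $text = trim($node->nodeValue);
    if ($text !== '') {
        $parts[] = $text;
    }
}
print_r($parts);
```

Because the br elements split the span into separate text nodes, each piece arrives as its own array entry and can be assigned to a variable without any regex.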

And the curl function:

<?php
function curl_get($url, $referer){
    //Fall back to file_get_contents if curl is not installed
    if(!function_exists('curl_init')){return file_get_contents($url);}

    $curl = curl_init();
    $header[0] = "Accept: text/xml,application/xml,application/json,application/xhtml+xml,";
    $header[0] .= "text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5";
    $header[] = "Cache-Control: max-age=0";
    $header[] = "Connection: keep-alive";
    $header[] = "Keep-Alive: 300";
    $header[] = "Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7";
    $header[] = "Accept-Language: en-us,en;q=0.5";
    $header[] = "Pragma: ";

    curl_setopt($curl, CURLOPT_URL, $url);
    curl_setopt($curl, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows NT 5.1; rv:5.0) Gecko/20100101 Firefox/5.0 Firefox/5.0');
    curl_setopt($curl, CURLOPT_HTTPHEADER, $header);
    curl_setopt($curl, CURLOPT_HEADER, 0);
    curl_setopt($curl, CURLOPT_REFERER, $referer);
    curl_setopt($curl, CURLOPT_ENCODING, 'gzip,deflate');
    curl_setopt($curl, CURLOPT_AUTOREFERER, true);
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($curl, CURLOPT_TIMEOUT, 30);
    //Disables certificate verification; acceptable only for throwaway scraping
    curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, false);

    $html = curl_exec($curl);
    curl_close($curl);
    return $html;
}
?>
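
Note that curl_exec() returns false when the request fails, and passing that on to a DOM parser will not end well. A possible guard is sketched below; fetch_html is a hypothetical helper name, not part of the functions above, and it deliberately sets only a few of the curl options.

```php
<?php
// Hypothetical helper: returns the response body, or null if the request failed.
function fetch_html(string $url): ?string {
    if (!function_exists('curl_init')) {
        // No curl extension available: fall back to file_get_contents.
        $body = @file_get_contents($url);
        return $body === false ? null : $body;
    }
    $curl = curl_init($url);
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($curl, CURLOPT_TIMEOUT, 30);
    $body = curl_exec($curl);
    if ($body === false) {
        // curl_error() describes why the transfer failed.
        error_log('curl failed: ' . curl_error($curl));
        return null;
    }
    return $body;
}

// Usage: only parse when something was actually fetched.
$html = fetch_html('http://localhost:1/');
if ($html === null) {
    echo "Fetch failed, nothing to parse\n";
}
```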

Hope it helps.

Lawrence Cherone