3

I worked out various regexes to scrape the data. For example, I can scrape images from the page source.

Here I scrape data from the table's td cells:

    <?php

    $s = file_get_contents('http://www.altassets.net/altassets-events');

    // Capture the contents of every <tr> (s: dot matches newlines, U: ungreedy)
    $matches = array();
    preg_match_all("/<tr>(.*)<\/tr>/sU", $s, $matches);
    $trs = $matches[1];

    // For each row, capture the contents of every <td>
    $td_matches = array();
    foreach ($trs as $tr) {
        $tdmatch = array();
        preg_match_all("/<td>(.*)<\/td>/sU", $tr, $tdmatch);
        $td_matches[] = $tdmatch[1];
    }

    var_dump($td_matches);
    //print_r($td_matches);
    ?>

I scrape images and titles in a similar way.

But how do I scrape data from a <p> tag with a specific class name?

<p class="review_comment ieSucks" itemprop="description" lang="en"> Some text </p>

Consider this page:

http://www.yelp.com/biz/fontanas-italian-restaurant-cupertino

This is just an example; I want to know the general procedure. The class name and tag name may change.

I want to scrape each review and its rating value from the page.

Adam Hopkinson

3 Answers

1

Don't use regular expressions. Use PHP's native DOMDocument or DOMXPath classes instead.

foreach($dom->getElementsByTagName('p') as $ptag)
{
    if($ptag->getAttribute('class')=="review_comment ieSucks")
    {
        echo $ptag->nodeValue; //"prints" Some text
    }
}

Loop through all the paragraph tags and check whether the class attribute matches; if it does, print the node's value.

Working Demo

Using file_get_contents()

<?php
libxml_use_internal_errors(true);
$html=file_get_contents('http://www.yelp.com/biz/fontanas-italian-restaurant-cupertino');
$dom = new DOMDocument;
$dom->loadHTML($html);
foreach($dom->getElementsByTagName('p') as $ptag)
{
    if($ptag->getAttribute('class')=="review_comment ieSucks")
    {
        echo "<h6>".$ptag->nodeValue."</h6>";
    }
}
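
For the rating value, the same DOM approach can be extended with DOMXPath and queries that are relative to each review node. The sketch below is only an illustration: the `review` container class, the `itemprop="ratingValue"` element, the sample text, and the function name `extract_reviews` are all assumptions, not taken from the actual page, so inspect the real source and adjust the queries.

```php
<?php
// Hypothetical markup: the real page structure is an assumption and must be
// checked against the actual page source before use.
$html = <<<'HTML'
<div class="review">
  <span itemprop="ratingValue">4.0</span>
  <p class="review_comment ieSucks" itemprop="description" lang="en"> Great pasta. </p>
</div>
<div class="review">
  <span itemprop="ratingValue">2.0</span>
  <p class="review_comment ieSucks" itemprop="description" lang="en"> Slow service. </p>
</div>
HTML;

// Hypothetical helper: returns one array('rating' => ..., 'text' => ...) per review.
function extract_reviews(string $html): array
{
    libxml_use_internal_errors(true);   // tolerate malformed real-world HTML
    $dom = new DOMDocument;
    $dom->loadHTML($html);
    libxml_clear_errors();
    $xpath = new DOMXPath($dom);

    $reviews = array();
    // One container per review (assumed class name "review")
    foreach ($xpath->query("//div[@class='review']") as $review) {
        // Queries starting with "." are evaluated relative to the current review
        $rating = $xpath->query(".//*[@itemprop='ratingValue']", $review)->item(0);
        $text   = $xpath->query(".//p[@itemprop='description']", $review)->item(0);
        $reviews[] = array(
            'rating' => $rating ? trim($rating->nodeValue) : null,
            'text'   => $text ? trim($text->nodeValue) : null,
        );
    }
    return $reviews;
}

print_r(extract_reviews($html));
```

Keeping the rating and the text in one loop over the review containers pairs each rating with its own review, which two separate page-wide queries would not guarantee.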
Shankar Narayana Damodaran
  • I am not sure, but I read that DOM is prone to giving errors in some cases compared to regex, so I avoid it. Do you have any idea about this? Can you please give a hint for my given link using DOM? @Shankar –  Mar 26 '14 at 09:24
  • I have a doubt. If I scrape the page source using `file_get_contents` and then apply the above logic, would that be enough? Can you please tell me how to get the rating value for that review? –  Mar 26 '14 at 09:26
  • 1
    Sure, it works when you pass the source! Also, see this for why you should not parse HTML with regex: http://stackoverflow.com/questions/6751105/why-its-not-possible-to-use-regex-to-parse-html-xml-a-formal-explanation-in-la – Shankar Narayana Damodaran Mar 26 '14 at 09:30
  • Is there any issue with using `file_get_contents`? It gives the source code, and then I use regex to remove the scripts. Is this bad practice? I would appreciate it if you could show me a model example of how to scrape data from a page using standard practice. –  Mar 26 '14 at 09:47
  • I don't know why someone downvoted your answer. Can you please tell me what the content of `@$dom->loadHTML($html);` is? Can I see the content of `$dom`? –  Mar 26 '14 at 09:58
  • 1
    I _usually_ get serial downvotes and I had flagged this to the mods, but unfortunately I got no help. Never mind. I modified the code a bit. By the way, why do you want to see its output? – Shankar Narayana Damodaran Mar 26 '14 at 10:03
  • 1
    Thanks man! Really good example and it works perfect. Voted up! – elPresta Aug 21 '16 at 19:47
1

Here is a complete example that fetches a page and gets elements by class name:

    function get_web_page( $url )
    {
        $user_agent='Mozilla/5.0 (Windows NT 6.1; rv:8.0) Gecko/20100101 Firefox/8.0';
        $options = array(
            CURLOPT_CUSTOMREQUEST  =>"GET",        //set request type post or get
            CURLOPT_POST           =>false,        //set to GET
            CURLOPT_USERAGENT      => $user_agent, //set user agent
            CURLOPT_COOKIEFILE     =>"cookie.txt", //set cookie file
            CURLOPT_COOKIEJAR      =>"cookie.txt", //set cookie jar
            CURLOPT_RETURNTRANSFER => true,     // return web page
            CURLOPT_HEADER         => false,    // don't return headers
            CURLOPT_FOLLOWLOCATION => true,     // follow redirects
            CURLOPT_ENCODING       => "",       // handle all encodings
            CURLOPT_AUTOREFERER    => true,     // set referer on redirect
            CURLOPT_CONNECTTIMEOUT => 120,      // timeout on connect
            CURLOPT_TIMEOUT        => 120,      // timeout on response
            CURLOPT_MAXREDIRS      => 10,       // stop after 10 redirects
        );
        $ch      = curl_init( $url );
        curl_setopt_array( $ch, $options );
        $content = curl_exec( $ch );
        $err     = curl_errno( $ch );
        $errmsg  = curl_error( $ch );
        $header  = curl_getinfo( $ch );
        curl_close( $ch );

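        // Stop if the request failed; $errmsg holds cURL's error text
        if ($content === false) {
            echo "cURL error ($err): $errmsg";
            return;
        }

        libxml_use_internal_errors(true); // suppress warnings from malformed HTML
        $dom = new DOMDocument();
        $dom->loadHTML($content);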
        $finder = new DomXPath($dom);
        $classname="CLASS_NAME";
        $nodes = $finder->query("//*[contains(@class, '$classname')]");

        foreach ($nodes as $key => $ele) {
            print_r($ele->nodeValue);
        }
    }

    get_web_page('DATA_SCRAP_URL_GOES_HERE');
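
One caveat with `contains(@class, ...)`: it is a substring test, so a search for `review_comment` would also match a class such as `review_comment_footer`. A common XPath 1.0 idiom pads the class list with spaces and matches a whole token instead. A minimal sketch, assuming the class name contains no quote characters; the helper name `find_by_class` and the sample markup are made up for illustration:

```php
<?php
// Hypothetical helper (not part of the example above): returns the trimmed text
// of every element whose class attribute contains $classname as a whole token.
// Assumes $classname contains no quote characters.
function find_by_class(string $html, string $classname): array
{
    libxml_use_internal_errors(true);   // tolerate malformed markup
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();

    $finder = new DOMXPath($dom);
    // Pad the normalized class list with spaces so only whole tokens match
    $query = "//*[contains(concat(' ', normalize-space(@class), ' '), ' $classname ')]";

    $texts = array();
    foreach ($finder->query($query) as $ele) {
        $texts[] = trim($ele->nodeValue);
    }
    return $texts;
}

// Sample markup made up for illustration
$sample = '<p class="review_comment ieSucks">Some text</p>'
        . '<p class="review_comment_footer">Other text</p>';

print_r(find_by_class($sample, 'review_comment'));
```

With this sample, only the first paragraph is returned, and the order of the class names inside the attribute no longer matters.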
Mihir Bhatt
0

You can use the Simple HTML DOM parser for this.

Usage is pretty simple:

// Create a DOM object from a string
$html = str_get_html('<html><body>Hello!</body></html>');

and then you can do something like this:

// Find all elements with id=foo
$ret = $html->find('#foo');

// Find all elements with class=foo
$ret = $html->find('.foo');
skywalker
  • @skywalker: I am not sure, but I read that DOM is prone to giving errors in some cases compared to regex, so I avoid it. Do you have any idea about this? Can you please give a hint for my given link using DOM? –  Mar 26 '14 at 09:25
  • 1
    Well, there will certainly be errors where the site's markup is broken, for example by unclosed tags, which even your regex can't fix. Some sites use JavaScript to generate HTML, so you have to take that into account. I suggest using either this solution or the one Shankar suggested. – skywalker Mar 26 '14 at 09:39