I want to scrape a star-based rating. This is the corresponding markup:

<div class="product_detail_info_rating_stars">
    <div class="product_detail_star full"></div>
    <div class="product_detail_star full"></div>
    <div class="product_detail_star full"></div>
    <div class="product_detail_star full"></div>
    <div class="product_detail_star"></div>
</div>

Every rating uses this snippet. I am looking for a way to convert each snippet into a number; this one would be a 4 (4 of 5 stars).

The approach that comes to mind is to match the whole block for each rating, then match the full class inside it and count the occurrences, but maybe there is a better way that I am not seeing.

Is there a better way to solve this problem?

Thanks!

rootman
  • What have you tried so far? What DOM library are you using? Why do you think you need a regexp? – Álvaro González Oct 16 '12 at 09:21
  • http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 You really ought to use a proper HTML parser, there's even one built into PHP (DOMDocument). – GordonM Oct 16 '12 at 09:24
  • I am not using a DOM library, as this is just a small scraping script for a WordPress plugin. I am currently working on a regex to match the inner divs; I would then loop through the matches and search for full. `/
    (
    )+
    <\/div>/msU` is what I've got so far. It needs testing, though, as I am not at all fluent in regex.
    – rootman Oct 16 '12 at 09:25
  • @GordonM I'll look into the parser, thanks. – rootman Oct 16 '12 at 09:31

1 Answer

Here is a quick example of how you can use DOMDocument, SimpleXML, and XPath to do this.

<?php
// Load the page HTML as a string
$html = file_get_contents('1page.htm');

// Buffer libxml warnings about invalid markup instead of printing them
libxml_use_internal_errors(true);

// Parse the HTML with DOMDocument, then wrap the result in a SimpleXML object
$doc = new DOMDocument();
$doc->strictErrorChecking = false;
$doc->loadHTML($html);
$xml = simplexml_import_dom($doc);

// Find every rating block
$blocks = $xml->xpath('//div[contains(@class, "product_detail_info_rating_stars")]');

foreach ($blocks as $block) {
    $count = 0;
    foreach ($block->children() as $child) {
        // Split the class attribute into tokens so class order and extra classes do not matter
        $classes = preg_split('/\s+/', trim((string) $child['class']));
        if (in_array('full', $classes, true)) {
            $count++;
        }
    }
    // $block->count() is the number of child elements, i.e. the total number of stars
    echo '<pre>Rating: ' . $count . ' of ' . $block->count() . '</pre>';
}

// Clear the buffered markup errors
libxml_clear_errors();

For a test HTML page like this:

<!doctype html>
<html>
<head></head>
<body>

<table>
    <tr>
        <td>
            <div class="product_detail_info_rating_stars">
                <div class="product_detail_star full"></div>
                <div class="product_detail_star"></div>
                <div class="product_detail_star"></div>
                <div class="product_detail_star"></div>
                <div class="product_detail_star"></div>
            </div>
        </td>
    </tr>
    <tr>
        <td>
            <div class="product_detail_info_rating_stars">
                <div class="product_detail_star full"></div>
                <div class="product_detail_star full"></div>
                <div class="product_detail_star"></div>
                <div class="product_detail_star"></div>
                <div class="product_detail_star"></div>
            </div>
        </td>
    </tr>
    <tr>
        <td>
            <div class="product_detail_info_rating_stars">
                <div class="product_detail_star full"></div>
                <div class="product_detail_star full"></div>
                <div class="product_detail_star full"></div>
                <div class="product_detail_star full"></div>
                <div class="product_detail_star"></div>
            </div>
        </td>
    </tr>
</table>

</body>
</html>

It will output something like:

Rating: 1 of 5
Rating: 2 of 5
Rating: 4 of 5

Play with this to adjust to your needs.
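
One possible adjustment, if you only need the numbers, is to let XPath do the counting through DOMXPath instead of looping over the children by hand. This is a rough sketch rather than a tested drop-in: it assumes PHP 7 or later (for the type hints), and count_star_ratings is a hypothetical helper name that is not part of the code above.

```php
<?php
// Hypothetical helper (name invented for this sketch): returns one integer per rating block.
function count_star_ratings(string $html): array
{
    // Buffer libxml warnings about invalid markup, then restore the previous setting
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);

    // Match "full" as a whole class token, not as a substring of another class name
    $fullStars = 'count(./div[contains(concat(" ", normalize-space(@class), " "), " full ")])';

    $ratings = [];
    foreach ($xpath->query('//div[contains(@class, "product_detail_info_rating_stars")]') as $block) {
        // evaluate() returns a float for count(), so cast it to int
        $ratings[] = (int) $xpath->evaluate($fullStars, $block);
    }
    return $ratings;
}

// Usage with a small inline sample: two full stars out of three
$sample = '<div class="product_detail_info_rating_stars">'
    . '<div class="product_detail_star full"></div>'
    . '<div class="product_detail_star full"></div>'
    . '<div class="product_detail_star"></div>'
    . '</div>';

print_r(count_star_ratings($sample));
```

For the inline sample the returned array holds a single rating of 2. Returning plain integers keeps the helper independent of how the plugin later stores or displays the rating.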

dfsq