I need to scrape the number 622104 from this HTML.

How can I get the number?

<div class="numbersBackground">
        <div id="ctl00_mainContent_playersOnlineNumberRepeater_ctl00_numberPanel" class="number">
        <div class="numberWrapper"><span>6</span></div>
    </div><div id="ctl00_mainContent_playersOnlineNumberRepeater_ctl01_numberPanel" class="number">
        <div class="numberWrapper"><span>2</span></div>
    </div><div id="ctl00_mainContent_playersOnlineNumberRepeater_ctl02_numberPanel" class="number">
        <div class="numberWrapper"><span>2</span></div>
    </div><div id="ctl00_mainContent_playersOnlineNumberRepeater_ctl03_commaPanel" class="comma">

    </div><div id="ctl00_mainContent_playersOnlineNumberRepeater_ctl04_numberPanel" class="number">
        <div class="numberWrapper"><span>1</span></div>
    </div><div id="ctl00_mainContent_playersOnlineNumberRepeater_ctl05_numberPanel" class="number">
        <div class="numberWrapper"><span>0</span></div>
    </div><div id="ctl00_mainContent_playersOnlineNumberRepeater_ctl06_numberPanel" class="number">
        <div class="numberWrapper"><span>4</span></div>
    </div>
</div>
AndrewFerrara

1 Answer

You could parse the HTML string with the DOMDocument class (via its loadHTML method), then use an XPath query (through the DOMXPath class) to find all the <div> tags that have a class="numberWrapper" attribute.

Then, iterate over those, concatenating their content to a variable -- which, at the end of the loop, will contain your number.


For example, you could have this kind of code:

$str = <<<HTML
... HERE YOUR HTML ...
HTML;

$number = '';

$dom = new DOMDocument();
if ($dom->loadHTML($str)) {
    $xpath = new DOMXPath($dom);
    $results = $xpath->query('//div[@class="numberWrapper"]');
    foreach ($results as $div) {
        $number .= $div->nodeValue;
    }
}

var_dump($number);

And, as output, you'd get (this is Xdebug's var_dump format; plain PHP prints string(6) "622104" instead):

string '622104' (length=6)


You could also use the following XPath query to make sure you're only working with the <span> tags:

$results = $xpath->query('//div[@class="numberWrapper"]/span');

Here, as each <div> contains only the <span>, the result is the same -- but it could differ if a <div> held other text besides the <span>.
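
To illustrate when the two queries diverge, here is a minimal sketch. The markup is made up for the example (an extra <em> label inside each wrapper, which is not in the question's HTML), and joinNodeValues() is just an illustrative helper name:

```php
<?php
// Hypothetical markup: each wrapper holds a digit <span> plus an extra <em> label.
$html = '<div class="numberWrapper"><span>6</span><em>a</em></div>'
      . '<div class="numberWrapper"><span>2</span><em>b</em></div>';

// Illustrative helper: concatenate the text of every node matched by $query.
function joinNodeValues(DOMXPath $xpath, string $query): string
{
    $out = '';
    foreach ($xpath->query($query) as $node) {
        $out .= $node->nodeValue;
    }
    return $out;
}

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// nodeValue of a <div> is the text of all its descendants, <em> included.
$fromDivs  = joinNodeValues($xpath, '//div[@class="numberWrapper"]');
// Restricting the query to the <span> keeps only the digits.
$fromSpans = joinNodeValues($xpath, '//div[@class="numberWrapper"]/span');

var_dump($fromDivs);   // "6a2b"
var_dump($fromSpans);  // "62"
```

With markup like this, only the <span> query gives a clean number.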


Of course (just to make sure it's said): regular expressions are not the right way to extract information from an HTML string.



Edit after the comment:

If there are other <div>s you don't want to take into account, you'll have to find another XPath query -- that matches what you want to extract.

For example, maybe something like this would do the trick:

$results = $xpath->query('//div[@class="numbersBackground"]//div[@class="numberWrapper"]/span');

Of course, it's up to you to find out exactly what matches the structure of your HTML document.
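
As a rough sketch of how the scoped query behaves, the snippet below uses a trimmed copy of the question's HTML (ids dropped for brevity) and adds one stray numberWrapper outside the numbersBackground container. That stray <div> is an assumption made up for the demonstration:

```php
<?php
$str = <<<HTML
<div class="numberWrapper"><span>9</span></div>
<div class="numbersBackground">
    <div class="number"><div class="numberWrapper"><span>6</span></div></div>
    <div class="number"><div class="numberWrapper"><span>2</span></div></div>
    <div class="number"><div class="numberWrapper"><span>2</span></div></div>
    <div class="comma"></div>
    <div class="number"><div class="numberWrapper"><span>1</span></div></div>
    <div class="number"><div class="numberWrapper"><span>0</span></div></div>
    <div class="number"><div class="numberWrapper"><span>4</span></div></div>
</div>
HTML;

$number = '';

$dom = new DOMDocument();
if ($dom->loadHTML($str)) {
    $xpath = new DOMXPath($dom);
    // Only wrappers nested somewhere inside the numbersBackground container match,
    // so the stray "9" at the top is ignored.
    $query = '//div[@class="numbersBackground"]//div[@class="numberWrapper"]/span';
    foreach ($xpath->query($query) as $span) {
        $number .= $span->nodeValue;
    }
}

var_dump($number); // "622104"
```

Note that @class="..." compares the whole attribute value, so an element with several classes (for example class="numberWrapper big") would not match this query.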


If you want to download the HTML, you have two solutions :
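
As a minimal sketch of one way to do it (an assumption on my part, not necessarily one of the options meant above): fetch the page into a string with file_get_contents(), which needs allow_url_fopen to be enabled, and hand that string to loadHTML. cURL would be another way to get the string. The function names fetchHtml() and extractPlayersOnline() are illustrative only:

```php
<?php
// Illustrative helper: download a page into a string.
// file_get_contents() can read http:// URLs only when allow_url_fopen is enabled.
function fetchHtml(string $url): string
{
    $html = file_get_contents($url);
    if ($html === false) {
        throw new RuntimeException("Could not download $url");
    }
    return $html;
}

// Illustrative helper: parse an HTML string and concatenate the digits found
// in the numberWrapper spans inside the numbersBackground container.
function extractPlayersOnline(string $html): string
{
    $number = '';
    $dom = new DOMDocument();
    if ($dom->loadHTML($html)) {
        $xpath = new DOMXPath($dom);
        $query = '//div[@class="numbersBackground"]//div[@class="numberWrapper"]/span';
        foreach ($xpath->query($query) as $span) {
            $number .= $span->nodeValue;
        }
    }
    return $number;
}

// Usage (not run here; the URL is the one from the comments and may no longer respond):
// $number = extractPlayersOnline(fetchHtml('http://www.bungie.net/stats/reach/online.aspx'));
```

Keeping the download and the parsing in separate functions means the parsing can be checked against a saved copy of the page without touching the network.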


As a side note, if you get warnings because your HTML is not valid, you'll want to take a look at the libxml_use_internal_errors() function ;-)
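
A short sketch of how that could look. The sloppy markup is invented for the example, and which errors libxml reports (if any) depends on the libxml version:

```php
<?php
// Deliberately sloppy, made-up markup: an unknown tag and unclosed elements.
$html = '<div class="numberWrapper"><span>6</span></div><bogus>'
      . '<div class="numberWrapper"><span>2</span>';

// Collect libxml errors internally instead of emitting PHP warnings.
$previous = libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->loadHTML($html);

$errors = libxml_get_errors();          // array of LibXMLError objects, possibly empty
libxml_clear_errors();                  // free the internal error buffer
libxml_use_internal_errors($previous);  // restore the previous setting

$xpath = new DOMXPath($dom);
$number = '';
foreach ($xpath->query('//div[@class="numberWrapper"]/span') as $span) {
    $number .= $span->nodeValue;
}

// Inspect whatever the parser complained about, if you care.
foreach ($errors as $error) {
    echo trim($error->message), ' (line ', $error->line, ")\n";
}

var_dump($number); // "62"
```

Restoring the previous setting afterwards keeps the change from leaking into other code that relies on the default warning behaviour.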

Pascal MARTIN
  • +1: "The" correct solution if the input can be trusted to be well-formed. Beat me to it. – Jon Mar 16 '11 at 19:14
  • @Jon `DOMDocument::loadHTML` accepts code that's not XML-valid : it works with broken HTML -- if not *too* broken. – Pascal MARTIN Mar 16 '11 at 19:15
  • what if there are more divs with a class of number wrapper? and what would I use to direct the script to the webpage rather than entering a string http://www.bungie.net/stats/reach/online.aspx – AndrewFerrara Mar 16 '11 at 19:19
  • @Andrew I've edited my answer with some additional information :-) – Pascal MARTIN Mar 16 '11 at 19:25
  • Simpler might be to just extract that snippet from the DOM, then `strip_tags()` to leave just the numbers. Of course, that assumes the digits in question are the only text nodes in the snippet. – Marc B Mar 16 '11 at 19:34
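
The approach from the last comment could be sketched roughly as follows, using a trimmed copy of the question's HTML (ids dropped). Serializing the container with saveHTML() and stripping whitespace with preg_replace() are my own choices for the sketch, not something the comment specifies:

```php
<?php
$str = <<<HTML
<div class="numbersBackground">
    <div class="number"><div class="numberWrapper"><span>6</span></div></div>
    <div class="number"><div class="numberWrapper"><span>2</span></div></div>
    <div class="number"><div class="numberWrapper"><span>2</span></div></div>
    <div class="comma"></div>
    <div class="number"><div class="numberWrapper"><span>1</span></div></div>
    <div class="number"><div class="numberWrapper"><span>0</span></div></div>
    <div class="number"><div class="numberWrapper"><span>4</span></div></div>
</div>
HTML;

$dom = new DOMDocument();
$dom->loadHTML($str);
$xpath = new DOMXPath($dom);

// Grab the container element, if it exists.
$container = $xpath->query('//div[@class="numbersBackground"]')->item(0);

$number = '';
if ($container !== null) {
    // Serialize just that element, drop the tags, then drop the leftover whitespace.
    $snippet = $dom->saveHTML($container);
    $number  = preg_replace('/\s+/', '', strip_tags($snippet));
}

var_dump($number); // "622104"
```

As the comment says, this only works while the digits are the only text inside the container.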