I am trying to access an HTML page and get a certain number from an element that is generated dynamically.

<span itemprop="average" content="XX"></span>

I want to retrieve the "XX" as a variable, which will be different for each page.

Is this done with HTML parsing or a simple preg_replace?

Thanks

3 Answers

If you are just getting started with scraping, I would recommend iMacros or import.io. I started using them when I began my scraping tasks, and they helped me understand how it all works a bit better. It is also very helpful to use cURL with PHP when scraping; it will be your best friend.
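
As a rough sketch of what the cURL part could look like: the function names `fetchHtml` and `extractAverage` and the URL are placeholders, and the DOM-based extraction step is an assumption for illustration rather than something the tools above prescribe.

```php
<?php
// Hypothetical helper: download a page's HTML with cURL.
function fetchHtml(string $url): string
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true, // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true, // follow redirects
        CURLOPT_TIMEOUT        => 10,   // give up after 10 seconds
    ]);
    $body = curl_exec($ch);
    if ($body === false) {
        throw new RuntimeException('cURL request failed: ' . curl_error($ch));
    }
    return $body;
}

// Hypothetical helper: read the content attribute of the "average" span.
function extractAverage(string $html): string
{
    if ($html === '') {
        return ''; // loadHTML() rejects an empty string
    }
    libxml_use_internal_errors(true); // keep malformed markup from emitting warnings
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    $xpath = new DOMXPath($dom);
    return $xpath->evaluate('string(//span[@itemprop="average"]/@content)');
}

// Usage with a placeholder URL:
// echo extractAverage(fetchHtml('https://example.com/some-page'));
```

Keeping the download and the extraction in separate functions means the extraction can be tried against a saved copy of the page without hitting the network.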

Jake Ison

Do not use regex to parse HTML. The best way is to use a parser. PHP 5 ships with the DOM extension built in, which provides DOMDocument (the parser) and DOMXPath (for querying the parsed document).

Here's an example using the two, for completeness:

$html = '<html><head></head><body>
<span itemprop="average" content="XX">some text</span>
<span itemprop="not_average">other text</span>
</body></html>';


$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

$nodelist = $xpath->query( "//span[@itemprop='average']" );

foreach ($nodelist as $node){
    print $node->getAttribute('content')."<br>";
}

The only catch is that the DOMDocument parser is a lot stricter than a browser's parser and will hiccup (emit warnings) on some real-world pages.

Tivie
  • This, and many others! See the SO question [HTML Parsers in PHP](http://stackoverflow.com/a/3577662/292735). If you know the structure, XPath will be able to pick up your value in almost a single line of code. – MackieeE Nov 12 '13 at 19:17
  • I suggest using DOMXPath::evaluate() instead of DOMXPath::query(). evaluate() can return scalars, not only node lists. This allows you to fetch the value as a string with a single XPath expression: "string(//span[@itemprop='average']/@content)" – ThW Nov 13 '13 at 08:54

Using DOM is usually the best idea for stuff like that.

$html = <<<HTML
<html>
  <body>
    <span itemprop="average" content="XX"></span>
  </body>
</html>
HTML;

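// Collect libxml parse errors internally instead of emitting PHP warnings
libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// string() makes evaluate() return the attribute value, or '' if nothing matches
$content = $xpath->evaluate('string(//span[@itemprop = "average"]/@content)');

var_dump($content); // string(2) "XX"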

libxml_use_internal_errors(TRUE) suppresses the warnings that would otherwise be output for bad HTML and stores the errors internally instead. You can use libxml_get_errors() to read them and libxml_clear_errors() to clear the error buffer.
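
A minimal sketch of reading and clearing that buffer, assuming deliberately odd sample markup (the `<bogus>` tag is made up for illustration; which errors get recorded, if any, depends on the libxml2 version PHP is built against):

```php
<?php
// Illustrative input: <bogus> is not a real HTML element.
$html = '<html><body><bogus><span itemprop="average" content="XX"></span></bogus></body></html>';

// Buffer parse errors instead of emitting warnings; remember the old setting.
$previous = libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->loadHTML($html);

// Each entry is a LibXMLError object with line, column, code and message.
$errors = libxml_get_errors();
foreach ($errors as $error) {
    printf("line %d: %s\n", $error->line, trim($error->message));
}

libxml_clear_errors();                 // empty the buffer for the next parse
libxml_use_internal_errors($previous); // restore the earlier setting
```

Restoring the previous setting is optional; it just avoids changing error handling for unrelated code that runs later in the same request.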

Next, a DOMDocument is created and the HTML is loaded. DOMDocument::loadHTMLFile() would allow you to load it from a file or URL instead.
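
For instance, loading from a file might look like the sketch below. The temporary file only exists to make the example self-contained; in practice you would pass your own path or URL (whether remote URLs load depends on how PHP's stream settings are configured).

```php
<?php
// Write sample markup to a temporary file so there is something to load.
$path = tempnam(sys_get_temp_dir(), 'avg');
file_put_contents(
    $path,
    '<html><body><span itemprop="average" content="XX"></span></body></html>'
);

libxml_use_internal_errors(true);

$dom = new DOMDocument();
// loadHTMLFile() takes a path (or URL) rather than a string of markup.
if ($dom->loadHTMLFile($path) === false) {
    throw new RuntimeException("Could not load $path");
}
libxml_clear_errors();

$xpath = new DOMXPath($dom);
$fromFile = $xpath->evaluate('string(//span[@itemprop="average"]/@content)');

unlink($path); // remove the temporary file
```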

After loading the document, you can create a DOMXPath object for it, which allows you to query nodes from the document.

DOMXPath::evaluate() allows you to fetch both node lists and scalar values from the document. The string() function inside the XPath expression casts the first matching attribute node to a string and returns its value. Without the cast, the result would be a DOMNodeList containing any number of DOMAttr nodes. With it, the result is the attribute value, or an empty string if nothing matches.
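
To make the difference concrete, here is a small sketch comparing the result types. The second span and the "missing" lookup are additions for illustration only.

```php
<?php
$html = '<html><body>
<span itemprop="average" content="XX"></span>
<span itemprop="other" content="YY"></span>
</body></html>';

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();
$xpath = new DOMXPath($dom);

// Without string(): a DOMNodeList of DOMAttr nodes.
$nodes = $xpath->evaluate('//span[@itemprop="average"]/@content');

// With string(): the attribute value as a plain PHP string.
$value = $xpath->evaluate('string(//span[@itemprop="average"]/@content)');

// No match: an empty string, so no need to check for an empty node list.
$missing = $xpath->evaluate('string(//span[@itemprop="missing"]/@content)');

// Other scalar XPath functions work too; count() comes back as a PHP float.
$count = $xpath->evaluate('count(//span[@content])');
```

With the node list you would still have to pick an item and read its `value` property, which is the extra step the string() cast saves.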

ThW