21

I would like to grab data from a table without using regular expressions. I've enjoyed using simplexml for parsing RSS feeds and would like to know if it can be used to grab a table from another page.

E.g., fetch the page with cURL or simply file_get_contents(), then use SimpleXML to extract the contents?

chris

4 Answers

40

You can use the DOMDocument::loadHTML method from the DOM extension, and then import the resulting document into SimpleXML via simplexml_import_dom:

$html = file_get_contents('http://example.com/');
$doc = new DOMDocument();
$doc->loadHTML($html);
$sxml = simplexml_import_dom($doc);
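Since the question is about grabbing a table, here is a minimal sketch of how the imported document might be queried with XPath. The inline markup and the `prices` id are hypothetical stand-ins for a fetched page; a real page would come from file_get_contents() as above and would need its own XPath expression.

```php
<?php
// Hypothetical sample markup standing in for a fetched page.
$html = '<html><body><table id="prices">'
      . '<tr><th>Item</th><th>Price</th></tr>'
      . '<tr><td>Apple</td><td>1.20</td></tr>'
      . '<tr><td>Pear</td><td>0.80</td></tr>'
      . '</table></body></html>';

libxml_use_internal_errors(true);   // keep parser warnings out of the output
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();
$sxml = simplexml_import_dom($doc);

// Collect each row's cells (th or td) as an array of trimmed strings.
$rows = [];
foreach ($sxml->xpath('//table[@id="prices"]//tr') as $tr) {
    $cells = [];
    foreach ($tr->xpath('th|td') as $cell) {
        $cells[] = trim((string) $cell);
    }
    $rows[] = $cells;
}
print_r($rows);
```

Using `//tr` rather than a fixed child path keeps the query working whether or not the rows are wrapped in a tbody element.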
T.J. Crowder
phihag
  • Big +1. Added a link to `simplexml_import_dom` and a tiny bit of further explanation. – T.J. Crowder Jul 09 '11 at 16:03
  • Very nice trick. Unfortunately it looks like the DOM module isn't installed on the server I'm working on. Is it typically standard? – chris Jul 09 '11 at 16:18
  • @chris DOM and its dependency, libxml, are both compiled in by default. They can be explicitly left out at compile time or disabled at runtime, but that's highly unusual. – phihag Jul 09 '11 at 16:26
  • I'm getting "Fatal error: Class 'DOMDocument' not found in...". I'm assuming my school has a strange version of Linux running on the server for it to be missing. SimpleXML and libxml are available. I'll request they install it. Thx. – chris Jul 09 '11 at 16:36
7

If this is XHTML — yes, it's definitely possible. True XHTML is just XML in the end, so it can be parsed with an XML parser.

SimpleXML, however, only accepts well-formed XML. If you can't get valid XHTML, it looks like running the markup through the more lenient DOMDocument HTML parser first will do the trick (source here):

<?php
  $html = file_get_contents('http://...');
  $doc = new DOMDocument();
  $doc->strictErrorChecking = FALSE;
  $doc->loadHTML($html);
  $xml = simplexml_import_dom($doc);
?>
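To illustrate the difference, here is a small sketch under stated assumptions: the tag-soup string is a hypothetical input with unclosed p and br elements. Feeding it straight to SimpleXML should fail, while the DOMDocument route yields a usable SimpleXMLElement.

```php
<?php
// Hypothetical tag-soup input: unclosed <p> and <br>, so it is not well-formed XML.
$html = '<html><body><p>First<br>Second</body></html>';

libxml_use_internal_errors(true);   // collect parser errors instead of printing warnings

// SimpleXML's own parser requires well-formed XML and returns false on this input.
$strict = simplexml_load_string($html);
libxml_clear_errors();

// DOMDocument's HTML parser repairs the markup, and the result can be imported.
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();
$lenient = simplexml_import_dom($doc);

var_dump($strict === false);
var_dump($lenient instanceof SimpleXMLElement);
```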
Jon Gauthier
3

My version, tolerant of markup errors and encoding problems:

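// Collect libxml parse warnings internally instead of emitting PHP warnings.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->strictErrorChecking = FALSE;
// Convert non-ASCII characters to HTML entities so loadHTML does not misread UTF-8 input.
// Note: the 'HTML-ENTITIES' target is deprecated as of PHP 8.2.
$doc->loadHTML(mb_convert_encoding($this->html_content, 'HTML-ENTITIES', 'UTF-8'));
libxml_use_internal_errors(false);
$xml = simplexml_import_dom($doc);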
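If you want to see what the parser complained about rather than silently discarding it, the collected errors can be read back before they are cleared. This is only a sketch: the malformed input string is hypothetical, and the exact messages depend on the installed libxml version.

```php
<?php
// Hypothetical malformed input: a stray closing tag.
$html = '<html><body><p>Text</span></p></body></html>';

$previous = libxml_use_internal_errors(true);  // remember the earlier setting
$doc = new DOMDocument();
$doc->loadHTML($html);

// Each entry is a LibXMLError object with level, code, line and message fields.
$errors = libxml_get_errors();
foreach ($errors as $error) {
    printf("line %d: %s\n", $error->line, trim($error->message));
}

libxml_clear_errors();                     // free the error buffer
libxml_use_internal_errors($previous);     // restore the earlier setting
$xml = simplexml_import_dom($doc);
```

Restoring the previous setting, instead of hard-coding false, avoids changing error handling for other code that has already enabled internal errors.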
Maciej Niemir
0

It may depend on the page. If the page is XHTML (most web pages nowadays), then any XML parser should do; otherwise, look for an SGML parser. Here's a similar question you might be interested in: Error Tolerant HTML/XML/SGML parsing in PHP

Community
Piotr Turek
  • MOST web pages? Source for that data, please? Also, please dig around SO (or the internet in general) to find out why people usually don't serve XHTML correctly. – Mchl Jul 09 '11 at 16:03