Load and Parse a portion of External HTML

Question

I want to extract (parse) a portion of an HTML document from an external website using PHP.

For example, to extract news from Yahoo, I tried the Simple HTML DOM Parser from SourceForge:

<?php
$url="http://news.yahoo.com/einsteins-brain-now-interactive-ipad-app-071441969.html";
include('simple_html_dom.php');  
$html=new simple_html_dom();
$html->load_file($url);
$xxx=$html->find('title')->innertext; 
echo $xxx;
?>

Fatal error: Call to a member function find() on a non-object in /home/a1234bc/public_html/simple_html_dom.php on line 1113

Then I tried to echo the loaded HTML:

<?php
$url="http://news.yahoo.com/einsteins-brain-now-interactive-ipad-app-071441969.html";
include('simple_html_dom.php');  
$html=new simple_html_dom();
$html->load_file($url);
echo $html;
?>

Now I get:

Fatal error: Call to a member function innertext() on a non-object in /home/a1234bc/public_html/simple_html_dom.php on line 1688

I also tried DOMDocument, fetching the page with file_get_contents():

<?php
$url="http://news.yahoo.com/einsteins-brain-now-interactive-ipad-app-071441969.html";
$content = file_get_contents($url);
// echo $content works perfectly

$doc = new DOMDocument();
$doc->loadHTML($content);
$jjj=$doc->getElementsByTagName('title')->item(0);
echo $jjj;
?>

This throws a very long list of warnings, so I am pasting only the first 10:

Warning: DOMDocument::loadHTML() [domdocument.loadhtml]: htmlParseEntityRef: no name in Entity, line: 166 in /home/a1234bc/public_html/simple_html_dom.php on line 37

Warning: DOMDocument::loadHTML() [domdocument.loadhtml]: htmlParseEntityRef: expecting ';' in Entity, line: 166 in /home/a1234bc/public_html/simple_html_dom.php on line 37

Warning: DOMDocument::loadHTML() [domdocument.loadhtml]: htmlParseEntityRef: no name in Entity, line: 256 in /home/a1234bc/public_html/simple_html_dom.php on line 37

Warning: DOMDocument::loadHTML() [domdocument.loadhtml]: htmlParseEntityRef: expecting ';' in Entity, line: 256 in /home/a1234bc/public_html/simple_html_dom.php on line 37

Warning: DOMDocument::loadHTML() [domdocument.loadhtml]: Tag fb:login-button invalid in Entity, line: 256 in /home/a1234bc/public_html/simple_html_dom.php on line 37

Warning: DOMDocument::loadHTML() [domdocument.loadhtml]: htmlParseEntityRef: expecting ';' in Entity, line: 275 in /home/a1234bc/public_html/simple_html_dom.php on line 37

Warning: DOMDocument::loadHTML() [domdocument.loadhtml]: htmlParseEntityRef: expecting ';' in Entity, line: 287 in /home/a1234bc/public_html/simple_html_dom.php on line 37

Warning: DOMDocument::loadHTML() [domdocument.loadhtml]: htmlParseEntityRef: expecting ';' in Entity, line: 292 in /home/a1234bc/public_html/simple_html_dom.php on line 37

Warning: DOMDocument::loadHTML() [domdocument.loadhtml]: htmlParseEntityRef: expecting ';' in Entity, line: 311 in /home/a1234bc/public_html/simple_html_dom.php on line 37

Warning: DOMDocument::loadHTML() [domdocument.loadhtml]: Attribute class redefined in Entity, line: 325 in /home/a1234bc/public_html/simple_html_dom.php on line 37

Can someone please point me in the right direction?

score 0 · Answer 1 · answered Feb 14 '13 at 11:50

I got the same error when using the object-oriented way shown in the manual:

// Create a DOM object
$html = new simple_html_dom();

// Load HTML from a string
$html->load('<html><body>Hello!</body></html>');

// Load HTML from a URL 
$html->load_file('http://www.google.com/');

// Load HTML from a HTML file 
$html->load_file('test.htm');

I got rid of the error and got my script working when I switched to the quick way shown in the manual:

// Create a DOM object from a string
$html = str_get_html('<html><body>Hello!</body></html>');

// Create a DOM object from a URL
$html = file_get_html('http://www.google.com/');

// Create a DOM object from a HTML file
$html = file_get_html('test.htm');

After this, $html->find worked just fine!
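One more detail that may be worth guarding against: in Simple HTML DOM, find('title') without an index returns an array of matches, so reading ->innertext directly off it (as the question does) will not give the title text. Passing 0 as the second argument returns the first match or null. Here is a minimal sketch of that; first_inner_text() is a hypothetical helper name of my own, not something the library provides, and it assumes simple_html_dom.php is included as in the question.

```php
<?php
// Hypothetical helper (not part of simple_html_dom): returns the inner HTML
// of the first element matching $selector, or null if nothing usable is found.
function first_inner_text($html, $selector)
{
    // file_get_html() and str_get_html() return false when loading fails,
    // so guard before calling a method on the result.
    if (!is_object($html)) {
        return null;
    }

    // find($selector) returns an array of matches; passing an index as the
    // second argument returns a single element, or null when nothing matches.
    $node = $html->find($selector, 0);

    return $node === null ? null : trim($node->innertext);
}

// Intended use, assuming simple_html_dom.php is available as in the question:
//   include 'simple_html_dom.php';
//   $html = file_get_html($url);
//   echo first_inner_text($html, 'title');
```

The null checks are there so that a failed download or a missing tag gives you null instead of a fatal "member function on a non-object" error.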

The PHP Simple HTML DOM Parser manual can be found here: http://simplehtmldom.sourceforge.net/manual.htm

Hope this helps!

score -1 · Answer 2 · answered Sep 25 '12 at 11:31

DOMDocument/SimpleXML are designed for parsing XML, not HTML. You would need to use file_get_contents to get the HTML into a string and then use string manipulation functions to get the portion you need. preg_match_all would be a good place to start.

scottlimmer

OP uses "PHP Simple HTML DOM Parser" and links to it. And as you've probably seen in the FAQ, [You can't parse HTML with regex](http://stackoverflow.com/a/1732454/927408) – Jørgen R Feb 14 '13 at 11:55
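In the spirit of that comment, here is a rough sketch that stays with a parser instead of regex and uses DOMDocument::loadHTML() from the question. It makes two changes to the question's attempt: libxml_use_internal_errors(true) collects the markup warnings rather than printing them, and the title is read through textContent because a DOMElement cannot be echoed directly. extract_title() and the sample markup are my own illustrative assumptions, not code from the thread.

```php
<?php
// Hypothetical helper (not from the thread): returns the <title> text of an
// HTML string, or null when the document has no <title>.
function extract_title(string $content): ?string
{
    // Collect libxml parse warnings internally instead of emitting them.
    $previous = libxml_use_internal_errors(true);

    $doc = new DOMDocument();
    $loaded = $doc->loadHTML($content);

    // Discard the collected warnings and restore the previous setting.
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    if (!$loaded) {
        return null;
    }

    $node = $doc->getElementsByTagName('title')->item(0);

    // item(0) is null when no <title> exists; otherwise read its text.
    return $node === null ? null : trim($node->textContent);
}

// Deliberately sloppy markup, similar to what triggered the warnings above:
// a bare "&", an unknown tag, and a duplicated attribute.
$sample = '<html><head><title>Sample headline</title></head>'
        . '<body><p class="a" class="b">R&D <fb:login-button></fb:login-button></p>'
        . '</body></html>';

echo extract_title($sample), "\n";
```

With a live page you would pass the result of file_get_contents($url) instead of $sample, after checking that the download did not return false.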
