I need to read some content from an HTML page. I've tested simple_html_dom, but it simply isn't usable for what I need.

I need something like this (pseudo-syntax based on simple_html_dom):

$html = file_get_contents($url);
$html_obj = parse_html($html);

$title = $html_obj->get('title');
$meta1 = $html_obj->get('meta[name=description]', 'innertext'); // text only
$meta2 = $html_obj->get('meta[name=keywords]', 'innertext'); // text only
$content = $html_obj->get('div[id=section_a]', 'outertext'); // HTML code

I've tested simple_html_dom in so many ways, and only managed to get parts of what I need. It simply isn't "simple".

I've also tested PHP's DOMDocument::loadHTML, but I run into problems dealing with inline <script>.

Are there any PHP libraries that make it as easy to get content as in jQuery?

Update

One of my problems is a piece of third-party JavaScript from an ad agency:

    <script language="javascript" type="text/javascript">
      <!--
        if (window.adgroupid == undefined) {
          window.adgroupid = Math.round(Math.random()*100000);
        }
        document.write('<scr'+'ipt language="javascript1.1" type="text/javascript" src="http://adserver.adtech.de/addyn|3.0|994|3159100|0|-1|size=980x150|ADTECH;loc=100;target=_blank;key=startside,kvinner, kvinnesak, bryllup, graviditet, mamma, kosmetikk, markedsplass, dagbok, feminisme;grp='+window.adgroupid+';misc='+new Date().getTime()+'"></scri'+'pt>');
      //-->
      </script>

Even if I change <scr'+'ipt to <script, I get invalid JavaScript code.

Steven

2 Answers

You can use DOMDocument together with DOMXPath:

<?php
// Collect libxml parse warnings instead of printing them; real-world HTML is rarely valid.
libxml_use_internal_errors(true);

$DOMDocument = new DOMDocument();
$DOMDocument->loadHTMLFile('http://www.iconfinder.com');
libxml_clear_errors();

$XPath = new DOMXPath($DOMDocument);

$title = $DOMDocument->getElementsByTagName('title')->item(0)->nodeValue;
echo $title;

// Uncomment for a page that has these elements. Attribute values in XPath
// predicates must be quoted, and item(0) is null when nothing matches.
#$desc = $XPath->query('//meta[@name="description"]')->item(0)->getAttribute('content');
#$keywords = $XPath->query('//meta[@name="keywords"]')->item(0)->getAttribute('content');
#$content = $XPath->query('//div[@id="section_a"]')->item(0)->nodeValue;
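
For the three values asked about in the question, the same approach can be sketched against an in-memory string. This is only a minimal sketch under assumptions: the sample HTML, the example.com script URL, and the helper name firstAttr are made up for illustration, and the typed signature needs PHP 7.1 or later. It suppresses libxml warnings, reads the title and meta tags, and uses saveHTML() with a node argument to get the div's outer HTML.

```php
<?php
// Hypothetical sample page; the inline script imitates the ad snippet from the question.
$html = <<<'HTML'
<!DOCTYPE html>
<html>
<head>
  <title>Example page</title>
  <meta name="description" content="A short description">
  <meta name="keywords" content="one, two, three">
</head>
<body>
  <script type="text/javascript">
  <!--
    document.write('<scr'+'ipt type="text/javascript" src="http://example.com/ad.js"></scri'+'pt>');
  //-->
  </script>
  <div id="section_a"><p>Hello <b>world</b></p></div>
</body>
</html>
HTML;

// Keep parse warnings (for example about the script contents) out of the output.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// Hypothetical helper: attribute of the first node matching $query, or null if none matches.
function firstAttr(DOMXPath $xpath, string $query, string $attr): ?string
{
    $node = $xpath->query($query)->item(0);
    return $node instanceof DOMElement ? $node->getAttribute($attr) : null;
}

// string() yields the text of the first matching node, or '' when there is none.
$title = $xpath->evaluate('string(//title)');
$description = firstAttr($xpath, '//meta[@name="description"]', 'content');
$keywords = firstAttr($xpath, '//meta[@name="keywords"]', 'content');

// Passing a node to saveHTML() serializes that element including its own tag.
$section = $xpath->query('//div[@id="section_a"]')->item(0);
$content = $section ? $doc->saveHTML($section) : null;
```

How the parser treats the text inside the script element may differ between libxml2 versions, so the sketch only relies on the elements around it.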
  • See the updated question. I'm not able to load the HTML code because of the inline JavaScript. Do you know how I can fix this? – Steven Dec 03 '11 at 09:08
phpQuery (http://code.google.com/p/phpquery/) lets you query and manipulate HTML through a jQuery-like syntax.

Peter Horne