I have code that reads an HTML file from my local web server (localhost) and converts it to XHTML with tidy. I then load that XHTML into a DOMDocument. The code looks like this:

<?php


function getXHTML($html)
{
    $options = array("output-html" => true,"quote-nbsp" => true, "drop-proprietary-attributes" => true,"drop-font-tags" => true,"drop-empty-paras" => true,"hide-comments" => true);
    $tidy=new tidy();
    $xhtml=$tidy->repairString($html,$options);
    echo $xhtml;
    return $xhtml;
}
$content = file_get_contents("http://localhost/filename.htm");
$page = new DOMDocument();
$xpath=new DOMXPath($page);
$content = getXHTML($content);   // this is a tidy function to return XHTML
$page->loadHTML($content);   
$totalPath = "//body/table[3]/tbody/tr[1]/td[4]";
$total = $xpath->query($totalPath);
echo $total->length;    // this shows zero
?> 

The contents of filename.htm look like this:

<!-- saved from url=(0041)http://www.rtu.ac.in/results/reformat.php -->
<html><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<link rel="SHORTCUT ICON" href="http://www.rtu.ac.in/favicon.ico">
<link href="./Result - Rajasthan Technical University6_files/styleresults.css" rel="stylesheet" type="text/css">
<title>Result - Rajasthan Technical University</title>
</head>
<body>


<table width="773" cellpadding="5" cellspacing="0" align="center">
  <tbody><tr height="60">
    <td width="16%" height="60" valign="top"><font color="brown" size="+2"><img src="./Result - Rajasthan Technical University6_files/logo.jpg" width="100" height="102" border="0" align="right">&nbsp;</font></td>
    <td width="72%" height="60" align="center" valign="top"><p><font color="brown" size="+2"><strong>RAJASTHAN TECHNICAL UNIVERSITY </strong></font></p><font color="brown" size="+2">
      <p><font size="+1"><strong>B.Tech -IVth SEMESTER -2010(Main) 16.5.2011</strong></font></p><font size="+1">&nbsp;</font></font></td>      
    <td width="12%" height="80"><strong>www.rtu.ac.in</strong>&nbsp;</td>
  </tr>
</tbody></table>



<br>
<br>
<table width="783" align="center" cellpadding="5" cellspacing="0" class="table"> 
  <tbody>
    <tr>
      <td width="34%" align="center" valign="top" rowspan="2"><strong>Subject(s) Name </strong>&nbsp;</td>
      <td width="10%" align="center" valign="top" colspan="1" rowspan="2"> <strong>Subject(s) Code </strong>&nbsp;</td>

      <td align="center" valign="top" colspan="3" rowspan="1"><strong>Marks Obtained </strong>&nbsp;</td>
    </tr>


    <tr>
      <td width="20%" align="center"><strong>Internal</strong>&nbsp;</td>
      <td width="18%" align="center"><strong>Theory</strong>&nbsp;</td>
      <td width="18%" align="center">&nbsp;</td>
    </tr>




        <tr>
          <td width="34%" align="center" style=" border-bottom: 0px none transparent;"><strong>SUBJECT-1</strong>&nbsp;</td>
          <td width="10%" align="center" style=" border-bottom: 0px none transparent;">4551</td>

      <td width="20%" align="center" style=" border-bottom: 0px none transparent;"> 16</td>
      <td width="18%" align="center" style=" border-bottom: 0px none transparent;"> 50</td>
      <td width="18%" align="center" style=" border-bottom: 0px none transparent;">&nbsp;</td>
      </tr>

        <tr>
          <td width="34%" align="center" style=" border-bottom: 0px none transparent;"><strong>SUBJECT-2</strong>&nbsp;</td>
          <td width="10%" align="center" style=" border-bottom: 0px none transparent;">&nbsp;4552</td>

      <td width="20%" align="center" style=" border-bottom: 0px none transparent;"> 17</td>
      <td width="18%" align="center" style=" border-bottom: 0px none transparent;"> 61</td>
      <td width="18%" align="center" style=" border-bottom: 0px none transparent;">&nbsp;</td>
      </tr>

        <tr>
          <td width="34%" align="center" style=" border-bottom: 0px none transparent;"><strong>SUBJECT-3</strong>&nbsp;</td>
          <td width="10%" align="center" style=" border-bottom: 0px none transparent;">4553</td>

      <td width="20%" align="center" style=" border-bottom: 0px none transparent;"> 19</td>
      <td width="18%" align="center" style=" border-bottom: 0px none transparent;"> 49</td>
      <td width="18%" align="center" style=" border-bottom: 0px none transparent;">&nbsp;</td>
      </tr>
        <tr>
          <td align="center" style=" border-bottom: 0px none transparent;"><strong>SUBJECT-4</strong>&nbsp;</td>
          <td align="center" style=" border-bottom: 0px none transparent;">4554</td>
          <td align="center" style=" border-bottom: 0px none transparent;"> 14</td>
          <td align="center" style=" border-bottom: 0px none transparent;"> 68</td>
          <td align="center" style=" border-bottom: 0px none transparent;">&nbsp;</td>
        </tr>
        <tr>
          <td align="center" style=" border-bottom: 0px none transparent;"><strong>SUBJECT-5</strong>&nbsp;</td>
          <td align="center" style=" border-bottom: 0px none transparent;">4555</td>
          <td align="center" style=" border-bottom: 0px none transparent;"> 14</td>
          <td align="center" style=" border-bottom: 0px none transparent;"> 36</td>
          <td align="center" style=" border-bottom: 0px none transparent;">&nbsp;</td>
        </tr>

        <tr>
          <td width="34%" align="center" style=" border-bottom: 0px none transparent;"><strong>SUBJECT-6</strong>&nbsp;</td>
          <td width="10%" align="center" style=" border-bottom: 0px none transparent;">4556</td>

      <td width="20%" align="center" style=" border-bottom: 0px none transparent;"> 19</td>
      <td width="18%" align="center" style=" border-bottom: 0px none transparent;"> 48</td>
      <td width="18%" align="center" style=" border-bottom: 0px none transparent;">&nbsp;</td>
      </tr><tr>
          <td align="center" style=" border-bottom: 0px none transparent;">&nbsp;&nbsp;</td>
          <td align="center" style=" border-bottom: 0px none transparent;">&nbsp;</td>
          <td align="center" style=" border-bottom: 0px none transparent;">&nbsp;&nbsp;</td>
          <td align="center" style=" border-bottom: 0px none transparent;">&nbsp;<strong>Internal</strong>&nbsp;</td>
          <td width="18%" align="center" style=" border-bottom: 0px none transparent;"><strong>Practical</strong>&nbsp;</td>
        </tr>

        <tr>
          <td width="34%" align="center" style=" border-bottom: 0px none transparent;"><strong>PSUBJECT-1</strong>&nbsp;</td>
          <td width="10%" align="center" style=" border-bottom: 0px none transparent;">4174</td>

      <td width="20%" align="center" style=" border-bottom: 0px none transparent;">&nbsp;</td>
      <td width="18%" align="center" style=" border-bottom: 0px none transparent;"> 29</td>
      <td width="18%" align="center" style=" border-bottom: 0px none transparent;">48</td>
      </tr>




        <tr>
          <td width="34%" align="center" style=" border-bottom: 0px none transparent;"><strong>PSUBJECT-2</strong>&nbsp;</td>
          <td width="10%" align="center" style=" border-bottom: 0px none transparent;">4175</td>

      <td width="20%" align="center" style=" border-bottom: 0px none transparent;">&nbsp;</td>
      <td width="18%" align="center" style=" border-bottom: 0px none transparent;"> 16</td>
      <td width="18%" align="center" style=" border-bottom: 0px none transparent;">26</td>
      </tr>

      <tr>
          <td width="34%" align="center" style=" border-bottom: 0px none transparent;"><strong>PSUBJECT-3</strong>&nbsp;</td>
          <td width="10%" align="center" style=" border-bottom: 0px none transparent;">4171</td>

      <td width="20%" align="center" style=" border-bottom: 0px none transparent;">&nbsp;</td>
      <td width="18%" align="center" style=" border-bottom: 0px none transparent;"> 15</td>
      <td width="18%" align="center" style=" border-bottom: 0px none transparent;">27</td>
      </tr>
      <tr>
        <td align="center" style=" border-bottom: 0px none transparent;"><strong>PSUBJECT-4</strong>&nbsp;</td>
        <td align="center" style=" border-bottom: 0px none transparent;">4172</td>
        <td align="center" style=" border-bottom: 0px none transparent;">&nbsp;</td>
        <td align="center" style=" border-bottom: 0px none transparent;"> 17</td>
        <td align="center" style=" border-bottom: 0px none transparent;">29</td>
        </tr>
      <tr>
        <td align="center" style=" border-bottom: 0px none transparent;"><strong>PSUBJECT-5</strong>&nbsp;</td>
        <td align="center" style=" border-bottom: 0px none transparent;">4173</td>
        <td align="center" style=" border-bottom: 0px none transparent;">&nbsp;</td>
        <td align="center" style=" border-bottom: 0px none transparent;"> 29</td>
        <td align="center" style=" border-bottom: 0px none transparent;">46</td>
        </tr>




        <tr>
          <td width="34%" align="center" style=" border-bottom: 0px none transparent;"><strong>Disipline (Deca)</strong>&nbsp;</td>
          <td width="10%" align="center" style=" border-bottom: 0px none transparent;">4176</td>

      <td width="20%" align="center" style=" border-bottom: 0px none transparent;">&nbsp;</td>
      <td width="18%" align="center" style=" border-bottom: 0px none transparent;">&nbsp;</td>
      <td width="18%" align="center" style=" border-bottom: 0px none transparent;">46</td>
      </tr>
  <tr><td>&nbsp;</td><td>&nbsp;</td><td>&nbsp;</td><td>&nbsp;</td><td>&nbsp;</td></tr></tbody>
</table>

<br><table width="783" align="center" cellpadding="5" cellspacing="0" class="table">
  <tbody><tr>

    <td width="18%" align="center" valign="top"><strong>Practical Marks   </strong>&nbsp;</td>
    <td width="18%" align="center" valign="top">328</td>
    <td width="19%" align="center" valign="top"><strong>Theory Marks </strong>&nbsp;</td>
    <td width="19%" align="center" valign="top">411</td>
  </tr>

  <tr>
    <td width="18%" align="center"><strong>Institute Code   </strong>&nbsp;</td>
    <td width="18%" align="center"> 1229 </td>
    <td width="19%" align="center"><strong>DECCA </strong>&nbsp;</td>
    <td width="19%" align="center">4176</td>
  </tr>

  <tr>

    <td width="18%" align="center"><strong>Division   </strong>&nbsp;</td>
    <td width="18%" align="center"> PASS </td>
    <td width="19%" align="center"><strong>Grand Total </strong>&nbsp;</td>
    <td width="19%" align="center">739</td>
  </tr>
  </tbody></table>


&nbsp;&nbsp; 
<!-- Reformatter by Shashank Kumar Jain (CS, IIIrd Year, 2010-11) -->


<div id="csscan-wrapper" style="display: none; "><h2 id="csscan-header">element</h2><table id="csscan-table"><tbody><tr><th colspan="2" id="csscan-header-font" class="csscan-header">Font</th></tr><tr id="csscan-row-font-family"><td id="csscan-property-font-family" class="csscan-property">font-family</td><td id="csscan-value-font-family" class="csscan-value"></td></tr><tr id="csscan-row-font-size"><td id="csscan-property-font-size" class="csscan-property">font-size</td><td id="csscan-value-font-size" class="csscan-value"></td></tr><tr id="csscan-row-font-style"><td id="csscan-property-font-style" class="csscan-property">font-style</td><td id="csscan-value-font-style" class="csscan-value"></td></tr><tr id="csscan-row-font-variant"><td id="csscan-property-font-variant" class="csscan-property">font-variant</td><td id="csscan-value-font-variant" class="csscan-value"></td></tr><tr id="csscan-row-font-weight"><td id="csscan-property-font-weight" class="csscan-property">font-weight</td><td id="csscan-value-font-weight" class="csscan-value"></td></tr><tr id="csscan-row-letter-spacing"><td id="csscan-property-letter-spacing" class="csscan-property">letter-spacing</td><td id="csscan-value-letter-spacing" class="csscan-value"></td></tr><tr id="csscan-row-line-height"><td id="csscan-property-line-height" class="csscan-property">line-height</td><td id="csscan-value-line-height" class="csscan-value"></td></tr><tr id="csscan-row-text-decoration"><td id="csscan-property-text-decoration" class="csscan-property">text-decoration</td><td id="csscan-value-text-decoration" class="csscan-value"></td></tr><tr id="csscan-row-text-align"><td id="csscan-property-text-align" class="csscan-property">text-align</td><td id="csscan-value-text-align" class="csscan-value"></td></tr><tr id="csscan-row-text-indent"><td id="csscan-property-text-indent" class="csscan-property">text-indent</td><td id="csscan-value-text-indent" class="csscan-value"></td></tr><tr id="csscan-row-text-transform"><td 
id="csscan-property-text-transform" class="csscan-property">text-transform</td><td id="csscan-value-text-transform" class="csscan-value"></td></tr><tr id="csscan-row-white-space"><td id="csscan-property-white-space" class="csscan-property">white-space</td><td id="csscan-value-white-space" class="csscan-value"></td></tr><tr id="csscan-row-word-spacing"><td id="csscan-property-word-spacing" class="csscan-property">word-spacing</td><td id="csscan-value-word-spacing" class="csscan-value"></td></tr><tr id="csscan-row-color"><td id="csscan-property-color" class="csscan-property">color</td><td id="csscan-value-color" class="csscan-value"></td></tr><tr><th colspan="2" id="csscan-header-background" class="csscan-header">Background</th></tr><tr id="csscan-row-background-attachment"><td id="csscan-property-background-attachment" class="csscan-property">bg-attachment</td><td id="csscan-value-background-attachment" class="csscan-value"></td></tr><tr id="csscan-row-background-color"><td id="csscan-property-background-color" class="csscan-property">bg-color</td><td id="csscan-value-background-color" class="csscan-value"></td></tr><tr id="csscan-row-background-image"><td id="csscan-property-background-image" class="csscan-property">bg-image</td><td id="csscan-value-background-image" class="csscan-value"></td></tr><tr id="csscan-row-background-position"><td id="csscan-property-background-position" class="csscan-property">bg-position</td><td id="csscan-value-background-position" class="csscan-value"></td></tr><tr id="csscan-row-background-repeat"><td id="csscan-property-background-repeat" class="csscan-property">bg-repeat</td><td id="csscan-value-background-repeat" class="csscan-value"></td></tr><tr><th colspan="2" id="csscan-header-size" class="csscan-header">Box</th></tr><tr id="csscan-row-width"><td id="csscan-property-width" class="csscan-property">width</td><td id="csscan-value-width" class="csscan-value"></td></tr><tr id="csscan-row-height"><td id="csscan-property-height" 
class="csscan-property">height</td><td id="csscan-value-height" class="csscan-value"></td></tr><tr id="csscan-row-border-top"><td id="csscan-property-border-top" class="csscan-property">border-top</td><td id="csscan-value-border-top" class="csscan-value"></td></tr><tr id="csscan-row-border-right"><td id="csscan-property-border-right" class="csscan-property">border-right</td><td id="csscan-value-border-right" class="csscan-value"></td></tr><tr id="csscan-row-border-bottom"><td id="csscan-property-border-bottom" class="csscan-property">border-bottom</td><td id="csscan-value-border-bottom" class="csscan-value"></td></tr><tr id="csscan-row-border-left"><td id="csscan-property-border-left" class="csscan-property">border-left</td><td id="csscan-value-border-left" class="csscan-value"></td></tr><tr id="csscan-row-margin"><td id="csscan-property-margin" class="csscan-property">margin</td><td id="csscan-value-margin" class="csscan-value"></td></tr><tr id="csscan-row-padding"><td id="csscan-property-padding" class="csscan-property">padding</td><td id="csscan-value-padding" class="csscan-value"></td></tr><tr id="csscan-row-max-height"><td id="csscan-property-max-height" class="csscan-property">max-height</td><td id="csscan-value-max-height" class="csscan-value"></td></tr><tr id="csscan-row-min-height"><td id="csscan-property-min-height" class="csscan-property">min-height</td><td id="csscan-value-min-height" class="csscan-value"></td></tr><tr id="csscan-row-max-width"><td id="csscan-property-max-width" class="csscan-property">max-width</td><td id="csscan-value-max-width" class="csscan-value"></td></tr><tr id="csscan-row-min-width"><td id="csscan-property-min-width" class="csscan-property">min-width</td><td id="csscan-value-min-width" class="csscan-value"></td></tr><tr id="csscan-row-outline-color"><td id="csscan-property-outline-color" class="csscan-property">outline-color</td><td id="csscan-value-outline-color" class="csscan-value"></td></tr><tr 
id="csscan-row-outline-style"><td id="csscan-property-outline-style" class="csscan-property">outline-style</td><td id="csscan-value-outline-style" class="csscan-value"></td></tr><tr id="csscan-row-outline-width"><td id="csscan-property-outline-width" class="csscan-property">outline-width</td><td id="csscan-value-outline-width" class="csscan-value"></td></tr><tr><th colspan="2" id="csscan-header-position" class="csscan-header">Positioning</th></tr><tr id="csscan-row-position"><td id="csscan-property-position" class="csscan-property">position</td><td id="csscan-value-position" class="csscan-value"></td></tr><tr id="csscan-row-top"><td id="csscan-property-top" class="csscan-property">top</td><td id="csscan-value-top" class="csscan-value"></td></tr><tr id="csscan-row-bottom"><td id="csscan-property-bottom" class="csscan-property">bottom</td><td id="csscan-value-bottom" class="csscan-value"></td></tr><tr id="csscan-row-right"><td id="csscan-property-right" class="csscan-property">right</td><td id="csscan-value-right" class="csscan-value"></td></tr><tr id="csscan-row-left"><td id="csscan-property-left" class="csscan-property">left</td><td id="csscan-value-left" class="csscan-value"></td></tr><tr id="csscan-row-float"><td id="csscan-property-float" class="csscan-property">float</td><td id="csscan-value-float" class="csscan-value"></td></tr><tr id="csscan-row-display"><td id="csscan-property-display" class="csscan-property">display</td><td id="csscan-value-display" class="csscan-value"></td></tr><tr id="csscan-row-clear"><td id="csscan-property-clear" class="csscan-property">clear</td><td id="csscan-value-clear" class="csscan-value"></td></tr><tr id="csscan-row-z-index"><td id="csscan-property-z-index" class="csscan-property">z-index</td><td id="csscan-value-z-index" class="csscan-value"></td></tr><tr><th colspan="2" id="csscan-header-list" class="csscan-header">List</th></tr><tr id="csscan-row-list-style-image"><td id="csscan-property-list-style-image" 
class="csscan-property">list-style-image</td><td id="csscan-value-list-style-image" class="csscan-value"></td></tr><tr id="csscan-row-list-style-type"><td id="csscan-property-list-style-type" class="csscan-property">list-style-type</td><td id="csscan-value-list-style-type" class="csscan-value"></td></tr><tr id="csscan-row-list-style-position"><td id="csscan-property-list-style-position" class="csscan-property">list-style-position</td><td id="csscan-value-list-style-position" class="csscan-value"></td></tr><tr><th colspan="2" id="csscan-header-table" class="csscan-header">Table</th></tr><tr id="csscan-row-vertical-align"><td id="csscan-property-vertical-align" class="csscan-property">vertical-align</td><td id="csscan-value-vertical-align" class="csscan-value"></td></tr><tr id="csscan-row-border-collapse"><td id="csscan-property-border-collapse" class="csscan-property">border-collapse</td><td id="csscan-value-border-collapse" class="csscan-value"></td></tr><tr id="csscan-row-border-spacing"><td id="csscan-property-border-spacing" class="csscan-property">border-spacing</td><td id="csscan-value-border-spacing" class="csscan-value"></td></tr><tr id="csscan-row-caption-side"><td id="csscan-property-caption-side" class="csscan-property">caption-side</td><td id="csscan-value-caption-side" class="csscan-value"></td></tr><tr id="csscan-row-empty-cells"><td id="csscan-property-empty-cells" class="csscan-property">empty-cells</td><td id="csscan-value-empty-cells" class="csscan-value"></td></tr><tr id="csscan-row-table-layout"><td id="csscan-property-table-layout" class="csscan-property">table-layout</td><td id="csscan-value-table-layout" class="csscan-value"></td></tr><tr><th colspan="2" id="csscan-header-effects" class="csscan-header">Effects</th></tr><tr id="csscan-row-text-shadow"><td id="csscan-property-text-shadow" class="csscan-property">text-shadow</td><td id="csscan-value-text-shadow" class="csscan-value"></td></tr><tr id="csscan-row--webkit-box-shadow"><td 
id="csscan-property--webkit-box-shadow" class="csscan-property">-webkit-box-shadow</td><td id="csscan-value--webkit-box-shadow" class="csscan-value"></td></tr><tr id="csscan-row-border-radius"><td id="csscan-property-border-radius" class="csscan-property">border-radius</td><td id="csscan-value-border-radius" class="csscan-value"></td></tr><tr><th colspan="2" id="csscan-header-other" class="csscan-header">Other</th></tr><tr id="csscan-row-overflow"><td id="csscan-property-overflow" class="csscan-property">overflow</td><td id="csscan-value-overflow" class="csscan-value"></td></tr><tr id="csscan-row-cursor"><td id="csscan-property-cursor" class="csscan-property">cursor</td><td id="csscan-value-cursor" class="csscan-value"></td></tr><tr id="csscan-row-visibility"><td id="csscan-property-visibility" class="csscan-property">visibility</td><td id="csscan-value-visibility" class="csscan-value"></td></tr></tbody></table></div></body></html>

The XPath above is correct; I have checked it with FirePath. Can anyone tell me what I am doing wrong?

lovesh
  • It's because it's not well-formed XML. `@$page->loadXML($content);` will suppress the warnings – Lawrence Cherone Jun 03 '11 at 01:02
  • @Lawrence Cherone: but how do I make it well-formed? I am using tidy, and prepending `@` doesn't show the content, it only suppresses the warnings. I have used options of `loadXML` like `LIBXML_NOENT` to substitute entities, but it doesn't work. Help? – lovesh Jun 03 '11 at 01:07
  • x/html will never be XML as it's HTML & not specifically formed; use Ivan's answer below to load HTML – Lawrence Cherone Jun 03 '11 at 01:10
  • @Lawrence Cherone: I can't use `loadHTML` for reasons I have mentioned below. This post is a step towards solving [this problem](http://stackoverflow.com/questions/6168558/unable-to-scrape-content-from-a-website) – lovesh Jun 03 '11 at 01:13
  • You have already been advised of the best approach, then. Asking again will yield the same answers – Lawrence Cherone Jun 03 '11 at 01:17
  • @Lawrence Cherone: sorry, I didn't get you. Which approach are you talking about? – lovesh Jun 03 '11 at 01:18
  • @Lawrence Cherone: I have mentioned I can't use `loadHTML` because I have to use `XPATH`, and it doesn't work with `loadHTML`, only with `loadXML` – lovesh Jun 03 '11 at 01:23
  • @Lawrence Cherone: have a look at [this](http://www.php.net/manual/en/domxpath.registernamespace.php#83712) – lovesh Jun 03 '11 at 01:25
  • @lovesh XPath works with the DOM tree no matter how you load it (`loadXML` or `loadHTML`). See my update. Your problem must be somewhere else, maybe incorrect namespace handling (just guessing). – Ivan Nevostruev Jun 03 '11 at 17:40
  • @Ivan Nevostruev: you are right, XPath works with `loadHTML`. I recently tried using that with Wikipedia and some other sites and it worked. – lovesh Jun 04 '11 at 02:38

3 Answers


Try using loadHTML($string) instead of loadXML. From the manual:

The function parses the HTML contained in the string source. Unlike loading XML, HTML does not have to be well-formed to load.

Update 1

loadHTML builds the same kind of DOM tree in memory as loadXML does; it just uses a less strict parser. Here is example code with XPath:

<?php
$content = file_get_contents("1.html");
$page = new DOMDocument();
$page->loadHTML($content);   // tolerates most formatting errors in the markup
echo $page->saveHTML();
echo "=====\n";
$xpath = new DOMXPath($page); // XPath works on the tree no matter how it was loaded
foreach ($xpath->query("//li[not(@id='3')]") as $elem) {
        echo "[".trim($elem->textContent)."]\n";
}

Content of 1.html file is:

<li id="1">item 1
<li id="2">item 2
<li id="3">item 3
<li id="4">item 4

Output will be:

<!DOCTYPE html PUBLIC "...">
<html><body>
<li id="1">item 1
</li>
<li id="2">item 2
</li>
<li id="3">item 3
</li>
<li id="4">item 4
</li>
</body></html>
=====
[item 1]
[item 2]
[item 4]

Update 2

You just forgot to initialize the $xpath variable. I've also removed the getXHTML call, because it isn't necessary:

$content = file_get_contents("2.html");
$page = new DOMDocument();
//$content=getXHTML($content); // not needed if you're using loadHTML
$page->loadHTML($content);
$totalPath = "//body/table[3]/tbody/tr[1]/td[4]";
$xpath = new DOMXPath($page); // creating $xpath object
$total = $xpath->query($totalPath);
echo "[",$total->length,"]";
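
If a long expression still matches nothing, one way to narrow it down is to test progressively longer prefixes of the path and see where the count drops to zero. The following is only a sketch: `countMatches` is a hypothetical helper, and the inline HTML string is a stand-in for the real file.

```php
<?php
// Hypothetical helper (not part of the original code): counts the matches for
// each XPath expression so you can see where a long path stops matching.
function countMatches(DOMDocument $doc, array $expressions): array
{
    $xpath = new DOMXPath($doc);   // created after the document is loaded
    $counts = [];
    foreach ($expressions as $expression) {
        $nodes = $xpath->query($expression);
        // query() returns false for a malformed expression
        $counts[$expression] = ($nodes === false) ? null : $nodes->length;
    }
    return $counts;
}

// Stand-in document; replace with the HTML you are actually parsing.
$doc = new DOMDocument();
$doc->loadHTML('<html><body><table><tbody><tr><td>a</td><td>b</td></tr></tbody></table></body></html>');

$steps = [
    '//body',
    '//body/table[1]',
    '//body/table[1]/tbody',
    '//body/table[1]/tbody/tr[1]/td[2]',
    '//body/table[3]',   // this document has only one table
];
foreach (countMatches($doc, $steps) as $expression => $count) {
    echo $expression, ' => ', $count, "\n";
}
```

The first expression that reports zero matches is the step to look at.
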
Ivan Nevostruev
  • @Ivan Nevostruev: I know this works, but my motive in using `loadXML` is not to display `XML` but to use `XPATH`, and `XPATH` doesn't work with `loadHTML`. Have a look [here](http://www.php.net/manual/en/domxpath.registernamespace.php#83712) – lovesh Jun 03 '11 at 01:10
  • @lovesh It looks like XPath is working with `loadHTML` just fine. See my updated answer. If you still have problems with it, then please provide the content of the HTML file you're parsing and the XPath expressions. – Ivan Nevostruev Jun 03 '11 at 17:36
  • loadhtml works with xpath just fine. I've used it. The comment in question claims that loadhtml has problems with **XHTML**, which is a different beast entirely. – Frank Farmer Jun 03 '11 at 17:45
  • @Ivan Nevostruev: I have provided the content of the HTML file that I am parsing, modified the code to use loadHTML, and added the XPath I am using – lovesh Jun 04 '11 at 02:56
  • @lovesh In the code you've provided there is no initialization for the `$xpath` variable. But it works fine after I've added it. See update 2. – Ivan Nevostruev Jun 06 '11 at 18:49
  • @Ivan Nevostruev: I just forgot to initialize `$xpath` in this post, but in my code it was there. The initialization doesn't help. – lovesh Jun 08 '11 at 09:06
  • @Ivan Nevostruev: still the same result. It's showing the HTML, but `$total->length` still evaluates to zero. Can you please check the HTML? I think the `xpath` is not able to work on this HTML; maybe it's malformed. – lovesh Jun 08 '11 at 18:03
  • @lovesh Did you remove the `getXHTML` call? It can change the DOM tree structure, making the original XPath expression match nothing. Also try starting with the shortest XPath and adding the next steps one by one, like this: `//body`, `//body/table[3]`, `//body/table[3]/tbody`,... – Ivan Nevostruev Jun 08 '11 at 18:53

How much have you played with the PHP Tidy options? If the error you get refers to entities (specifically &nbsp;) I wonder if setting numeric-entities "on" or playing with the value for preserve-entities would help.

Plan B: try this. XPath works even with poorly formed HTML files.

<?php 

$oldSetting = libxml_use_internal_errors( true ); 
libxml_clear_errors(); 

$html = new DOMDocument(); 
$html->loadHTMLFile('myHtmlFile.html'); 

$xpath = new DOMXPath( $html ); 
$test = $xpath->query( "//div[@id='mydiv']" ); 

// item(0) is null when nothing matched, so check before using it
$div = $test->item(0);
if ($div !== null) {
    echo $div->getAttribute('style');
}

libxml_clear_errors(); 
libxml_use_internal_errors( $oldSetting ); 
?>
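
If you would rather see what the parser complained about than discard it, the collected messages can be read with `libxml_get_errors()` before they are cleared. This is a sketch under assumptions: `loadHtmlCollectingErrors` is a hypothetical helper and the markup string is made up for illustration; which messages (if any) are reported depends on the libxml version.

```php
<?php
// Hypothetical helper: loads HTML and returns the document together with
// whatever messages libxml collected while parsing.
function loadHtmlCollectingErrors(string $markup): array
{
    $previous = libxml_use_internal_errors(true);
    libxml_clear_errors();

    $doc = new DOMDocument();
    $doc->loadHTML($markup);

    $messages = [];
    foreach (libxml_get_errors() as $error) {
        $messages[] = sprintf('line %d: %s', $error->line, trim($error->message));
    }

    // Restore the previous error-handling mode
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return [$doc, $messages];
}

// Made-up markup with an unclosed <p>, standing in for a poorly formed file.
[$doc, $messages] = loadHtmlCollectingErrors(
    '<html><body><div id="mydiv" style="color: red"><p>unclosed</div></body></html>'
);

$xpath = new DOMXPath($doc);
$div = $xpath->query("//div[@id='mydiv']")->item(0);
echo $div !== null ? $div->getAttribute('style') : 'no match', "\n";
echo count($messages), " parser message(s)\n";
```
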
James
  • @James: I am new to PHP's `tidy` options, and the option `LIBXML_NOENT` that I have used was taken from the link you have provided. I did what I could understand at that time. Can you help? – lovesh Jun 03 '11 at 01:37
  • Why do you need XML (or XHTML)? Just for XPath? – James Jun 03 '11 at 01:55
  • @James: because `XPATH` doesn't work with `loadHTML()`, so I need to use `loadXML`, and for that I need `XML`. [Have a look at this from php.net](http://www.php.net/manual/en/domxpath.registernamespace.php#83712) – lovesh Jun 03 '11 at 02:05
  • @James: you can also have a look at [this](http://stackoverflow.com/questions/6168558/unable-to-scrape-content-from-a-website) to know why exactly I am doing all this – lovesh Jun 03 '11 at 02:07
  • That comment from php.net says that XHTML source files don't load well using loadHTML. Is that what you have, XHTML source? I thought you had HTML. If you have XHTML source, you don't need to tidy it and should use loadXML. – James Jun 03 '11 at 02:23
  • @James: I had HTML, but I converted it to XHTML with tidy – lovesh Jun 03 '11 at 02:51
  • @James: I tried your suggestion, replacing loadXML with loadHTML, and it showed the output without any errors, but the XPath query still doesn't work – lovesh Jun 03 '11 at 02:57

The answer to the above question is somewhat tricky. My original code looked something like this:

$xpath=new DOMXPath($page);
..
...
...
$page->loadHTML($content);
..
...
$totalPath = "//body/table[3]/tbody/tr[1]/td[4]";
$total = $xpath->query($totalPath);
...
...

What happens above is that $xpath is created while the document is still empty, because the HTML has not yet been loaded into the DOM. So whenever $xpath ran a query, it ran against an empty document. I then swapped the order of the two statements:

...
...
$page->loadHTML($content);
$xpath=new DOMXPath($page);
...
...
$totalPath = "//body/table[3]/tbody/tr[1]/td[4]";
$total = $xpath->query($totalPath);

Now it works because $xpath is created on a non-empty document.
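
The working order can be sketched as a self-contained script. The HTML string and the `queryCells` helper are illustrative assumptions, not the original file or code; the only point is that `DOMXPath` is constructed after `loadHTML()` has run.

```php
<?php
// Hypothetical helper: loads the markup first, then creates the DOMXPath
// object on the populated document and runs the query.
function queryCells(string $markup, string $expression): DOMNodeList
{
    $page = new DOMDocument();
    $page->loadHTML($markup);       // 1. load the document

    $xpath = new DOMXPath($page);   // 2. only now create the XPath object
    return $xpath->query($expression);
}

// Stand-in markup with an explicit <tbody>, as in the saved page.
$markup = '<html><body><table><tbody><tr>'
        . '<td>Division</td><td>PASS</td><td>Grand Total</td><td>739</td>'
        . '</tr></tbody></table></body></html>';

$total = queryCells($markup, '//body/table[1]/tbody/tr[1]/td[4]');
echo $total->length, "\n";
echo trim($total->item(0)->textContent), "\n";
```
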

lovesh