I am trying to extract a complete table including the HTML tags, with XPath, that I can store in a variable, do a bit of string replacement on, then echo directly to the screen. I have found numerous posts on getting the text out of the table but I want to retain the HTML formatting since I am just going to display it (after minor modification).

At present I am extracting the table using string functions stristr, substr etc. but I would prefer to use XPath.

I can display the contents of the table with the following but it just displays the table TD fields with no formatting. It also does not store it in a variable that I can manipulate.

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$arr = $xpath->query('//table');
foreach ($arr as $el) {
   echo $el->textContent;
}

I tried this but got no output:

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$arr = $xpath->query('//table');
echo $arr->saveHTML();
user2605793
  • The `textContent` property (http://www.php.net/manual/en/class.domnode.php#domnode.props.textcontent) just returns the text content of the `<table>` element and its descendants, i.e. no HTML start/end tags or HTML attributes. I don't know whether DOMNode or its ilk offer a way to extract the "outerHTML" (serialization including markup) of a node.
    – LarsH Oct 10 '13 at 21:23
  • I think I might need to use saveHTML. Just trying to work out how it works. – user2605793 Oct 10 '13 at 21:27
  • I think saveHTML wraps its own `<html>`, `<body>` etc. around the data. I just used xpath to remove all of that! – user2605793 Oct 10 '13 at 21:30
  • possible duplicate of [PHP + DOMDocument: outerHTML for element?](http://stackoverflow.com/questions/5404941/php-domdocument-outerhtml-for-element) – IMSoP Oct 10 '13 at 21:46
  • I am a newbie. Please be kind. I have looked at that other answer but am not quite sure how to use it. Would my echo above become: echo $domDocument->saveHtml($el); – user2605793 Oct 10 '13 at 22:30
  • I tried the code from here: http://phpfiddle.org/main/code/rc1-00s but I got multiple errors: PHP Warning: DOMDocument::saveHTML() expects exactly 0 parameters, 1 given in /home3/austcemi/public_html/curl5.php on line 95 – user2605793 Oct 10 '13 at 22:46
  • The system I am using is PHP 5.2.17. I note that the solution proposed on that page requires PHP 5.3.6. Upgrading PHP is not an option available to me. – user2605793 Oct 10 '13 at 22:50
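
For a PHP version where `saveHTML()` takes no argument, one possible workaround is to copy the node into a fresh `DOMDocument` and serialize that whole document instead. The following is only a sketch under that assumption: the helper name `outerHtml` and the sample `$html` string are made up for illustration and do not come from the thread.

```php
<?php
// Hypothetical helper (not from the thread): serialize a single node with its
// markup without passing an argument to saveHTML(), which needs PHP 5.3.6+.
function outerHtml(DOMNode $node)
{
    $tmp = new DOMDocument();
    // importNode(..., true) deep-copies the node into the temporary document.
    $tmp->appendChild($tmp->importNode($node, true));
    // The temporary document contains only the copied node.
    return trim($tmp->saveHTML());
}

// Made-up sample input standing in for the scraped page.
$html = '<html><body><p>intro</p>'
      . '<table><tr><td>SMITH</td><td>SYDNEY</td></tr></table>'
      . '</body></html>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Collect each table's markup in a variable so it can be modified before output.
$tables = array();
foreach ($xpath->query('//table') as $el) {
    $tables[] = outerHtml($el);
}

echo $tables[0];
```

Because the markup ends up in `$tables`, it can be passed through `str_replace()` before being echoed.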

1 Answer

Use DOMNode::C14N():

foreach ($arr as $el) {
   echo $el->C14N();
}
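
Since the question also asks for the markup in a variable, here is a slightly fuller sketch of the same idea. The sample `$html` string, the `$tableHtml` variable, and the replacement performed are illustrative assumptions only. Note that `C14N()` produces canonical XML rather than HTML, so empty elements come out as start/end pairs (for example `<br></br>`), which may or may not matter for display.

```php
<?php
// Made-up sample input standing in for the scraped page.
$html = '<html><body><table class="results">'
      . '<tr><td>SMITH WILLIAM</td><td>GRENFELL</td></tr>'
      . '</table></body></html>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Concatenate the canonical serialization of every matching table.
$tableHtml = '';
foreach ($xpath->query('//table') as $el) {
    $tableHtml .= $el->C14N();
}

// Example modification before display (the replacement itself is arbitrary).
$tableHtml = str_replace('GRENFELL', '<b>GRENFELL</b>', $tableHtml);

echo $tableHtml;
```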
Jens Erat
  • That sort of worked thanks, only the output has capital A with a tilde on top in lots of places: – user2605793 Oct 10 '13 at 22:01
  • 22422/1901 SMITH WILLIAM S WILLIAM A MARGARET GRENFELL Â Buy Now 35/1901 SMITH WILLIAM G THOMAS CATHERINE SYDNEY Â Buy Now 11473/1901 SMITH WILLIAM Â ALICE E BURWOOD Â Buy Now 13968/1901 SMITH WILLIAM M WILLIAM W BRIDGET C LIVERPOOL Â Buy Now – user2605793 Oct 10 '13 at 22:03
  • This seems like an encoding problem. You probably either need to load the files using the right encoding or transform it afterwards. What character is in the input which should have been here? By the way, this is a circumflex, not a tilde. That doesn't change the problem, though. – Jens Erat Oct 10 '13 at 22:05
  • Not an encoding issue. I believe from elsewhere that C14N tries to correct the HTML. Here is part of it. It is &nbsp; on this one. GRENFELL Â GRENFELL &nbsp; – user2605793 Oct 10 '13 at 22:15
  • The first bit is my output. The second bit is from the original page I scraped from. – user2605793 Oct 10 '13 at 22:16
  • @OP: What makes you think it's not an encoding issue? nbsp coming out as Â is a known problem due to UTF-8 being interpreted as ISO-8859-1: e.g. http://osdir.com/ml/text.xml.xalan.java.user/2004-04/msg00037.html I think Jens has answered your question successfully... the encoding issue is a separate thing. – LarsH Oct 11 '13 at 01:04
  • @LarsH Interesting, didn't know of that yet. There's even a [bunch of on-site resources on that problem](http://stackoverflow.com/search?q=%C3%82+nbsp). – Jens Erat Oct 11 '13 at 07:56
  • Thanks for the help. I have simply issued the following before passing to xpath and it now works fine: $result = str_replace('&nbsp;', " ", $result); – user2605793 Oct 11 '13 at 10:06
  • @OP: glad you were able to resolve your problem. You are OK with the fact that your non-breaking spaces are being replaced with breakable spaces? Alternatively, you could do a replace *after* the xpath/C14N that replaces `Â` with `&nbsp;`. But you'd have to make sure the encoding is right in your PHP script. Jens, I didn't know about it either, just googled. :-) – LarsH Oct 11 '13 at 14:11
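
A sketch of the "replace after C14N" idea from the last comment, for anyone who wants to keep the non-breaking spaces. It assumes that `C14N()` emits UTF-8, in which U+00A0 is the byte pair 0xC2 0xA0; matching those raw bytes avoids depending on the encoding of the PHP source file. The sample `$html` and the variable names are made up for illustration.

```php
<?php
// Made-up sample input containing a non-breaking space entity.
$html = '<html><body><table>'
      . '<tr><td>GRENFELL</td><td>&nbsp;</td></tr>'
      . '</table></body></html>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

$out = '';
foreach ($xpath->query('//table') as $el) {
    $out .= $el->C14N();
}

// The parser turned &nbsp; into the character U+00A0, which is serialized as
// the UTF-8 bytes 0xC2 0xA0. Read as ISO-8859-1, 0xC2 shows up as "Â".
$nbsp = "\xC2\xA0";

// Turn the raw character back into the entity so the output no longer
// depends on how the browser guesses the page encoding.
$out = str_replace($nbsp, '&nbsp;', $out);

echo $out;
```

Another option, if the rest of the page is UTF-8 anyway, would be to leave the bytes alone and declare the encoding, for example with `header('Content-Type: text/html; charset=utf-8');` before any output.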