Perl's HTML::Element - dumping just the descendants as HTML

Question

I'm having trouble trying to output the contents of a matched node that I'm parsing:

<div class="description">some text <br/>more text<br/></div>

I'm using HTML::TreeBuilder::XPath to find the node (there's only one div with this class):

my $description = $tree->findnodes('//div[@class="description"]')->[0];

It finds the node (returned as an HTML::Element, I believe), but $description->as_HTML includes the element itself too. I just want everything contained inside the element as HTML:

some text <br/>more text<br/>

I can obviously regex strip it out, but that feels messy and I'm sure I'm just missing a function somewhere to do it?

score 0 · Answer 1 · answered Feb 06 '13 at 13:22

Try this:

my $description = $tree->findnodes('//div[@class="description"]/text()')->[0];

This is an XPath trick: text() selects the text-node children of the div.

answered Feb 06 '13 at 13:22

Gilles Quénot

That returns an object of type HTML::TreeBuilder::XPath::TextNode, which doesn't have the 'as_HTML' method (and I can't seem to find any docs on what it does provide) – AndyC Feb 06 '13 at 13:35

Jens Erat · Answer 2 · 2013-02-06T15:28:26.577

Use ./node() to fetch all child nodes, both text and elements.

my $description = $tree->findnodes('//div[@class="description"]/node()');

edited Feb 06 '13 at 15:28

answered Feb 06 '13 at 13:52

Jens Erat

It has the same issue as using text(): the returned object is an HTML::TreeBuilder::XPath::TextNode and I'm not sure what to do with it. – AndyC Feb 06 '13 at 14:04
This call will return *multiple* nodes (all the nodes contained in the div), so the result is a container holding all of them. In list context you get a list; in scalar context (which is what you're forcing here) you get a `Tree::XPathEngine::NodeSet` object. You'll probably have to iterate over the result in some way. Also have a look at the `->[0]` at the end: I guess it's wrong here, because you want all nodes, not only the first. I removed it from my answer. – Jens Erat Feb 06 '13 at 15:28
Yeah, looking at the list returned, it's a mixture of `HTML::TreeBuilder::XPath::TextNode` and `HTML::Element`, which are lists themselves. It'd be extremely fiddly and annoying just to accomplish what I want, so at this rate I may as well just get rid of the parent tag with a regex! – AndyC Feb 06 '13 at 17:55
If you're going to apply a regex, you should be happy with a string anyway, right? Do you know about [`findnodes_as_string`](http://search.cpan.org/~mirod/HTML-TreeBuilder-XPath-0.14/lib/HTML/TreeBuilder/XPath.pm#findnodes_as_string_($path))? – Jens Erat Feb 06 '13 at 19:10
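
For what it's worth, here is a rough sketch of the "iterate over the result" idea from the comments, done without a regex. Note that it swaps in a different technique from the XPath answers above: it walks the matched element's children with HTML::Element's content_list instead of a node() query. In content_list, child elements come back as HTML::Element objects and text segments as plain strings, so only the elements need as_HTML. The helper name inner_html is made up for this example (it is not part of HTML::Element), and re-escaping the text with encode_entities is my assumption about what you'd want, since the stored text is already decoded. Exact whitespace and the precise form of the br tags may vary with your HTML::Tree version.

```perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;
use HTML::Entities qw(encode_entities);

# Hypothetical helper (not part of HTML::Element):
# serialize only the children of an element, not the element itself.
sub inner_html {
    my ($element) = @_;
    my @parts;
    for my $child ( $element->content_list ) {
        if ( ref $child ) {
            # Child element: serialize it together with its own subtree.
            push @parts, $child->as_HTML;
        }
        else {
            # Plain text segment: re-escape the markup characters.
            push @parts, encode_entities( $child, '<>&' );
        }
    }
    return join '', @parts;
}

my $tree = HTML::TreeBuilder::XPath->new_from_content(
    '<div class="description">some text <br/>more text<br/></div>'
);

# Same lookup as in the question: the single div with this class.
my $description = $tree->findnodes('//div[@class="description"]')->[0];

my $inner = inner_html($description);
print $inner, "\n";

# Free the tree once we are done with it.
$tree->delete;
```

This keeps the XPath lookup exactly as in the question and only changes how the matched node is turned into a string.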
