Equivalent to InnerHTML when using lxml.html to parse HTML

Question

I'm working on a script using lxml.html to parse web pages. I have done a fair bit of BeautifulSoup in my time but am now experimenting with lxml due to its speed.

I would like to know the most sensible way in the library to do the equivalent of JavaScript's innerHTML - that is, to retrieve or set the complete contents of a tag.

<body>
<h1>A title</h1>
<p>Some text</p>
</body>

innerHTML is therefore:

<h1>A title</h1>
<p>Some text</p>

I can do it using hacks (converting to string/regexes etc) but I'm assuming that there is a correct way to do this using the library which I am missing due to unfamiliarity. Thanks for any help.

EDIT: Thanks to pobk for showing me the way on this so quickly and effectively. For anyone trying the same, here is what I ended up with:

from lxml import html
from cStringIO import StringIO
t = html.parse(StringIO(
"""<body>
<h1>A title</h1>
<p>Some text</p>
Untagged text
<p>
Unclosed p tag
</body>"""))
root = t.getroot()
body = root.body
print (body.text or '') + ''.join([html.tostring(child) for child in body.iterchildren()])

Note that the lxml.html parser will fix up the unclosed tag, so beware if this is a problem.

You may consider using `encoding='unicode'` in html.tostring in order to get nice Unicode strings rather than a horrible byte soup Python hates. — zopieux, Mar 03 '12 at 14:43
this isn't quite right either; if `element.text` contains any metacharacters, they'll come out literally. you **must** HTML-escape it yourself. — Eevee, May 08 '13 at 01:32

score 16 · Answer 1 · edited Jun 16 '23 at 13:54


Sorry for bringing this up again, but I've been looking for a solution and yours contains a bug:

<body>This text is ignored
<h1>Title</h1><p>Some text</p></body>

Text directly under the root element is ignored. I ended up doing this:

(body.text or '') +\
''.join([html.tostring(child) for child in body.iterchildren()])

edited Jun 16 '23 at 13:54 by Benjamin Loison
answered Jun 18 '11 at 12:46 by lormus

Thanks lormus, you are correct - I have edited the answer above, good spot. – somewhatoff Jun 27 '11 at 17:40

`body.text` should be escaped – andreymal Jul 31 '18 at 13:11

score 12 · Accepted Answer · answered May 25 '11 at 11:29


You can get the children of an ElementTree node using the getchildren() or iterdescendants() methods of the root node:

>>> from lxml import etree
>>> from cStringIO import StringIO
>>> t = etree.parse(StringIO("""<body>
... <h1>A title</h1>
... <p>Some text</p>
... </body>"""))
>>> root = t.getroot()
>>> for child in root.iterdescendants():
...  print etree.tostring(child)
...
<h1>A title</h1>

<p>Some text</p>

This can be shorthanded as follows:

print ''.join([etree.tostring(child) for child in root.iterdescendants()])

answered May 25 '11 at 11:29 by pobk


Note that you'll want to call .iterchildren() and not .iterdescendants() -- the latter will cause severe duplication of content, as .tostring() will descend itself. For example, see the duplication of the 'two' and 'four' nodes: https://gist.github.com/1290412 – arantius Oct 16 '11 at 01:50

Note that regardless of whether you use `iterchildren` or `iterdescendants`, both of these solutions are incorrect and will completely ignore text nodes contained by the parent element. See http://stackoverflow.com/questions/4624062/get-all-text-inside-a-tag-in-lxml for a better answer. – larsks Jun 26 '13 at 14:50

Saurabh Chandra Patel · Answer 3 · 2015-01-16T04:54:25.603

4

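import lxml.etree as ET
from lxml import html

# `t` is the parsed tree from the question, e.g. t = html.parse(...)
body = t.xpath("//body")
for tag in body:
    # Re-parse each child on its own so the XPath only sees that fragment.
    h = html.fromstring(ET.tostring(tag[0])).xpath("//h1")
    p = html.fromstring(ET.tostring(tag[1])).xpath("//p")
    htext = h[0].text_content()
    ptext = p[0].text_content()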

You can also use .get('href') on a tag to read that attribute, and .attrib to get all of its attributes.

Here the tag index is hardcoded, but you can also determine it dynamically.

edited Jan 16 '15 at 04:54
answered Jan 16 '15 at 04:46 by Saurabh Chandra Patel

I need to remove tags like ``, `` or ``. This `text_content` saved my day. – Ziyuan Jan 19 '15 at 16:10
Should be `ptext = p[0].text_content();` (If I could edit single char I'd also reduce the double space before ` ET.tostring(tag[1])` to a single.) – Alexx Roche Oct 02 '18 at 09:28

score 2 · Answer 4 · answered Nov 14 '19 at 17:49

Here is a Python 3 version:

from xml.sax import saxutils
from lxml import html

def inner_html(tree):
    """ Return inner HTML of lxml element """
    return (saxutils.escape(tree.text) if tree.text else '') + \
        ''.join([html.tostring(child, encoding=str) for child in tree.iterchildren()])

Note that this includes escaping of the initial text as recommended by andreymal -- this is needed to avoid tag injection if you're working with sanitized HTML!

score 2 · Answer 5 · edited Dec 20 '22 at 16:12

I find none of the answers satisfying, some are even in Python 2. So I add a one-liner solution that produces innerHTML-like output and works with Python 3:

from lxml import html

# generate some HTML element node
node = html.fromstring("""<container>
Some random text <b>bold <i>italic</i> yeah</b> no yeah
<!-- comment blah blah -->  <img src='gaga.png' />
</container>""")

# compute inner HTML of element
# (text nodes come back from xpath as str subclasses, everything else as elements)
innerHTML = "".join([
    str(c) if isinstance(c, str)
    else html.tostring(c, with_tail=False).decode()
    for c in node.xpath("node()")
]).strip()

The result will be:

'Some random text <b>bold <i>italic</i> yeah</b> no yeah\n<!-- comment blah blah -->  <img src="gaga.png">'

What it does: The xpath delivers all node children (text, elements, comments). The list comprehension produces a list of the text contents of the text nodes and HTML content of element nodes. Those are then joined into a single string. If you want to get rid of comments, use *|text() instead of node() for xpath.
