lxml etree HTML parser changes order of nodes (<center> inside <p>)

Question

I'm currently facing an issue where I can't explain the etree behaviour. The following code demonstrates it. I want to parse an HTML string as illustrated below, change an attribute of an element, and print the HTML again when done.

from lxml import etree
from io import StringIO, BytesIO

string = "<p><center><code>git clone https://github.com/AlexeyAB/darknet.git</code></center></p>"
parser = etree.HTMLParser()
test = etree.fromstring(string, parser)
print(etree.tostring(test, pretty_print=True, method="html"))

I get this output:

<html><body>
<p></p>
<center><code>git clone https://github.com/AlexeyAB/darknet.git</code></center>
</body></html>

As you can see (let's ignore the <html> and <body> tags etree adds), the order of the nodes has been changed! The <p> tag that used to wrap the <center> tag now loses its content, and that content gets added after the <p> tag closes. Eh?

When I omit the <center> tag, all of a sudden the parsing is done right:

from lxml import etree
from io import StringIO, BytesIO

string = "<p><code>git clone https://github.com/AlexeyAB/darknet.git</code></p>"
parser = etree.HTMLParser()
test = etree.fromstring(string, parser)
print(etree.tostring(test, pretty_print=True, method="html"))

With correct output:

<html><body><p><code>git clone https://github.com/AlexeyAB/darknet.git</code></p></body></html>

Am I doing something wrong here? I have to use the HTML parser because I get a lot of parsing errors when not using it. I also can't change the order of the <p> and <center> tags, as I read them this way.

The same happens if you use `<div>` instead of `<center>`. Perhaps this is some subtle hint from lxml that you should not use `<div>` inside `<p>` (`<div align="center">` is equivalent to `<center>`). See http://stackoverflow.com/a/2226592/407651. – mzjn, May 17 '17 at 15:13
@panatale1, I edited the code to include the definition of `parser`, it got lost when copying the code (was a few lines higher), shouldn't affect the issue at hand though I think — Nils Tijtgat, May 18 '17 at 08:10
@mzjn, thank you for clearing that up. So it looks like lxml is trying to do the good thing here, correcting a situation that shouldn't have happened in the first place? — Nils Tijtgat, May 18 '17 at 08:12

Tomalak · Answer 1 · 2017-05-18T11:42:40.113

1

<center> is a block level element.

<p> cannot legally contain block level elements.

Therefore the parser closes the <p> when it encounters <center>.
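
As a rough illustration of that rule, the sketch below parses three small fragments and lists the direct children of <body>. The helper name `body_children` and the sample fragments are made up for this example; only `etree.HTMLParser`, `etree.fromstring` and `find` come from lxml.

```python
from lxml import etree


def body_children(html):
    """Parse an HTML fragment with lxml's HTML parser and return the
    tag names of the direct children of <body>."""
    root = etree.fromstring(html, etree.HTMLParser())
    return [child.tag for child in root.find("body")]


# Inline child: <code> is allowed inside <p>, so <p> stays the only child.
inline = body_children("<p><code>x</code></p>")

# Block-level children: the parser closes <p> before opening the block
# element, so the block element ends up as a sibling of an empty <p>.
with_center = body_children("<p><center>x</center></p>")
with_div = body_children("<p><div>x</div></p>")

print(inline, with_center, with_div)
```

If the parser behaves as described in the question, the first list contains only `p`, while the other two start with `p` followed by the block element.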

Use valid HTML - or an XML parser, which does not care about HTML rules (but in exchange can't deal with some of the HTML specifics, like most named entities, such as &nbsp;, or unclosed/self-closing tags).
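
A minimal sketch of that trade-off, assuming the input happens to be well-formed XML: `etree.fromstring` without a parser argument uses lxml's XML parser, which keeps the original nesting but rejects an HTML-only entity. The `a&nbsp;b` fragment is an invented test input.

```python
from lxml import etree

string = "<p><center><code>git clone https://github.com/AlexeyAB/darknet.git</code></center></p>"

# Without an explicit parser, etree.fromstring uses the XML parser.
# It knows nothing about HTML content models, so <center> stays inside <p>.
root = etree.fromstring(string)
nesting = [root.tag, root[0].tag, root[0][0].tag]
print(nesting)

# The price: named entities that only HTML defines are unknown to plain XML.
try:
    etree.fromstring("<p>a&nbsp;b</p>")
    entity_ok = True
except etree.XMLSyntaxError as exc:
    entity_ok = False
    print("XML parser rejected &nbsp;:", exc)
```

This only helps when every document you read is well-formed; the question mentions many parsing errors without the HTML parser, so it may not be an option there.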

Centering content has been done with CSS for ages now, there is no reason to use <center> anymore (and, in fact, it's deprecated). But it still works, and if you insist on using it, switch the nesting.

<center><p><code>git clone https://github.com/AlexeyAB/darknet.git</code></p></center>
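
With the nesting switched like this, the HTML parser should leave the structure alone, and the attribute change the question asks about becomes straightforward. The sketch below assumes that behaviour; the attribute name and value (`class="command"`) are purely illustrative.

```python
from lxml import etree

string = "<center><p><code>git clone https://github.com/AlexeyAB/darknet.git</code></p></center>"
root = etree.fromstring(string, etree.HTMLParser())

# <center> may contain <p>, so the parser has no reason to move anything.
center = root.find("body/center")
p = center.find("p")

# Hypothetical edit: set an attribute on the <p> element.
p.set("class", "command")

# Serialize only the <center> subtree, skipping the <html>/<body>
# wrappers that the HTML parser adds.
result = etree.tostring(center, method="html", encoding="unicode")
print(result)
```

Passing `encoding="unicode"` makes `tostring` return a `str` instead of bytes, which is convenient for printing.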

edited May 18 '17 at 11:42

answered May 18 '17 at 08:53

Tomalak


I thought block level...? At least that would explain the parser's behavior. – Tomalak May 18 '17 at 11:37
I think the problem is the use of `<center>` (or `<div>`) as a child of `<p>`. – mzjn May 18 '17 at 11:40
Ooh, `<center>` is the block level element. Of course! Thanks for the correction. :) – Tomalak May 18 '17 at 11:42
