
I'm seeing lxml etree behaviour that I can't explain. The following code demonstrates the issue. I want to parse an HTML string as illustrated below, change an attribute of an element, and print the HTML again when done.

from lxml import etree
from io import StringIO, BytesIO

string = "<p><center><code>git clone https://github.com/AlexeyAB/darknet.git</code></center></p>"
parser = etree.HTMLParser()
test = etree.fromstring(string, parser)
print(etree.tostring(test, pretty_print=True, method="html"))

I get this output:

<html><body>
<p></p>
<center><code>git clone https://github.com/AlexeyAB/darknet.git</code></center>
</body></html>

As you can see (let's ignore the <html> and <body> tags etree adds), the order of the nodes has been changed! The <p> tag that used to wrap the <center> tag now loses its content, and that content gets added after the closing </p> tag. Eh?

When I omit the <center> tag, all of a sudden the parsing is done right:

from lxml import etree
from io import StringIO, BytesIO

string = "<p><code>git clone https://github.com/AlexeyAB/darknet.git</code></p>"
parser = etree.HTMLParser()
test = etree.fromstring(string, parser)
print(etree.tostring(test, pretty_print=True, method="html"))

With correct output:

<html><body><p><code>git clone https://github.com/AlexeyAB/darknet.git</code></p></body></html>

Am I doing something wrong here? I have to use the HTML parser because I get a lot of parsing errors when not using it. I also can't change the order of the <p> and <center> tags, as I read them this way.

smci
Nils Tijtgat
  • Where is `parser` coming from? – panatale1 May 17 '17 at 14:13
  • The same happens if you use `<div>` instead of `<center>`. Perhaps this is some subtle hint from lxml that you should not use `<center>` inside `<p>` (`<center>` is equivalent to `<div align="center">`). See http://stackoverflow.com/a/2226592/407651. – mzjn May 17 '17 at 15:13
  • @panatale1, I edited the code to include the definition of `parser`, it got lost when copying the code (was a few lines higher), shouldn't affect the issue at hand though I think – Nils Tijtgat May 18 '17 at 08:10
  • @mzjn, thank you for clearing that up. So it looks like lxml is trying to do the right thing here, correcting a situation that shouldn't have happened in the first place? – Nils Tijtgat May 18 '17 at 08:12

1 Answer

<center> is a block level element.

<p> cannot legally contain block level elements.

Therefore the parser closes the <p> when it encounters <center>.

Use valid HTML, or use an XML parser, which does not care about HTML rules (but in exchange can't deal with some HTML specifics, such as most named entities like &nbsp;, or tags that HTML allows to stay unclosed, like <br>).
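To illustrate the XML-parser option, here is a minimal sketch. It assumes the input is well-formed XML, which holds for the snippet from the question but, according to the asker, not for all of the real data. Calling etree.fromstring() without a parser argument uses lxml's default XML parser, so the nesting is left untouched.

```python
from lxml import etree

# Same markup as in the question.
string = "<p><center><code>git clone https://github.com/AlexeyAB/darknet.git</code></center></p>"

# No HTMLParser is passed, so lxml falls back to its default XML parser,
# which knows nothing about HTML content rules.
root = etree.fromstring(string)

print(root.tag)     # tag of the root element
print(root[0].tag)  # tag of its first child

# encoding="unicode" makes tostring() return str instead of bytes.
result = etree.tostring(root, encoding="unicode")
print(result)
```

For input that is not well-formed, etree.XMLParser(recover=True) tells lxml to try to keep parsing past errors; whether the recovered tree is acceptable for the real documents would need to be checked case by case.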

Centering content has been done with CSS for ages now, so there is no reason to use <center> anymore (in fact, it's deprecated). But it still works, and if you insist on using it, switch the nesting.

<center><p><code>git clone https://github.com/AlexeyAB/darknet.git</code></p></center>
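A small sketch to compare the two nestings under lxml's HTML parser. The helper name outline is made up for this illustration; it lists each of the three elements together with its parent's tag, so it is easy to see where the parser placed them.

```python
from lxml import etree

def outline(markup):
    """Hypothetical helper: parse with lxml's HTML parser and
    return (tag, parent tag) pairs in document order."""
    root = etree.fromstring(markup, etree.HTMLParser())
    return [(el.tag, el.getparent().tag) for el in root.iter("p", "center", "code")]

code = "<code>git clone https://github.com/AlexeyAB/darknet.git</code>"

# Nesting from the question: <center> inside <p>.
original = outline("<p><center>" + code + "</center></p>")
# Nesting suggested above: <p> inside <center>.
swapped = outline("<center><p>" + code + "</p></center>")

print(original)
print(swapped)
```

With the original nesting, <center> should show up as a child of <body> next to an empty <p>, matching the output in the question. With the switched nesting, each element should keep the parent it has in the source.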
Tomalak