I'm currently facing an issue where I can't explain the etree behaviour. Following code demonstrates the issue I am facing. I want to parse an HTML string as illustrated below, change the attribute of an element and reprint the HTML when done.
from lxml import etree
from io import StringIO, BytesIO
string = "<p><center><code>git clone https://github.com/AlexeyAB/darknet.git</code></center></p>"
parser = etree.HTMLParser()
test = etree.fromstring(string, parser)
print(etree.tostring(test, pretty_print=True, method="html")
I get this output:
<html><body>
<p></p>
<center><code>git clone https://github.com/AlexeyAB/darknet.git</code></center>
</body></html>
As you can see (let's ignore the <html>
and <body>
tags etree adds), the order of the nodes has been changed! The <p>
tag that used to wrap the <center>
tag, now loses its content, and that content gets added after the </p>
tag closes. Eh?
When I omit the <center>
tag, all of a sudden the parsing is done right:
from lxml import etree
from io import StringIO, BytesIO
string = "<p><code>git clone https://github.com/AlexeyAB/darknet.git</code></p>"
parser = etree.HTMLParser()
test = etree.fromstring(string, parser)
print(etree.tostring(test, pretty_print=True, method="html"))
With correct output:
<html><body><p><code>git clone https://github.com/AlexeyAB/darknet.git</code></p></body></html>
Am I doing something wrong here? I have to use the HTML parser because I get a lot of parsing errors when not using it. I also can't change the order of the <p>
and <center>
tags, as I read them this way.
` (`