I'm using lxml 3.7.3. I'm trying to parse a list of HTML fragments, change the value of some specific node attributes, and then save them.
Basically, this is my code:
import io
from lxml import etree
# parse the HTML fragment
parser = etree.HTMLParser()
xhtml = etree.parse(io.StringIO('<a href="http://www.example.com">hello world</a>'), parser)
root = xhtml.getroot()
# change href from example.com to otherexample.com
[...]
# render the elements tree as a string
etree.tostring(root, encoding='unicode', method='html')
Output:
<html><body><a href="http://www.otherexample.com">hello world</a></body></html>
The issue I have is that the <html> and <body> tags are always added to the rendered HTML.
What I need is the string <a href="http://www.otherexample.com">hello world</a>
because what I'm processing is part of an HTML document, not the whole document.
I know I could just strip out the surrounding tags, but that looks very hacky.
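To be concrete about the workaround I'd like to avoid, here is a rough sketch of it. `strip_wrapper` is a hypothetical helper name of my own, and it assumes the serializer always emits exactly `<html><body>` and `</body></html>` around the fragment, with no attributes or whitespace, which is part of why it feels fragile:

```python
def strip_wrapper(rendered):
    """Remove the <html><body> wrapper from a rendered string.

    Assumption: the wrapper appears exactly as '<html><body>' at the start
    and '</body></html>' at the end. Anything else is returned unchanged.
    """
    prefix = '<html><body>'
    suffix = '</body></html>'
    if rendered.startswith(prefix) and rendered.endswith(suffix):
        return rendered[len(prefix):-len(suffix)]
    return rendered


rendered = '<html><body><a href="http://www.otherexample.com">hello world</a></body></html>'
fragment = strip_wrapper(rendered)
print(fragment)
```

This works on the simple case above, but it relies on string matching rather than on the parsed tree, so I'd prefer a way to make lxml do it properly.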
Any suggestions?
EDIT:
The markup I get as input is broken in places or is not proper HTML. This is an example:
<p><a href="http://www.example.com">hello world</a><script async src="//platform.twitter.com/widgets.js" charset="utf-8"></script><p>
Expected Output:
<p><a href="http://www.otherexample.com">hello world</a><script async src="//platform.twitter.com/widgets.js" charset="utf-8"></script><p>