Extract html fragments using lxml

Question

I'm using lxml 3.7.3. I'm trying to parse a list of HTML fragments, change the value of some specific node attributes, and then save them.

Basically this is my code:

import io

from lxml import etree

# parsing of the html fragment
parser = etree.HTMLParser()
xhtml = etree.parse(io.StringIO('<a href="http://www.example.com">hello world</a>'), parser)
root = xhtml.getroot()

# change href from example.com to otherexample.com 
[...]

# render the elements tree as a string 
etree.tostring(root, encoding='unicode', method='html')

Output: 
<html><body><a href='http://www.otherexample.com'>hello world</a></body></html>

The issue I have is that the <html> and <body> tags are always added to the rendered HTML.

What I need is the string <a href='http://www.otherexample.com'>hello world</a>, because what I'm processing is part of an HTML document, not the whole document.

I know I could just strip out the surrounding tags, but that looks very hacky.

Any suggestions?

EDIT:

The markup I get as input is broken in places or is not proper HTML. This is an example:

<p><a href="http://www.example.com">hello world</a><script async src="//platform.twitter.com/widgets.js" charset="utf-8"></script><p>

Expected Output:

<p><a href="http://www.otherexample.com">hello world</a><script async src="//platform.twitter.com/widgets.js" charset="utf-8"></script><p>

@mzjn that's a good one (+1, because it would fix the issue in my first example), but the XML parser doesn't work in my case because it is not tolerant enough of bad/broken pieces of HTML. Look at the edit. — Riccardo, Apr 28 '17 at 13:11
What is your expected/desired output for the broken markup you provided? — supersam654, Apr 28 '17 at 14:56
@supersam654 just edited the original question, please check the changes. — Riccardo, Apr 28 '17 at 14:59
I've searched around a bit and can't find a way to get lxml to spit back your original markup. I'd recommend fixing your markup before touching individual pieces. Also, take a look at http://stackoverflow.com/questions/16498805/parse-html-body-fragment-in-lxml for getting rid of the `<html>` and `<body>` tags. — supersam654, Apr 28 '17 at 15:14
