Get a div's HTML content with lxml

Question

I'm using Python and lxml to get the content of div.article from a number of links. I want the actual HTML markup of the div, but so far I've only been able to get its text_content(), which strips out the markup.

from lxml import html

doc = html.fromstring(doc_text)

article = doc.cssselect("div.article")

if len(article) > 0:
    text = article[0].text_content()

    data = {
        'product':product,
        'content': text,
    }

Can anyone help me to get the markup of article[0]?

Thanks

score 4 · Accepted Answer · answered Mar 11 '13 at 16:46

You can iterate over the node's child elements, serialize each one with html.tostring(), and concatenate the results.

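def innerHTML(node):
    # Text that appears before the first child element lives in node.text
    parts = [node.text or '']
    for child in node:
        # encoding='unicode' returns str, not bytes; each child's tail text is included by default
        parts.append(html.tostring(child, encoding='unicode'))
    return ''.join(parts)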
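
If it is acceptable for the result to include the wrapping div itself, calling html.tostring() on the element directly may be enough. The following is a minimal sketch under stated assumptions: doc_text is hypothetical sample markup, and find_class is swapped in for cssselect only so that the sketch has no dependency on the separate cssselect package.

```python
from lxml import html

# Hypothetical sample markup standing in for one fetched page
doc_text = (
    '<html><body>'
    '<div class="article">Intro <b>bold</b> tail<p>para</p></div>'
    '</body></html>'
)

doc = html.fromstring(doc_text)
# find_class is used here so the sketch does not depend on the cssselect package
article = doc.find_class("article")

if article:
    # Serializes the div itself together with everything inside it
    outer = html.tostring(article[0], encoding="unicode")
    # For comparison: text only, with the markup stripped
    text = article[0].text_content()
    print(outer)
    print(text)
```

Passing encoding="unicode" makes tostring return a str; without it the result is bytes, which matters on Python 3 if the markup is later concatenated with ordinary strings.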

Spen-ZAR