XPath to get only an element's content, not its own tag

Question

<div id="content">
   foo <br/>
   bar <br/>
</div>

I am trying to get the inner text of the content div above with the following:

response.xpath('//div[@id ="content"]').extract()

This gives me the following:

[u'<div id="content"> foo<br/>bar <br/></div>']

How can I get:

foo<br/>bar<br/>

What language are you using to call response.xpath and .extract()? — danjuggler, Nov 08 '17 at 18:40

score 0 · Answer 1 · answered Nov 08 '17 at 19:11

lxml is impressively inconvenient in many places – getting an element’s inner HTML is one of them. Adapted from an answer by lormus:

from lxml import html

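def inner_html(element):
    # element.text is the text before the first child; tostring() on each
    # child also emits the text that follows it (its "tail") by default.
    return (
        (element.text or '') +
        ''.join(html.tostring(child, encoding='unicode') for child in element)
    )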

In use:

>>> from scrapy.selector import Selector
>>> response = Selector(text="""
... <div id="content">
...    foo <br/>
...    bar <br/>
... </div>
... """)
>>> inner_html(response.css('#content')[0].root)
'\n   foo <br>\n   bar <br>\n'
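
For comparison, here is a minimal sketch of the same idea using only the standard library's xml.etree.ElementTree. It assumes the markup is well-formed XML (real-world HTML often is not, which is why lxml or Scrapy's selectors are normally used), and the name inner_xml is purely illustrative. As with lxml, serializing a child also emits the text that follows it, so joining the element's own text with the serialized children rebuilds the inner markup:

```python
import xml.etree.ElementTree as ET


def inner_xml(element):
    """Return the markup inside `element`, without its own start/end tag."""
    # Text that appears before the first child element (may be None).
    leading_text = element.text or ""
    # ET.tostring() on a child includes that child's tail text.
    children = "".join(ET.tostring(child, encoding="unicode") for child in element)
    return leading_text + children


# Assumption: input is well-formed XML, unlike much real-world HTML.
div = ET.fromstring('<div id="content">foo <br/>bar <br/></div>')
inner = inner_xml(div)
print(inner)
```

ElementTree writes empty elements as `<br />`, so the result here is 'foo <br />bar <br />' rather than the HTML-style <br> that lxml's html.tostring produces.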

Wilfredo · Accepted Answer · 2017-11-08T23:30:20.433

0

Try this:

from operator import methodcaller

''.join(map(methodcaller('strip'), response.xpath('//div[@id ="content"]/node()').extract()))
# output: u'foo<br>bar<br>'

Note that lxml serializes <br /> as <br>. If you don't need the inner tags at all, you could do this instead:

response.xpath('normalize-space(//div[@id ="content"])').extract_first()
# output: u'foo bar'
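
To make clear what normalize-space() does there, a rough pure-Python equivalent is sketched below. It is only an illustration: it uses the standard library's ElementTree on well-formed input rather than Scrapy, and normalize_space is a hypothetical helper name. The XPath function takes the element's string value (all descendant text concatenated), trims leading and trailing whitespace, and collapses each internal run of whitespace into a single space:

```python
import re
import xml.etree.ElementTree as ET

# XPath 1.0 treats only these four characters as whitespace.
_XPATH_WS = " \t\r\n"


def normalize_space(value):
    """Approximate XPath's normalize-space() for a plain Python string."""
    collapsed = re.sub(r"[ \t\r\n]+", " ", value)
    return collapsed.strip(_XPATH_WS)


# Assumption: well-formed XML input, parsed with the stdlib instead of Scrapy.
div = ET.fromstring('<div id="content">\n   foo <br/>\n   bar <br/>\n</div>')

# The string value of an element is the concatenation of all its text nodes.
string_value = "".join(div.itertext())
text_only = normalize_space(string_value)
print(text_only)
```

Because only text is kept, any inline markup inside the div is dropped as well.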

edited Nov 08 '17 at 23:30

answered Nov 08 '17 at 22:09

Wilfredo


This will lose the ability to distinguish elements from text (e.g. `<b>`). – Ry- Nov 12 '17 at 07:34
