<div id="content">
   foo <br/>
   bar <br/>
</div>

I am trying to get the inner text of the content div above with the following:

response.xpath('//div[@id ="content"]').extract()

This gives me the following:

[u'<div id="content"> foo<br/>bar <br/></div>']

How can I get:

foo<br/>bar<br/>
Ry-
DarthVader

2 Answers


lxml is impressively inconvenient in many places – getting an element’s inner HTML is one of them. Adapted from an answer by lormus:

from lxml import html

def inner_html(element):
    return (
        (element.text or '') +
        ''.join(html.tostring(child, encoding='unicode') for child in element)
    )

In use:

>>> from scrapy.selector import Selector
>>> response = Selector(text="""
... <div id="content">
...    foo <br/>
...    bar <br/>
... </div>
... """)
>>> inner_html(response.css('#content')[0].root)
'\n   foo <br>\n   bar <br>\n'
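
If the self-closing form of the tags matters, the same "leading text plus serialized children" pattern can be sketched with the standard library's xml.etree.ElementTree instead of lxml.html. This is a swapped-in parser, not what Scrapy uses, and it assumes the fragment is well-formed XML; the name inner_xml is only illustrative.

```python
import xml.etree.ElementTree as ET


def inner_xml(element):
    # Leading text of the element, followed by each child serialized in order.
    # ET.tostring() also writes each child's tail text, so text that sits
    # between the children is kept.
    return (element.text or '') + ''.join(
        ET.tostring(child, encoding='unicode') for child in element
    )


# Assumption: the markup parses as XML (every tag closed, single root).
div = ET.fromstring('<div id="content">\n   foo <br/>\n   bar <br/>\n</div>')
result = inner_xml(div)
print(repr(result))
```

ElementTree writes empty elements in self-closing form rather than as a bare <br>, but it will reject real-world HTML that is not valid XML, so this is only an option when the input is under your control.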
Ry-

Try this:

from operator import methodcaller

''.join(map(methodcaller('strip'), response.xpath('//div[@id ="content"]/node()').extract()))
# output: u'foo<br>bar<br>'

Note that lxml serializes <br /> as <br>. If you don't need the inner tags at all, you can do this instead:

response.xpath('normalize-space(//div[@id ="content"])').extract_first()
# output: u'foo bar'
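
For reference, here is a rough plain-Python sketch of what normalize-space() does to the div's text: concatenate all descendant text, trim both ends, and collapse each run of whitespace into one space. It uses the standard library's xml.etree.ElementTree rather than Scrapy, the helper name normalized_text is hypothetical, and str.split() treats a few more characters as whitespace than XPath does (a non-breaking space, for example), so take it as an approximation.

```python
import xml.etree.ElementTree as ET


def normalized_text(element):
    # itertext() yields the element's text and all descendant text and tails
    # in document order, which matches the XPath string-value of the element.
    text = ''.join(element.itertext())
    # split() with no arguments drops leading/trailing whitespace and splits
    # on runs of whitespace; joining with ' ' collapses each run to one space.
    return ' '.join(text.split())


# Assumption: the markup parses as XML (every tag closed, single root).
div = ET.fromstring('<div id="content">\n   foo <br/>\n   bar <br/>\n</div>')
text = normalized_text(div)
print(text)
```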
Wilfredo