11

Suppose I have some HTML fragments like:

<a>
   text in a
   <b>text in b</b>
   <c>text in c</c>
</a>
<a>
   <b>text in b</b>
   text in a
   <c>text in c</c>
</a>

I want to extract the text inside each a tag, dropping the nested tags but keeping their text. For instance, the content I want to extract above would be "text in a text in b text in c" and "text in b text in a text in c". I can already get the nodes using the Scrapy Selector css() function, but how should I process these nodes to get what I want? Any ideas would be appreciated, thank you!

kuixiong

4 Answers

11

Here's what I managed to do:

from scrapy.selector import Selector

sel = Selector(text=html_string)

# Print every text node found under an <a> element
for node in sel.css('a *::text'):
    print(node.extract())

Assuming that html_string is a variable holding the HTML from your question, this code produces the following output:

   text in a

text in b


text in c




text in b

   text in a

text in c

The selector a *::text matches all the text nodes that are descendants of a elements.
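
To get from these text nodes to the single string asked for in the question, the pieces can be joined and their whitespace collapsed. Here is a minimal sketch, assuming the text nodes arrive as a plain list of strings. The name join_text_nodes is a hypothetical helper, and the sample list is only the shape I would expect extract() to return for the first a element:

```python
def join_text_nodes(text_nodes):
    """Join extracted text nodes into one string with single spaces."""
    # Concatenate first (like XPath's string value), then let
    # str.split() with no arguments split on any run of whitespace
    # and drop the empty pieces.
    return " ".join("".join(text_nodes).split())


# Assumed sample: the text nodes of the first <a> element, in the shape
# that a.xpath('.//text()').extract() would return them.
first_a = ["\n   text in a\n   ", "text in b", "\n   ", "text in c", "\n"]

print(join_text_nodes(first_a))  # text in a text in b text in c
```

With Scrapy, this would be called once per element, for example join_text_nodes(a.xpath('.//text()').extract()) for each a in sel.css('a').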

Cristian Lupascu
  • This is great, but I managed to make it by sel.css("a").extract() and then using regex to exclude those html tags – kuixiong Feb 22 '15 at 14:07
  • @kuixiong Great! Note that parsing HTML with regex is generally [not considered a good practice](http://stackoverflow.com/q/590747/390819). If you control that HTML and it is simple enough, go ahead and use regex. Otherwise, consider relying on specialized tools. – Cristian Lupascu Feb 22 '15 at 14:40
  • The solution collects the text, not the innerHTML. – jeroen e Oct 25 '19 at 09:13
10

You can use XPath's string() function on the elements you select:

$ python
>>> import scrapy
>>> selector = scrapy.Selector(text="""<a>
...    text in a
...    <b>text in b</b>
...    <c>text in c</c>
... </a>
... <a>
...    <b>text in b</b>
...    text in a
...    <c>text in c</c>
... </a>""", type="html")
>>> for link in selector.css('a'):
...     print(link.xpath('string(.)').extract())
... 
['\n   text in a\n   text in b\n   text in c\n']
['\n   text in b\n   text in a\n   text in c\n']
>>> 
paul trmbrth
3

Try this:

response.xpath('//a/node()').extract()
0

In Scrapy 1.5, you can use /* to get the inner HTML. Example:

content = response.xpath('//div[@class="viewbox"]/div[@class="content"]/*').extract_first()
Mario7
  • This will only extract the first node in .content; use extract() with a ''.join to get the full inner HTML as a string. – jeroen e Oct 25 '19 at 09:35