How to get innerHTML of a node using scrapy Selector?

Question

Suppose there are some html fragments like:

<a>
   text in a
   <b>text in b</b>
   <c>text in c</c>
</a>
<a>
   <b>text in b</b>
   text in a
   <c>text in c</c>
</a>

I want to extract the text inside each <a> tag, dropping the child tags but keeping their text. For the fragments above, the results would be "text in a text in b text in c" and "text in b text in a text in c". I can already select the nodes with the scrapy Selector css() function; how should I process those nodes to get what I want? Any ideas would be appreciated, thank you!

score 11 · Accepted Answer · answered Feb 22 '15 at 13:48

Here's what I managed to do:

from scrapy.selector import Selector

sel = Selector(text=html_string)

# Iterate over every text node found under the <a> elements
for node in sel.css('a *::text'):
    print(node.extract())

Assuming that html_string is a variable holding the html in your question, this code produces the following output:

   text in a

text in b


text in c




text in b

   text in a

text in c

The selector a *::text matches all the text nodes that are descendants of <a> elements.

Cristian Lupascu


This is great, but I managed to do it with sel.css("a").extract() and then a regex to strip the HTML tags – kuixiong Feb 22 '15 at 14:07
@kuixiong Great! Note that parsing HTML with regex is generally [not considered a good practice](http://stackoverflow.com/q/590747/390819). If you control that HTML and it is simple enough, go ahead and use regex. Otherwise, consider relying on specialized tools. – Cristian Lupascu Feb 22 '15 at 14:40
The solution collects the text, not the innerHTML. – jeroen e Oct 25 '19 at 09:13

score 10 · Answer 2 · answered Feb 23 '15 at 10:47

You can use XPath's string() function on the elements you select:

$ python
>>> import scrapy
>>> selector = scrapy.Selector(text="""<a>
...    text in a
...    <b>text in b</b>
...    <c>text in c</c>
... </a>
... <a>
...    <b>text in b</b>
...    text in a
...    <c>text in c</c>
... </a>""", type="html")
>>> for link in selector.css('a'):
...     print(link.xpath('string(.)').extract())
... 
['\n   text in a\n   text in b\n   text in c\n']
['\n   text in b\n   text in a\n   text in c\n']
>>>

score 3 · Answer 3 · answered Dec 24 '18 at 02:04

Try this:

response.xpath('//a/node()').extract()

Awais Asghar


This is the best and safest solution. – jeroen e Oct 25 '19 at 09:22

score 0 · Answer 4 · answered Jul 04 '18 at 09:44

In Scrapy 1.5, you can use /* to get the inner HTML. Example:

content = response.xpath('//div[@class="viewbox"]/div[@class="content"]/*').extract_first()

Mario7


This will only extract the first node in .content, use extract() with a ''.join to get the full innerhtml as a string. – jeroen e Oct 25 '19 at 09:35
