4

How do I retrieve all the HTML contained inside a tag?

hxs = HtmlXPathSelector(response)
element = hxs.select('//span[@class="title"]/')

Perhaps something like:

hxs.select('//span[@class="title"]/html()')

EDIT: Looking at the documentation, I see only methods that return a new XPathSelectorList or just the raw text inside a tag. I don't want a new list or just the text; I want the HTML source code inside a tag, e.g.:

<html>
    <head>
        <title></title>
    </head>
    <body>
        <div id="leexample">
            justtext
            <p class="ihatelookingforfeatures">
                sometext
            </p>
            <p class="yahc">
                sometext
            </p>
        </div>
        <div id="lenot">
            blabla
        </div>
    an awfuly long example for this.
    </body>
</html>

I want a call like hxs.select('//div[@id="leexample"]/html()') that returns the HTML inside that div, like this:

justtext
<p class="ihatelookingforfeatures">
    sometext
</p>
<p class="yahc">
    sometext
</p>

I hope that clears up the ambiguity in my question.

How do I get the HTML from an HtmlXPathSelector in Scrapy? (Perhaps with a solution outside Scrapy's scope?)

  • What do you mean by *"retrieve all of the HTML"*? You need to show an example. – Wayne Jul 13 '12 at 03:36
  • My original thought was to go recursively over all the tags inside a tag and reproduce them as HTML, but that's way too complicated; somebody must have thought of something simpler. – mirandalol Jul 13 '12 at 03:44

6 Answers

6

Call .extract() on your XPathSelectorList. It will return a list of unicode strings containing the HTML content you want.

hxs.select('//div[@id="leexample"]/*').extract()

Update

# This is wrong
hxs.select('//div[@id="leexample"]/html()').extract()

/html() is not a valid Scrapy selector. To extract all children, use '//div[@id="leexample"]/*' or '//div[@id="leexample"]/node()'. Note that node() also returns text nodes, so the result looks something like:

[u'\n   ',
 u'<a href="image1.html">Name: My image 1\n']
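
As a concrete sketch against the asker's example HTML (this uses the old HtmlXPathSelector/select API; the URL and the inlined body string below are just stand-ins for a real crawled page):

from scrapy.http import HtmlResponse
from scrapy.selector import HtmlXPathSelector

# Stand-in for a real crawled page: wrap a trimmed copy of the question's
# example HTML in an HtmlResponse so it can be fed to a selector.
body = '<html><body><div id="leexample">justtext<p class="yahc">sometext</p></div></body></html>'
response = HtmlResponse(url='http://example.com', body=body, encoding='utf-8')
hxs = HtmlXPathSelector(response)

# node() matches elements *and* text nodes, so joining the extracted pieces
# reconstructs the inner HTML of the div, including the bare "justtext".
inner_html = ''.join(hxs.select('//div[@id="leexample"]/node()').extract())
print(inner_html)  # justtext<p class="yahc">sometext</p>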
– xiaowl
3

Use:

//span[@class="title"]/node()

This selects all nodes (elements, text nodes, processing instructions and comments) that are children of any span element in the XML document whose class attribute has the value "title".

If you want to get only the children-nodes of the first such span in the document, use:

(//span[@class="title"])[1]/node()
– Dimitre Novatchev
  • 2
    It's nice, but not what I asked. It returns a list of elements; I need the HTML behind those elements, not nodes, just plain HTML. – mirandalol Jul 13 '12 at 04:11
  • 1
    @Saga: This cannot be done with XPath alone -- you need to use, from the programming language that hosts XPath, a particular DOM method/property (such as `OuterXml` or `InnerXml` -- these may be named `OuterHtml` / `InnerHtml`, or in another DOM, `node.Save()`). – Dimitre Novatchev Jul 13 '12 at 05:07
  • 1
    Watch out: `//span[@class="title"]/node()` will fail if there are multiple classes. Combine it with a CSS selector to select elements with a given class: `parent.css('.title').xpath('node()')` – Kangur Mar 06 '16 at 17:09
  • 2
    @Kangur, CSS is not needed. See this answer explaining how to match an element's class when it may appear alongside other class names: http://stackoverflow.com/a/35354908/36305 (the idiom is sketched below). – Dimitre Novatchev Mar 06 '16 at 17:24
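
For the record, the class-matching idiom referenced in that comment (a sketch of the general technique, not a quote from the linked answer) looks like this in Scrapy:

# Match spans whose class attribute contains "title" as one of possibly several
# space-separated classes, then take the child nodes.
inner = hxs.select(
    '//span[contains(concat(" ", normalize-space(@class), " "), " title ")]/node()'
).extract()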
1

Though late, I'll leave this here for the record.

What I do is:

html = ''.join(hxs.select('//span[@class="title"]/node()').extract())

Or, if we want to match multiple elements and get one inner-HTML string per match:

elements = hxs.select('//span[@class="title"]')
html = [''.join(e.select('./node()').extract()) for e in elements]
– basaundi
0

Similarly to what @xiaowl pointed out, using hxs.select('//div[@id="leexample"]').extract() retrieves the HTML of the element matched by the XPath query //div[@id="leexample"] (including the enclosing tag itself).

So, for the record, I ended up with:

post = postItem()  # body = Field() in item.py
post['body'] = hxs.select('//span[@id="edit' + self.postid + '"]').extract()
open('logs/test.log', 'wb').write(str(post['body']))
# logs/test.log now contains all the HTML inside the tag selected by the query.
– mirandalol
0

It's actually not as hard as it seems. Just remove the final / from your XPath query and use the extract() method. I ran an example in the Scrapy shell; here's a shortened version:

sjaak:~ sjaakt$ scrapy shell
2012-07-19 11:06:21+0200 [scrapy] INFO: Scrapy 0.14.4 started (bot: scrapybot)
>>> fetch('http://www.nu.nl')
2012-07-19 11:06:34+0200 [default] INFO: Spider opened
2012-07-19 11:06:34+0200 [default] DEBUG: Crawled (200) <GET http://www.nu.nl> (referer: None)
>>> hxs.select("//h1").extract()
[u'<h1>    <script type="text/javascript">document.write(NU.today())</script>.\n    Het laatste nieuws het eerst op NU.nl    </h1>\n    ']
>>> 

To get only the inner content of a tag, add /* to your XPath query. Example:

>>> hxs.select("//h1/*").extract()
[u'<script type="text/javascript">document.write(NU.today())</script>.\n    Het laatste nieuws het eerst op NU.nl    ']
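
Note that /* only selects child elements, so bare text nodes (like the "justtext" in the question's example) are dropped. If those matter, a small sketch that keeps them and stitches the pieces into one string:

# node() keeps text nodes as well as elements; joining the extracted pieces
# yields the inner HTML as a single string.
inner_html = ''.join(hxs.select('//h1/node()').extract())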
– Sjaak Trekhaak
0

A bit of a hack (it reaches into the private _root property of Selector; works in Scrapy 1.0.5):

from lxml import html

def extract_inner_html(sel):
    # Serialize only the direct children: tostring() already includes each
    # child's whole subtree (and its tail text), so iterating over all
    # descendants would duplicate nested markup.
    children = ''.join(html.tostring(child) for child in sel._root.iterchildren())
    return (sel._root.text or '') + children

def extract_inner_text(sel):
    return ''.join(sel.css('::text').extract()).strip()

Use it like:

reason = extract_inner_html(statement.css(".politic-rating .rate-reason")[0])
text = extract_inner_text(statement.css('.politic-statement')[0])
all_text = extract_inner_text(statement.css('.politic-statement'))

I found the lxml part of this code in this question.
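
In newer, parsel-based Scrapy versions the underlying lxml element is exposed as the public .root attribute, so a similar sketch (an untested variant, not from the original answer) can avoid the private property:

from lxml import html

def extract_inner_html(sel):
    # .root is the lxml element behind a parsel/Scrapy Selector.
    root = sel.root
    children = ''.join(html.tostring(c, encoding='unicode') for c in root.iterchildren())
    return (root.text or '') + children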

– Kangur