4

How do I retrieve all the HTML contained inside a tag?

hxs = HtmlXPathSelector(response)
element = hxs.select('//span[@class="title"]/')

Perhaps something like:

hxs.select('//span[@class="title"]/html()')

EDIT: Looking at the documentation, I see only methods that return a new XPathSelectorList or just the raw text inside a tag. I don't want a new list or just the text; I want the HTML source code inside a tag, e.g.:

<html>
    <head>
        <title></title>
    </head>
    <body>
        <div id="leexample">
            justtext
            <p class="ihatelookingforfeatures">
                sometext
            </p>
            <p class="yahc">
                sometext
            </p>
        </div>
        <div id="lenot">
            blabla
        </div>
    an awfuly long example for this.
    </body>
</html>

I want a call like hxs.select('//div[@id="leexample"]/html()') that returns the HTML inside that div, like this:

justtext
<p class="ihatelookingforfeatures">
    sometext
</p>
<p class="yahc">
    sometext
</p>

I hope that clears up the ambiguity in my question.

How do I get the HTML from an HtmlXPathSelector in Scrapy? (Perhaps with a solution outside Scrapy's scope?)

  • What do you mean by *"retrieve all of the HTML"*? You need to show an example. – Wayne Jul 13 '12 at 03:36
  • My original thought was to go recursively over all the tags inside a tag and reproduce them as HTML, but that's way too complicated; somebody must have thought of something simpler. – mirandalol Jul 13 '12 at 03:44

6 Answers

6

Call .extract() on your XPathSelectorList. It will return a list of unicode strings containing the HTML content you want.

hxs.select('//div[@id="leexample"]/*').extract()

Update

# This is wrong
hxs.select('//div[@id="leexample"]/html()').extract()

/html() is not a valid Scrapy selector. To extract all children, use '//div[@id="leexample"]/*' or '//div[@id="leexample"]/node()'. Note that node() also returns text nodes, so the result looks something like:

[u'\n   ',
 u'<a href="image1.html">Name: My image 1\n']
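
As a concrete sketch against the asker's example HTML (this uses the old HtmlXPathSelector/select API; the URL and the inlined body string below are just stand-ins for a real crawled page):

from scrapy.http import HtmlResponse
from scrapy.selector import HtmlXPathSelector

# Stand-in for a real crawled page: wrap a trimmed copy of the question's
# example HTML in an HtmlResponse so it can be fed to a selector.
body = '<html><body><div id="leexample">justtext<p class="yahc">sometext</p></div></body></html>'
response = HtmlResponse(url='http://example.com', body=body, encoding='utf-8')
hxs = HtmlXPathSelector(response)

# node() matches elements *and* text nodes, so joining the extracted pieces
# reconstructs the inner HTML of the div, including the bare "justtext".
inner_html = ''.join(hxs.select('//div[@id="leexample"]/node()').extract())
print(inner_html)  # justtext<p class="yahc">sometext</p>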
– xiaowl
3

Use:

//span[@class="title"]/node()

This selects all nodes (elements, text nodes, processing instructions and comments) that are children of any span element in the XML document whose class attribute has the value "title".

If you want to get only the children-nodes of the first such span in the document, use:

(//span[@class="title"])[1]/node()
– Dimitre Novatchev
  • 2
    It's nice, but not what I asked. It returns a list of elements; I need the HTML behind those elements, not nodes, just plain HTML. – mirandalol Jul 13 '12 at 04:11
  • 1
    @Saga: This cannot be done with XPath alone -- you need to use, from the programming language that hosts XPath, a particular DOM method/property (such as `OuterXml` or `InnerXml` -- these may be named `OuterHtml` / `InnerHtml`, or in another DOM, `node.Save()`). – Dimitre Novatchev Jul 13 '12 at 05:07
  • 1
    Watch out: `//span[@class="title"]/node()` will fail if there are multiple classes. Combine it with a CSS selector to select elements with a given class: `parent.css('.title').xpath('node()')` – Kangur Mar 06 '16 at 17:09
  • 2
    @Kangur, CSS is not needed. See this answer explaining how to match an element's class when it may appear alongside other class names: http://stackoverflow.com/a/35354908/36305 (the idiom is sketched below). – Dimitre Novatchev Mar 06 '16 at 17:24
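
For the record, the class-matching idiom referenced in that comment (a sketch of the general technique, not a quote from the linked answer) looks like this in Scrapy:

# Match spans whose class attribute contains "title" as one of possibly several
# space-separated classes, then take the child nodes.
inner = hxs.select(
    '//span[contains(concat(" ", normalize-space(@class), " "), " title ")]/node()'
).extract()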
1

Though late, I'll leave this here for the record.

What I do is:

html = ''.join(hxs.select('//span[@class="title"]/node()').extract())

Or, if we want to match multiple elements and get one inner-HTML string per match:

elements = hxs.select('//span[@class="title"]')
html = [''.join(e.select('./node()').extract()) for e in elements]
– basaundi
0

Similarly to what @xiaowl pointed out, using hxs.select('//div[@id="leexample"]').extract() retrieves the HTML of the element matched by the XPath query //div[@id="leexample"] (including the enclosing tag itself).

So, for the record, I ended up with:

post = postItem()  # body = Field() in item.py
post['body'] = hxs.select('//span[@id="edit' + self.postid + '"]').extract()
open('logs/test.log', 'wb').write(str(post['body']))
# logs/test.log now contains all the HTML inside the tag selected by the query.
– mirandalol
0

It's actually not as hard as it seems. Just remove the final / from your XPath query and use the extract() method. I ran an example in the Scrapy shell; here's a shortened version:

sjaak:~ sjaakt$ scrapy shell
2012-07-19 11:06:21+0200 [scrapy] INFO: Scrapy 0.14.4 started (bot: scrapybot)
>>> fetch('http://www.nu.nl')
2012-07-19 11:06:34+0200 [default] INFO: Spider opened
2012-07-19 11:06:34+0200 [default] DEBUG: Crawled (200) <GET http://www.nu.nl> (referer: None)
>>> hxs.select("//h1").extract()
[u'<h1>    <script type="text/javascript">document.write(NU.today())</script>.\n    Het laatste nieuws het eerst op NU.nl    </h1>\n    ']
>>> 

To get only the inner content of a tag, add /* to your XPath query. Example:

>>> hxs.select("//h1/*").extract()
[u'<script type="text/javascript">document.write(NU.today())</script>.\n    Het laatste nieuws het eerst op NU.nl    ']
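
Note that /* only selects child elements, so bare text nodes (like the "justtext" in the question's example) are dropped. If those matter, a small sketch that keeps them and stitches the pieces into one string:

# node() keeps text nodes as well as elements; joining the extracted pieces
# yields the inner HTML as a single string.
inner_html = ''.join(hxs.select('//h1/node()').extract())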
– Sjaak Trekhaak
0

A bit of a hack (it reaches into the private _root property of Selector; works in Scrapy 1.0.5):

from lxml import html

def extract_inner_html(sel):
    # Serialize only the direct children: tostring() already includes each
    # child's whole subtree (and its tail text), so iterating over all
    # descendants would duplicate nested markup.
    children = ''.join(html.tostring(child) for child in sel._root.iterchildren())
    return (sel._root.text or '') + children

def extract_inner_text(sel):
    return ''.join(sel.css('::text').extract()).strip()

Use it like:

reason = extract_inner_html(statement.css(".politic-rating .rate-reason")[0])
text = extract_inner_text(statement.css('.politic-statement')[0])
all_text = extract_inner_text(statement.css('.politic-statement'))

I found the lxml part of this code in this question.
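
In newer, parsel-based Scrapy versions the underlying lxml element is exposed as the public .root attribute, so a similar sketch (an untested variant, not from the original answer) can avoid the private property:

from lxml import html

def extract_inner_html(sel):
    # .root is the lxml element behind a parsel/Scrapy Selector.
    root = sel.root
    children = ''.join(html.tostring(c, encoding='unicode') for c in root.iterchildren())
    return (root.text or '') + children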

– Kangur