30

I have a raw html string that I want to convert to scrapy HTML response object so that I can use the selectors css and xpath, similar to scrapy's response. How can I do it?

alecxe
yayu

3 Answers

42

First of all, if it is for debugging or testing purposes, you can use the Scrapy shell:

$ cat index.html
<div id="test">
    Test text
</div>

$ scrapy shell ./index.html
>>> response.xpath('//div[@id="test"]/text()').extract()[0].strip()
'Test text'

There are different objects available in the shell during the session, like response and request.


Or you can instantiate the HtmlResponse class and pass the HTML string as body:

>>> from scrapy.http import HtmlResponse
>>> response = HtmlResponse(url="http://example.com", body='<div id="test">Test text</div>', encoding='utf-8')
>>> response.xpath('//div[@id="test"]/text()').extract()[0].strip()
'Test text'
Umair Ayub
alecxe
  • Thanks alecxe, I am using Selenium because of some ajaxiness. I want to convert driver.page_source into the same kind of object as response so that I can reuse some extractors (using css and xpath selectors) instead of having to resort to lxml. I think your second option is the one I need. – yayu Dec 05 '14 at 20:14
  • 1
    @yayu then you probably don't need to create an `HtmlResponse` but rather a `Selector`; see http://stackoverflow.com/questions/18836286/scraping-with-scrapy-and-selenium and http://stackoverflow.com/questions/17975471/selenium-with-scrapy-for-dynamic-page. Might help. Thanks. – alecxe Dec 05 '14 at 20:16
  • @yayu and, as a side note, there can be a point where you have much more selenium than scrapy in the project. At that point, think about whether there is any point in using scrapy at all. – alecxe Dec 05 '14 at 20:17
  • @yayu also, [`scrapyjs`](https://github.com/scrapinghub/scrapyjs) might be worth trying; maybe you could avoid using `selenium`. – alecxe Dec 05 '14 at 20:18
  • @alecxe is there any way to set the meta attribute for this `response` object? I know `response.meta` is an alias for the `request.meta` object, but since there is no `request` associated with this `response`, is there any workaround? – Kashyap Oct 08 '17 at 04:15
  • 6
    As of today, the HtmlResponse object requires another argument, encoding. You can do it like this: HtmlResponse(url='http://scrapy.org', body=u'some body', encoding='utf-8') – Mehmet Kurtipek May 08 '18 at 21:38
  • On Linux, `scrapy shell index.html` does not work, as documented at https://docs.scrapy.org/en/latest/topics/shell.html#launch-the-shell. Use `scrapy shell ./index.html` instead. – BcK Mar 17 '21 at 00:40
17

alecxe's answer is right, but this is the correct way to instantiate a Selector from text in scrapy:

>>> from scrapy.selector import Selector
>>> body = '<html><body><span>good</span></body></html>'
>>> Selector(text=body).xpath('//span/text()').get()
'good'
Aminah Nuraini
Mohsen Mahmoodi
1

You can import Scrapy's native Selector class and pass the HTML string to be parsed as the text argument.

from scrapy.selector import Selector


def get_list_text_from_html_string(html_string):
    html_item = Selector(text=html_string)
    elements = [_li.get() for _li in html_item.css('ul > li::text')]
    return elements

list_html_string = '<ul class="teams">\n<li>Bayern M.</li>\n<li>Palmeiras</li>\n<li>Liverpool</li>\n<li>Flamengo</li></ul>'
print(get_list_text_from_html_string(list_html_string))
# ['Bayern M.', 'Palmeiras', 'Liverpool', 'Flamengo']
Kenny Aires