scrapy: convert html string to HtmlResponse object

Question

I have a raw HTML string that I want to convert to a Scrapy HtmlResponse object so that I can use the css and xpath selectors on it, just as I would on a regular Scrapy response. How can I do it?

score 42 · Accepted Answer · edited May 14 '19 at 09:11

First of all, if it is for debugging or testing purposes, you can use the Scrapy shell:

$ cat index.html
<div id="test">
    Test text
</div>

$ scrapy shell ./index.html
>>> response.xpath('//div[@id="test"]/text()').extract()[0].strip()
u'Test text'

There are different objects available in the shell during the session, like response and request.

Or, you can instantiate the HtmlResponse class directly and pass the HTML string as body:

>>> from scrapy.http import HtmlResponse
>>> response = HtmlResponse(url="my HTML string", body='<div id="test">Test text</div>', encoding='utf-8')
>>> response.xpath('//div[@id="test"]/text()').extract()[0].strip()
u'Test text'

edited May 14 '19 at 09:11

Umair Ayub


answered Dec 05 '14 at 20:04

alecxe


thanks alecxe, I am using Selenium because of some ajaxiness. I want to convert driver.page_source into the same kind of object as response so that I can reuse some extractors (using css and xpath selectors) instead of having to resort to lxml. I think your second option is the one I need. – yayu Dec 05 '14 at 20:14
@yayu then, you probably don't need to create an HTML Response, but, rather a `Selector`, see http://stackoverflow.com/questions/18836286/scraping-with-scrapy-and-selenium and http://stackoverflow.com/questions/17975471/selenium-with-scrapy-for-dynamic-page. Might help. Thanks. – alecxe Dec 05 '14 at 20:16
@yayu and, as a side note, there can be a point where you would have much more selenium than scrapy in the project - at that point, think about whether there is any point in scrapy at all. – alecxe Dec 05 '14 at 20:17
@yayu also [`scrapyjs`](https://github.com/scrapinghub/scrapyjs) might be worth trying - maybe you could avoid using `selenium`. – alecxe Dec 05 '14 at 20:18
@alecxe is there any way to set the meta attribute for this `response` object? I know `response.meta` is an alias for `request.meta`. But since there is no `request` associated with this `response`, is there any workaround? – Kashyap Oct 08 '17 at 04:15
as of today, HtmlResponse object requires another argument, encoding. You can do it like: HtmlResponse(url='http://scrapy.org', body=u'some body', encoding='utf-8') – Mehmet Kurtipek May 08 '18 at 21:38
On Linux, `scrapy shell index.html` does not work; this is documented at https://docs.scrapy.org/en/latest/topics/shell.html#launch-the-shell. Use `scrapy shell ./index.html` instead. – BcK Mar 17 '21 at 00:40

score 17 · Answer 2 · edited Sep 22 '20 at 11:57

alecxe's answer is right, but this is the correct way to instantiate a Selector from text in scrapy:

>>> from scrapy.selector import Selector
>>> body = '<html><body><span>good</span></body></html>'
>>> Selector(text=body).xpath('//span/text()').get()
'good'

edited Sep 22 '20 at 11:57

Aminah Nuraini


answered Nov 04 '19 at 09:24

Mohsen Mahmoodi


Kenny Aires · Answer 3 · 2021-02-10T16:58:30.450

You can import Scrapy's native Selector class and pass the HTML string to it as the text argument to be parsed.

from scrapy.selector import Selector


def get_list_text_from_html_string(html_string):
    # Parse the raw HTML string directly; no response object is needed.
    html_item = Selector(text=html_string)
    # Collect the text of every <li> that is a direct child of a <ul>.
    elements = html_item.css('ul > li::text').getall()
    return elements

list_html_string = '<ul class="teams">\n<li>Bayern M.</li>\n<li>Palmeiras</li>\n<li>Liverpool</li>\n<li>Flamengo</li></ul>'
print(get_list_text_from_html_string(list_html_string))
# prints: ['Bayern M.', 'Palmeiras', 'Liverpool', 'Flamengo']
