I have a raw HTML string that I want to convert to a Scrapy HTML response object so that I can use the `css` and `xpath` selectors, similar to Scrapy's `response`. How can I do it?
30
3 Answers
42
First of all, if it is for debugging or testing purposes, you can use the Scrapy shell:
$ cat index.html
<div id="test">
Test text
</div>
$ scrapy shell ./index.html
>>> response.xpath('//div[@id="test"]/text()').extract()[0].strip()
'Test text'
There are different objects available in the shell during the session, like `response` and `request`.
Or, you can instantiate an `HtmlResponse` class and provide the HTML string in `body`:
>>> from scrapy.http import HtmlResponse
>>> response = HtmlResponse(url="my HTML string", body='<div id="test">Test text</div>', encoding='utf-8')
>>> response.xpath('//div[@id="test"]/text()').extract()[0].strip()
'Test text'

Umair Ayub

alecxe
- Thanks alecxe, I am using Selenium because of some ajaxiness. I want to convert driver.page_source into the same object as response so that I can reuse some extractors (using css and xpath selectors) instead of having to resort to lxml. I think your second option is the one I need. – yayu Dec 05 '14 at 20:14
- @yayu then you probably don't need to create an HTML response but rather a `Selector`; see http://stackoverflow.com/questions/18836286/scraping-with-scrapy-and-selenium and http://stackoverflow.com/questions/17975471/selenium-with-scrapy-for-dynamic-page. Might help. Thanks. – alecxe Dec 05 '14 at 20:16
- @yayu and, as a side note, there can be a point where you have much more selenium than scrapy in the project. At that point, think about whether there is any point in using scrapy at all. – alecxe Dec 05 '14 at 20:17
- @yayu also [`scrapyjs`](https://github.com/scrapinghub/scrapyjs) might be worth trying; maybe you could avoid using `selenium`. – alecxe Dec 05 '14 at 20:18
- @alecxe is there any way to set the meta attribute for this `response` object? I know `response.meta` is an alias for the `request.meta` object. But since there is no `request` associated with this `response`, is there any workaround? – Kashyap Oct 08 '17 at 04:15
- As of today, the HtmlResponse object requires another argument, encoding. You can do it like: HtmlResponse(url='http://scrapy.org', body=u'some body', encoding='utf-8') – Mehmet Kurtipek May 08 '18 at 21:38
- On Linux, `scrapy shell index.html` does not work, as documented here: https://docs.scrapy.org/en/latest/topics/shell.html#launch-the-shell. Use `scrapy shell ./index.html` instead. – BcK Mar 17 '21 at 00:40
17
alecxe's answer is right, but this is the correct way to instantiate a `Selector` from `text` in Scrapy:
>>> from scrapy.selector import Selector
>>> body = '<html><body><span>good</span></body></html>'
>>> Selector(text=body).xpath('//span/text()').get()
'good'

Aminah Nuraini

Mohsen Mahmoodi
1
You can import Scrapy's native `Selector` class and pass the HTML string as the `text` argument to be parsed.
from scrapy.selector import Selector

def get_list_text_from_html_string(html_string):
    # Parse the raw HTML string into a Selector
    html_item = Selector(text=html_string)
    # Collect the text node of each <li> that is a direct child of a <ul>
    elements = [_li.get() for _li in html_item.css('ul > li::text')]
    return elements

list_html_string = '<ul class="teams">\n<li>Bayern M.</li>\n<li>Palmeiras</li>\n<li>Liverpool</li>\n<li>Flamengo</li></ul>'
print(get_list_text_from_html_string(list_html_string))
# prints: ['Bayern M.', 'Palmeiras', 'Liverpool', 'Flamengo']

Kenny Aires