I'd like to use scrapy-splash to fetch both the HTML and a PNG screenshot of the target page, and I need to be able to invoke it programmatically. According to the scrapy-splash docs, specifying
endpoint='render.json'
and passing argument
'png': 1
should result in a response object ('scrapy_splash.response.SplashJsonResponse') with a .data attribute containing the decoded JSON, including a base64-encoded PNG screenshot of the target page under the 'png' key.
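To sanity-check what the Splash endpoint itself returns, the /render.json API can also be hit directly, outside Scrapy entirely. This is just an illustrative sketch of that contract; it assumes a stock Splash instance listening on localhost:8050:

import base64
import requests

# Query Splash's HTTP API directly: render.json with png=1 returns a JSON
# document whose 'png' key holds the screenshot as a base64 string, i.e.
# the same dict that scrapy-splash exposes as response.data.
resp = requests.get(
    'http://localhost:8050/render.json',
    params={'url': 'https://www.google.com/search?q=test', 'png': 1, 'wait': 0.5},
)
data = resp.json()
with open('direct_render.png', 'wb') as f:
    f.write(base64.b64decode(data['png']))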
When the spider (here named 'search') is invoked with
scrapy crawl search
the result is as expected, with response.data['png'] containing the PNG data.
However, when the same spider is invoked via Scrapy's CrawlerProcess, a different response object is returned: 'scrapy.http.response.html.HtmlResponse', which has no .data attribute.
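A small diagnostic one could drop into the callback makes the difference visible; this helper is mine, purely for illustration, and is not part of the real spider below:

from scrapy_splash.response import SplashJsonResponse

def check_response(response):
    # Print which response class scrapy-splash handed back and whether
    # the .data dict (and its 'png' key) is available on it.
    print(type(response))
    if isinstance(response, SplashJsonResponse):
        print('data keys:', list(response.data.keys()))
    else:
        print('no .data attribute on', response.__class__.__name__)

Calling check_response(response) at the top of parse() shows SplashJsonResponse with a 'png' key in the first case, and HtmlResponse in the second.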
Here's the code:
import base64

import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy_splash import SplashRequest

RUN_CRAWLERPROCESS = False

if RUN_CRAWLERPROCESS:
    # crochet manages the Twisted reactor so the script can be re-run
    # without raising ReactorNotRestartable (see the p.s. below)
    from crochet import setup
    setup()


class SpiderSearch(scrapy.Spider):
    name = 'search'
    user_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36'

    def start_requests(self):
        urls = ['https://www.google.com/search?q=test', ]
        splash_args = {
            'html': 1,
            'png': 1,
            'width': 1920,
            'wait': 0.5,
            'render_all': 1,
        }
        for url in urls:
            # render.json with png=1 should hand the callback a SplashJsonResponse
            # whose .data dict carries the base64-encoded screenshot
            yield SplashRequest(url=url, callback=self.parse, endpoint='render.json', args=splash_args)

    def parse(self, response):
        print(type(response))
        for result in response.xpath('//div[@class="r"]'):
            url = str(result.xpath('./a/@href').extract_first())
            yield {
                'url': url
            }

            # save the screenshot that Splash returned alongside the html;
            # this is the line that needs response.data to exist
            png_bytes = base64.b64decode(response.data['png'])
            with open('google_results.png', 'wb') as f:
                f.write(png_bytes)

            splash_args = {
                'html': 1,
                'png': 1,
                'width': 1920,
                'wait': 2,
                'render_all': 1,
                'html5_media': 1,
            }
            # cue the subsequent url to be fetched (self.parse_result omitted here for brevity)
            yield SplashRequest(url=url, callback=self.parse_result, endpoint='render.json', args=splash_args)


if RUN_CRAWLERPROCESS:
    runner = CrawlerProcess({
        'USER_AGENT': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36',
    })
    # d = runner.crawl(SpiderSearch)
    # d.addBoth(lambda _: reactor.stop())
    # reactor.run()
    runner.crawl(SpiderSearch)
    runner.start()
Restating:
With
RUN_CRAWLERPROCESS = False
and invoking via
scrapy crawl search
the response is of type
<class 'scrapy_splash.response.SplashJsonResponse'>
But with
RUN_CRAWLERPROCESS = True
and the script run through CrawlerProcess, the response is of type
<class 'scrapy.http.response.html.HtmlResponse'>
(P.S. I had some trouble with ReactorNotRestartable, so I adopted crochet as described in this post, which seems to have fixed that problem. I confess I don't understand why, but I assume it is unrelated...)
Any thoughts on how to debug this?
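One thing I plan to try next, purely as a guess, is handing CrawlerProcess the project's settings via get_project_settings() instead of a bare dict, in case the script isn't picking up the same settings that scrapy crawl uses:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

# Same runner as above, but seeded from the project's settings.py rather
# than a one-key dict (just an experiment; I don't know yet if it matters).
runner = CrawlerProcess(get_project_settings())
runner.crawl(SpiderSearch)
runner.start()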