I'd like to use scrapy-splash to fetch both the HTML and a PNG screenshot of the target page, and I need to be able to invoke it programmatically. According to the scrapy-splash docs, specifying

endpoint='render.json'

and passing argument

'png': 1

should result in a response object ('scrapy_splash.response.SplashJsonResponse') with a .data attribute containing the decoded JSON result, including a base64-encoded PNG screenshot of the target page under data['png'].

When the spider (here named 'search') is invoked with

scrapy crawl search

the result is as expected: response.data['png'] contains the PNG data.

However, if it is invoked via scrapy's CrawlerProcess, a different response object is returned: 'scrapy.http.response.html.HtmlResponse'. This object does not have the .data attribute.

Here's the code:

import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy_splash import SplashRequest
import base64

RUN_CRAWLERPROCESS = False

if RUN_CRAWLERPROCESS:
    from crochet import setup
    setup()

class SpiderSearch(scrapy.Spider):
    name = 'search'
    user_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36'

    def start_requests(self):
        urls = ['https://www.google.com/search?q=test', ]
        splash_args = {
            'html': 1,
            'png': 1,
            'width': 1920,
            'wait': 0.5,
            'render_all': 1,
        }
        for url in urls:
            yield SplashRequest(url=url, callback=self.parse, endpoint='render.json', args=splash_args, )

    def parse(self, response):
        print(type(response))
        for result in response.xpath('//div[@class="r"]'): 
            url = str(result.xpath('./a/@href').extract_first())
            yield {
                'url': url
            }

        png_bytes = base64.b64decode(response.data['png'])
        with open('google_results.png', 'wb') as f:
            f.write(png_bytes)

        splash_args = {
            'html': 1,
            'png': 1,
            'width': 1920,
            'wait': 2,
            'render_all': 1,
            'html5_media': 1,
        }
        # queue the subsequent url to be fetched (self.parse_result omitted here for brevity)
        yield SplashRequest(url=url, callback=self.parse_result, endpoint='render.json', args=splash_args)

if RUN_CRAWLERPROCESS:
    runner = CrawlerProcess({'USER_AGENT': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36'})
    #d = runner.crawl(SpiderSearch)
    #d.addBoth(lambda _: reactor.stop())
    #reactor.run()
    runner.crawl(SpiderSearch)
    runner.start()

Restating:

RUN_CRAWLERPROCESS = False 

and invoking by

scrapy crawl search

response is type

class 'scrapy_splash.response.SplashJsonResponse'

But setting

RUN_CRAWLERPROCESS = True 

and running the script with CrawlerProcess results in response of type

class 'scrapy.http.response.html.HtmlResponse'

(P.S. I had some trouble with ReactorNotRestartable, so I adopted crochet as described in this post, which seems to have fixed the problem. I confess I don't understand why, but I assume it is unrelated...)

Any thoughts on how to debug this?

user2081488

1 Answer

If you're running this code as a standalone script, the project's settings module is never loaded, so your crawler does not know about the scrapy-splash middleware (which is what adds the .data attribute you're referencing in .parse).

You can load these settings within your script by calling get_project_settings and passing the result to your CrawlerProcess:

from scrapy.utils.project import get_project_settings

# ...

# load the project's settings.py so the scrapy-splash middleware is configured
project_settings = get_project_settings()
process = CrawlerProcess(project_settings)
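
If the script doesn't live inside a Scrapy project (so there is no settings.py for get_project_settings to pick up), another option is to pass the scrapy-splash settings inline when constructing CrawlerProcess. The sketch below follows the settings listed in the scrapy-splash README; the SPLASH_URL value is an assumption about where your Splash instance is listening:

from scrapy.crawler import CrawlerProcess

# Sketch: enable scrapy-splash without a project settings.py.
# SPLASH_URL is an assumption -- point it at your own Splash instance.
splash_settings = {
    'SPLASH_URL': 'http://localhost:8050',
    'DOWNLOADER_MIDDLEWARES': {
        'scrapy_splash.SplashCookiesMiddleware': 723,
        'scrapy_splash.SplashMiddleware': 725,
        'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
    },
    'SPIDER_MIDDLEWARES': {
        'scrapy_splash.SplashDeserializeMiddleware': 100,
    },
    'DUPEFILTER_CLASS': 'scrapy_splash.SplashAwareDupeFilter',
}

process = CrawlerProcess(splash_settings)
process.crawl(SpiderSearch)
process.start()

Either way, the point is that CrawlerProcess only applies the middlewares it is told about; without them the request is fetched directly rather than through Splash, which is why you get a plain HtmlResponse instead of a SplashJsonResponse with .data.
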
Anthony E