
I'm a beginner in web scraping, so it's possible I'm asking the wrong question :) To make Scrapy work with Selenium I created this middleware:

from scrapy import signals
from scrapy.http import HtmlResponse
from scrapy.utils.python import to_bytes
from selenium import webdriver
from selenium.common.exceptions import NoSuchWindowException, WebDriverException


class SeleniumDownloaderMiddleware(object):
    def __init__(self):
        self.driver = None

    @classmethod
    def from_crawler(cls, crawler):
        middleware = cls()
        crawler.signals.connect(middleware.spider_opened, signals.spider_opened)
        crawler.signals.connect(middleware.spider_closed, signals.spider_closed)
        return middleware

    def process_request(self, request, spider):
        try:
            # Render the page in the browser so JavaScript is executed
            self.driver.get(request.url)
            body = to_bytes(self.driver.page_source)
            return HtmlResponse(self.driver.current_url, body=body,
                                encoding='utf-8', request=request)
        except (WebDriverException, NoSuchWindowException):
            # The browser crashed: restart it and retry once
            self.spider_opened(spider)
            self.driver.get(request.url)
            body = to_bytes(self.driver.page_source)
            return HtmlResponse(self.driver.current_url, body=body,
                                encoding='utf-8', request=request)

    def spider_opened(self, spider):
        # Forbid the browser from downloading files
        options = webdriver.ChromeOptions()
        options.add_experimental_option("prefs", {
            "download.default_directory": "NUL",
            "download.prompt_for_download": False,
        })
        options.add_argument('--ignore-certificate-errors')
        options.add_argument("--test-type")
        self.driver = webdriver.Chrome(chrome_options=options)

    def spider_closed(self, spider):
        if self.driver:
            # quit() closes every window and ends the session
            self.driver.quit()
            self.driver = None

Now every request from Scrapy goes through this Selenium middleware first, but I want to save PDFs without using the middleware, handling them only in the Scrapy spider:

    def parse(self, response):
        # Collect links to PDF files (both .pdf and .PDF)
        for href in response.css('a[href$=".pdf"]::attr(href)').extract() + \
                response.css('a[href$=".PDF"]::attr(href)').extract():
            url = response.urljoin(href)
            yield Request(url=url, callback=self.save_pdf, priority=1)

    def save_pdf(self, response):
        path = response.url.split('/')[-1]
        self.logger.info('Saving PDF %s', path)
        self.counter += 1
        with open(os.path.join(self.folder, str(self.counter)), 'wb') as file:
            file.write(response.body)

How can I build a Scrapy request so that it bypasses the Selenium middleware?

Ivan Nadin

2 Answers


Consider using the existing scrapy-selenium Scrapy extension. With it, only requests that you explicitly yield as SeleniumRequest go through the browser, so it is quite easy to download specific URLs without Selenium, as sketched below.
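A minimal sketch (assuming scrapy-selenium is installed and its SELENIUM_* settings and downloader middleware are configured; the URL and spider name are placeholders):

    from scrapy import Request, Spider
    from scrapy_selenium import SeleniumRequest


    class MixedSpider(Spider):
        name = 'mixed'

        def start_requests(self):
            # Rendered in the browser by the scrapy-selenium middleware
            yield SeleniumRequest(url='https://example.com/js-page',
                                  callback=self.parse)

        def parse(self, response):
            for href in response.css('a[href$=".pdf"]::attr(href)').extract():
                # A plain Request is ignored by the scrapy-selenium middleware
                # and fetched by Scrapy's normal downloader
                yield Request(response.urljoin(href), callback=self.save_pdf)

        def save_pdf(self, response):
            with open(response.url.split('/')[-1], 'wb') as f:
                f.write(response.body)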

Alternatively, don't use Selenium at all. Often what people starting with Scrapy want to do with Selenium can be achieved without Splash or Selenium, by requesting the underlying data endpoints directly; see the sketch below and the answers to Can scrapy be used to scrape dynamic content from websites that are using AJAX?
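A sketch of that approach (the endpoint URL and the 'items'/'title' keys are hypothetical; find the real request in the browser's network tab):

    import json

    from scrapy import Request, Spider


    class ApiSpider(Spider):
        name = 'api'

        def start_requests(self):
            # Hypothetical JSON endpoint the page loads its data from
            yield Request('https://example.com/api/items?page=1',
                          callback=self.parse_api)

        def parse_api(self, response):
            data = json.loads(response.text)
            for item in data['items']:  # hypothetical key
                yield {'title': item['title']}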

Gallaecio

You can put a condition on request.url in process_request and skip the Selenium processing for matching requests:

if request.url.endswith('.pdf'):
    return None

Returning None passes the request on to the next middleware and ultimately to Scrapy's default downloader, or you can download the file right there and return a Response.
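In the context of the question's middleware, process_request would then start roughly like this (a sketch; the crash-recovery except clause from the question is omitted for brevity):

    def process_request(self, request, spider):
        # Hand PDF requests straight to Scrapy's default downloader
        if request.url.lower().endswith('.pdf'):
            return None
        self.driver.get(request.url)
        body = to_bytes(self.driver.page_source)
        return HtmlResponse(self.driver.current_url, body=body,
                            encoding='utf-8', request=request)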

Sachin