
I have the following scrapy CrawlSpider:

import logger as lg
from scrapy.crawler import CrawlerProcess
from scrapy.http import Response
from scrapy.spiders import CrawlSpider, Rule
from scrapy_splash import SplashTextResponse
from urllib.parse import urlencode
from scrapy.linkextractors import LinkExtractor
from scrapy.http import HtmlResponse

logger = lg.get_logger("oddsportal_spider")


class SeleniumScraper(CrawlSpider):
    
    name = "splash"
    
    custom_settings = {
        "USER_AGENT": "*",
        "LOG_LEVEL": "WARNING",
        "DOWNLOADER_MIDDLEWARES": {
            'scraper_scrapy.odds.middlewares.SeleniumMiddleware': 543,
        },
    }

    handle_httpstatus_list = [301]
    
    start_urls = ["https://www.oddsportal.com/tennis/results/"]
    
    rules = (
        Rule(
            LinkExtractor(allow="/atp-buenos-aires/results/"),
            callback="parse_tournament",
            follow=True,
        ),
        Rule(
            LinkExtractor(
                allow="/tennis/",
                restrict_xpaths=("//td[@class='name table-participant']//a"),
            ),
            callback="parse_match",
        ),
    )

    def parse_tournament(self, response: Response):
        logger.info(f"Parsing tournament - {response.url}")
    
    def parse_match(self, response: Response):
        logger.info(f"Parsing match - {response.url}")


process = CrawlerProcess()
process.crawl(SeleniumScraper)
process.start()

The Selenium middleware is as follows:

from pathlib import Path

import logger as lg
from scrapy import signals
from scrapy.http import HtmlResponse
from selenium import webdriver

logger = lg.get_logger("oddsportal_spider")


class SeleniumMiddleware:

    @classmethod
    def from_crawler(cls, crawler):
        middleware = cls()
        crawler.signals.connect(middleware.spider_opened, signals.spider_opened)
        crawler.signals.connect(middleware.spider_closed, signals.spider_closed)
        return middleware

    def process_request(self, request, spider):
        logger.debug(f"Selenium processing request - {request.url}")
        self.driver.get(request.url)
        return HtmlResponse(
            request.url,
            body=self.driver.page_source,
            encoding='utf-8',
            request=request,
        )

    def spider_opened(self, spider):
        options = webdriver.FirefoxOptions()
        options.add_argument("--headless")
        self.driver = webdriver.Firefox(
            options=options,
            executable_path=Path("/opt/geckodriver/geckodriver"),
        )

    def spider_closed(self, spider):
        self.driver.quit()

End to end this takes around a minute for roughly 50 pages. To try to speed things up by taking advantage of multiple threads and JavaScript rendering, I've implemented the following scrapy_splash spider:

class SplashScraper(CrawlSpider):
    
    name = "splash"
    
    custom_settings = {
        "USER_AGENT": "*",
        "LOG_LEVEL": "WARNING",
        "SPLASH_URL": "http://localhost:8050",
        "DOWNLOADER_MIDDLEWARES": {
            'scrapy_splash.SplashCookiesMiddleware': 723,
            'scrapy_splash.SplashMiddleware': 725,
            'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
        },
        "SPIDER_MIDDLEWARES": {'scrapy_splash.SplashDeduplicateArgsMiddleware': 100},
        "DUPEFILTER_CLASS": 'scrapy_splash.SplashAwareDupeFilter',
        "HTTPCACHE_STORAGE": 'scrapy_splash.SplashAwareFSCacheStorage',
    }

    handle_httpstatus_list = [301]
    
    start_urls = ["https://www.oddsportal.com/tennis/results/"]
    
    rules = (
        Rule(
            LinkExtractor(allow="/atp-buenos-aires/results/"),
            callback="parse_tournament",
            process_request="use_splash",
            follow=True,
        ),
        Rule(
            LinkExtractor(
                allow="/tennis/",
                restrict_xpaths=("//td[@class='name table-participant']//a"),
            ),
            callback="parse_match",
            process_request="use_splash",
        ),
    )

    def process_links(self, links): 
        for link in links: 
            link.url = "http://localhost:8050/render.html?" + urlencode({'url' : link.url}) 
        return links

    def _requests_to_follow(self, response):
        if not isinstance(response, (HtmlResponse, SplashTextResponse)):
            return
        seen = set()
        for rule_index, rule in enumerate(self._rules):
            links = [lnk for lnk in rule.link_extractor.extract_links(response)
                     if lnk not in seen]
            for link in rule.process_links(links):
                seen.add(link)
                request = self._build_request(rule_index, link)
                yield rule.process_request(request, response)

    def use_splash(self, request, response):
        request.meta.update(splash={'endpoint': 'render.html'})
        return request

    def parse_tournament(self, response: Response):
        logger.info(f"Parsing tournament - {response.url}")
    
    def parse_match(self, response: Response):
        logger.info(f"Parsing match - {response.url}")

However, this takes about the same amount of time. I was hoping to see a big increase in speed :(

I've tried playing around with different DOWNLOAD_DELAY settings but that hasn't made things any faster.

All the concurrency settings are left at their defaults.

Any ideas on if/how I'm going wrong?

Jossy

1 Answer


Taking a stab at an answer here with no experience of the libraries.

It looks like Scrapy crawlers themselves are single-threaded. To get multi-threaded behavior you need to configure your application differently or write code that makes it behave multi-threaded. It sounds like you've already tried this, so it's probably not news to you, but make sure you have configured the CONCURRENT_REQUESTS and REACTOR_THREADPOOL_MAXSIZE settings.

https://docs.scrapy.org/en/latest/topics/settings.html?highlight=thread#reactor-threadpool-maxsize
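
For concreteness, here is a minimal sketch of what bumping those settings might look like in the spider's custom_settings (the specific numbers are illustrative guesses, not tuned recommendations):

custom_settings = {
    # ...existing settings from the question...
    "CONCURRENT_REQUESTS": 32,               # Scrapy default is 16
    "CONCURRENT_REQUESTS_PER_DOMAIN": 32,    # Scrapy default is 8
    "REACTOR_THREADPOOL_MAXSIZE": 20,        # Scrapy default is 10
    "DOWNLOAD_DELAY": 0,                     # already the default; no artificial delay
}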

I can't imagine there is much CPU work going on in the crawling process, so I doubt it's a GIL issue.

Excluding GIL as an option there are two possibilities here:

  1. Your crawler is not actually multi-threaded. This may be because you are missing some setup or configuration that would make it so, e.g. you may have set the configuration correctly but your crawler is written in a way that processes requests for URLs synchronously instead of submitting them to a queue.

To test this, create a global object and store a counter on it. Each time your crawler starts a request, increment the counter; each time it finishes a request, decrement it. Then run a thread that prints the counter every second. If the counter value is always 1, then you are still running synchronously.

# global_state.py

GLOBAL_STATE = {"counter": 0}

# middleware.py

from global_state import GLOBAL_STATE

class SeleniumMiddleware:

    def process_request(self, request, spider):
        GLOBAL_STATE["counter"] += 1
        self.driver.get(request.url)
        GLOBAL_STATE["counter"] -= 1

        ...

# main.py

from global_state import GLOBAL_STATE
import threading
import time

def main():
    # Run the watcher as a daemon thread so it doesn't keep the process alive
    gst = threading.Thread(target=gs_watcher, daemon=True)
    gst.start()

    # Start your app here


def gs_watcher():
    while True:
        print(f"Concurrent requests: {GLOBAL_STATE['counter']}")
        time.sleep(1)

  2. The site you are crawling is rate limiting you.

To test this, run the application multiple times. If you go from 50 req/s to 25 req/s per application instance, then you are being rate limited. To skirt around this, use a VPN to hop around it.
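
As a side note, and this is an assumption about what's easiest rather than something from the question, Scrapy's built-in LogStats extension already reports a pages-per-minute rate, so one low-effort way to compare runs is to lower the log level and shorten the stats interval:

custom_settings = {
    # ...existing settings...
    "LOG_LEVEL": "INFO",           # LogStats logs at INFO, so WARNING would hide it
    "LOGSTATS_INTERVAL": 10.0,     # report "Crawled N pages (at X pages/min)" every 10 seconds
}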


If after that you find that you are running concurrent requests, and you are not being rate limited, then there is something funky going on in the libraries. Try removing chunks of code until you get to the bare minimum of what you need to crawl. If you've gotten to the absolute bare-minimum implementation and it's still slow, then you have a minimal reproducible example and can get much better, more informed help.
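
As a rough illustration of that last step, a bare-minimum version of your spider (hypothetical and untested, stripped of the Splash/Selenium pieces) could look something like the sketch below. If this baseline is fast, the slowdown lives in the rendering layer; if it is still slow, the bottleneck is elsewhere.

from scrapy.crawler import CrawlerProcess
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class BareScraper(CrawlSpider):
    # Same start point and first rule as the question, but plain HTTP only
    name = "bare"
    custom_settings = {"LOG_LEVEL": "WARNING"}
    start_urls = ["https://www.oddsportal.com/tennis/results/"]
    rules = (
        Rule(
            LinkExtractor(allow="/atp-buenos-aires/results/"),
            callback="parse_page",
            follow=True,
        ),
    )

    def parse_page(self, response):
        self.logger.info(f"Parsed {response.url}")


process = CrawlerProcess()
process.crawl(BareScraper)
process.start()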

micah
  • Thanks for giving this a go :) Tried this and the `SeleniumScraper` definitely runs sequentially (as expected) however `SplashScraper` gives the following error: `MemoryError: Cannot allocate write+execute memory for ffi.callback(). You might be running on a system that prevents this. For more information, see https://cffi.readthedocs.io/en/latest/using.html#callbacks` – Jossy Jan 22 '22 at 17:03
  • Hang on - that's happening without your code. I have changed to an M1 Mac recently and there seem to be a few reports of this error. I'll try on my old machine... – Jossy Jan 22 '22 at 17:20
  • According to the docs the `cffi callback` exists for backwards compatibility and shouldn't be used. I'd check if a newer version of the selenium lib uses the preferred method. Otherwise you may need to set an env variable to allow for insecure write+execute. – micah Jan 22 '22 at 17:30
  • I'm assuming you definitely need selenium for your project. But it actually may be selenium that is blocking. https://stackoverflow.com/questions/30808606/can-selenium-use-multi-threading-in-one-browser ~ Can you use multiple selenium web drivers at once? If you don't need to execute javascript in a headless browser, you may be better off using `requests` instead. – micah Jan 22 '22 at 17:34
  • Hey. So `SplashScraper` not only works fine on my Intel Mac but it also runs a lot faster (despite having half the number of cores). There definitely seems to have been an issue with M1 Macs for standard scrapy (https://stackoverflow.com/q/67556847/11277108) but this looks like it's been resolved. Perhaps the issue hasn't been resolved within scrapy_splash yet. I'm actually using scrapy_splash as the alternative to Selenium as I believe you can only run this middleware synchronously. – Jossy Jan 22 '22 at 18:09