
I'm working on a crawler that gets a list of domains and, for every domain, counts the number of URLs on that site.

I use CrawlSpider for this purpose, but there is a problem.

When I start crawling, it seems to send multiple requests to multiple domains, but after some time (about a minute) it ends up crawling only one page (domain).

SETTINGS

CONCURRENT_REQUESTS = 100
CONCURRENT_REQUESTS_PER_DOMAIN = 3
REACTOR_THREADPOOL_MAXSIZE = 20
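
For clarity, this is how those values sit in settings.py. As far as I understand, CONCURRENT_REQUESTS_PER_DOMAIN only caps how many requests run in parallel against a single domain; it does not control which domain's requests get picked from the queue next:

# settings.py (sketch) - the values used for this crawl
CONCURRENT_REQUESTS = 100              # global cap on in-flight requests
CONCURRENT_REQUESTS_PER_DOMAIN = 3     # cap per single domain
REACTOR_THREADPOOL_MAXSIZE = 20        # thread pool for DNS resolution etc.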

Here you can see how many URLs have been scraped for each domain:

[screenshot: scraped URL counts per domain]

After 7 minutes - as you can see, it focuses only on the first domain and forgets about the others:

[screenshot: after 7 minutes, only the first domain's count is growing]

If Scrapy focuses on just one domain at a time, it obviously slows down the whole process. I would like to send requests to multiple domains within a short time.

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

# Domain is a Django model of mine (import omitted) with the fields
# name, main_url and number_of_urls used below.


class MainSpider(CrawlSpider):
    name = 'main_spider'
    allowed_domains = []

    rules = (
        Rule(LinkExtractor(), callback='parse_item', follow=True),
    )

    def start_requests(self):
        # One start request per domain; the Domain object travels in meta
        # so that parse_item can update its counter at any depth.
        for d in Domain.objects.all():
            self.allowed_domains.append(d.name)
            yield scrapy.Request(d.main_url, callback=self.parse_item, meta={'domain': d})

    def parse_start_url(self, response):
        # Must return the generator, otherwise the requests produced
        # by parse_item are silently dropped.
        return self.parse_item(response)

    def parse_item(self, response):
        d = response.meta['domain']
        d.number_of_urls += 1
        d.save()
        # Only follow links that stay on the same domain as the start URL.
        extractor = LinkExtractor(allow_domains=d.name)
        links = extractor.extract_links(response)
        for link in links:
            yield scrapy.Request(link.url, callback=self.parse_item, meta={'domain': d})
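
For completeness, Domain is a Django model; a minimal sketch of what it looks like (the field names come from the spider above, the field types are assumptions):

# models.py (hypothetical sketch of the Django model the spider uses)
from django.db import models

class Domain(models.Model):
    name = models.CharField(max_length=255)          # e.g. "example.com"
    main_url = models.URLField()                     # start URL for the crawl
    number_of_urls = models.IntegerField(default=0)  # incremented in parse_item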

It seems to focus only on the first domain until it has scraped it completely.
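
One thing I found in the Scrapy FAQ is that the scheduler uses LIFO queues by default, i.e. it crawls depth-first, which would explain why links from the first domain keep jumping the queue. A sketch of the breadth-first configuration from the docs (I have not verified whether it fixes my case):

# settings.py (sketch) - switch the scheduler to FIFO queues so the crawl
# proceeds breadth-first and shallow requests from all domains are
# interleaved instead of diving deep into the first domain
DEPTH_PRIORITY = 1
SCHEDULER_DISK_QUEUE = 'scrapy.squeues.PickleFifoDiskQueue'
SCHEDULER_MEMORY_QUEUE = 'scrapy.squeues.FifoMemoryQueue'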

Milano
  • Why are you manually extracting the links in your parse_item method? This is already done for you by the CrawlSpider internally, following the rule you defined. Also, is the dupe filter on? Without having the [MCVE](https://stackoverflow.com/help/mcve) and the actual domain, it's hard to say more. – bosnjak Feb 14 '18 at 20:51
  • @bosnjak I can't just rely on the rules because I need to track the domain object (start URL) for every request at any depth. I suppose the queue is at this moment filled with many URLs from one domain, so URLs from other domains can't be requested, which is very inefficient. – Milano Feb 14 '18 at 22:18
  • I think in this case, after adding the start_requests method and the parse_item callback, the rules attribute doesn't have any effect. – Milano Feb 14 '18 at 22:22
  • You don't have to track the domain; you can always extract it from response.url (a minimal sketch of this follows the comments). Check [this answer](https://stackoverflow.com/questions/9626535/get-domain-name-from-url) – bosnjak Feb 15 '18 at 08:24
  • I want to save some time. And what if I use a rule? Will it request URLs in a more efficient way? – Milano Feb 15 '18 at 08:31
  • Probably the same, just pick one and stick with it. – bosnjak Feb 15 '18 at 09:26
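
A minimal sketch of what bosnjak suggests, deriving the domain from response.url instead of passing the Domain object through meta (the Domain.objects.get lookup and the www-stripping are assumptions about how the names are stored):

from urllib.parse import urlparse

def domain_from_url(url):
    """Return the bare host of a URL, e.g. 'http://www.example.com/a' -> 'example.com'."""
    host = urlparse(url).netloc.split(':')[0]
    return host[4:] if host.startswith('www.') else host

# inside parse_item, instead of response.meta['domain']:
#     d = Domain.objects.get(name=domain_from_url(response.url))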
