
I'm working on a crawler that gets a list of domains and, for every domain, counts the number of URLs on that site.

I use CrawlSpider for this purpose, but there is a problem.

When I start crawling, it seems to send multiple requests to multiple domains, but after some time (about a minute) it ends up crawling only one page (domain).

SETTINGS

CONCURRENT_REQUESTS = 100
CONCURRENT_REQUESTS_PER_DOMAIN = 3
REACTOR_THREADPOOL_MAXSIZE = 20
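
For clarity, this is how those values sit in settings.py. As far as I understand, CONCURRENT_REQUESTS_PER_DOMAIN only caps how many requests run in parallel against a single domain; it does not control which domain's requests get picked from the queue next:

# settings.py (sketch) - the values used for this crawl
CONCURRENT_REQUESTS = 100              # global cap on in-flight requests
CONCURRENT_REQUESTS_PER_DOMAIN = 3     # cap per single domain
REACTOR_THREADPOOL_MAXSIZE = 20        # thread pool for DNS resolution etc.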

Here you can see how many URLs have been scraped for each domain:

[screenshot: scraped URL counts per domain]

After 7 minutes - as you can see, it focuses only on the first domain and forgets about the others:

[screenshot: after 7 minutes, only the first domain's count is growing]

If Scrapy focuses on just one domain at a time, it obviously slows down the whole process. I would like to send requests to multiple domains within a short time.

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

# Domain is a Django model of mine (import omitted) with the fields
# name, main_url and number_of_urls used below.


class MainSpider(CrawlSpider):
    name = 'main_spider'
    allowed_domains = []

    rules = (
        Rule(LinkExtractor(), callback='parse_item', follow=True),
    )

    def start_requests(self):
        # One start request per domain; the Domain object travels in meta
        # so that parse_item can update its counter at any depth.
        for d in Domain.objects.all():
            self.allowed_domains.append(d.name)
            yield scrapy.Request(d.main_url, callback=self.parse_item, meta={'domain': d})

    def parse_start_url(self, response):
        # Must return the generator, otherwise the requests produced
        # by parse_item are silently dropped.
        return self.parse_item(response)

    def parse_item(self, response):
        d = response.meta['domain']
        d.number_of_urls += 1
        d.save()
        # Only follow links that stay on the same domain as the start URL.
        extractor = LinkExtractor(allow_domains=d.name)
        links = extractor.extract_links(response)
        for link in links:
            yield scrapy.Request(link.url, callback=self.parse_item, meta={'domain': d})
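
For completeness, Domain is a Django model; a minimal sketch of what it looks like (the field names come from the spider above, the field types are assumptions):

# models.py (hypothetical sketch of the Django model the spider uses)
from django.db import models

class Domain(models.Model):
    name = models.CharField(max_length=255)          # e.g. "example.com"
    main_url = models.URLField()                     # start URL for the crawl
    number_of_urls = models.IntegerField(default=0)  # incremented in parse_item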

It seems to focus only on the first domain until it has scraped it completely.
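
One thing I found in the Scrapy FAQ is that the scheduler uses LIFO queues by default, i.e. it crawls depth-first, which would explain why links from the first domain keep jumping the queue. A sketch of the breadth-first configuration from the docs (I have not verified whether it fixes my case):

# settings.py (sketch) - switch the scheduler to FIFO queues so the crawl
# proceeds breadth-first and shallow requests from all domains are
# interleaved instead of diving deep into the first domain
DEPTH_PRIORITY = 1
SCHEDULER_DISK_QUEUE = 'scrapy.squeues.PickleFifoDiskQueue'
SCHEDULER_MEMORY_QUEUE = 'scrapy.squeues.FifoMemoryQueue'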

Milano
  • Why are you manually extracting the links in your parse_item method? This is already done for you by the CrawlSpider internally, following the rule you defined. Also, is the dupe filter on? Without having the [MCVE](https://stackoverflow.com/help/mcve) and the actual domain, it's hard to say more. – bosnjak Feb 14 '18 at 20:51
  • @bosnjak I can't just rely on the rules because I need to track the domain object (start URL) for every request at any depth. I suppose the queue is at this moment filled with many URLs from one domain, so URLs from other domains can't be requested, which is very inefficient. – Milano Feb 14 '18 at 22:18
  • I think in this case, after adding the start_requests method and the parse_item callback, the rules attribute doesn't have any effect. – Milano Feb 14 '18 at 22:22
  • You don't have to track the domain; you can always extract it from response.url (a minimal sketch of this follows the comments). Check [this answer](https://stackoverflow.com/questions/9626535/get-domain-name-from-url) – bosnjak Feb 15 '18 at 08:24
  • I want to save some time. And what if I use a rule? Will it request URLs in a more efficient way? – Milano Feb 15 '18 at 08:31
  • Probably the same, just pick one and stick with it. – bosnjak Feb 15 '18 at 09:26
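
A minimal sketch of what bosnjak suggests, deriving the domain from response.url instead of passing the Domain object through meta (the Domain.objects.get lookup and the www-stripping are assumptions about how the names are stored):

from urllib.parse import urlparse

def domain_from_url(url):
    """Return the bare host of a URL, e.g. 'http://www.example.com/a' -> 'example.com'."""
    host = urlparse(url).netloc.split(':')[0]
    return host[4:] if host.startswith('www.') else host

# inside parse_item, instead of response.meta['domain']:
#     d = Domain.objects.get(name=domain_from_url(response.url))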
