
I need to scrape only the first 10-20 internal links per site during a broad crawl so I don't hammer the web servers, but there are too many domains to list them all in "allowed_domains". I'm asking here because the Scrapy documentation doesn't cover this and I can't find the answer via Google.

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors.lxmlhtml import LxmlLinkExtractor
from scrapy.item import Item, Field

class DomainLinks(Item):
    links = Field()

class ScapyProject(CrawlSpider):
    name = 'scapyproject'

    #allowed_domains = []
    start_urls = ['big domains list loaded from database']

    rules = (Rule(LxmlLinkExtractor(allow=()), callback='parse_links', follow=True),)

    def parse_start_url(self, response):
        return self.parse_links(response)

    def parse_links(self, response):
        item = DomainLinks()
        item['links'] = []

        # crude host extraction: drop the scheme, the path and a leading "www."/"ww2."
        domain = response.url.split("//")[-1].split("/")[0]
        if domain.startswith(("www.", "ww2.")):
            domain = domain.split(".", 1)[1]

        links = LxmlLinkExtractor(allow=(),deny = ()).extract_links(response)
        links = [link for link in links if domain in link.url]

        # Filter duplicates and append to item['links']
        for link in links:
            if link.url not in item['links']:
                item['links'].append(link.url)

        return item

Is the following list comprehension the best way to filter links without using the allowed_domains list or the LxmlLinkExtractor allow filter? Both of those appear to use regexes, which would hurt performance and limit the size of the allowed domains list if every scraped link has to be regex-matched against every domain in the list.

 links = [link for link in links if domain in link.url]

Another problem I am struggling to solve: how do I get the spider to follow only internal links without using the allowed_domains list? Custom middleware?

Thanks

1 Answer


Yes, your list comprehension is a good, maybe even the best, way to handle this.

links = [link for link in links if domain in link.url]

It has the following advantages:

  • it will only accept internal links and ignore links to other whitelisted domains
  • it is quite fast (though your scraper will most probably be network-bound, so this shouldn't matter much)
  • it works without a lookup table or some huge regexes

Besides this, I suggest using urllib's urlparse to extract the domain: Get domain name from URL
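
For reference, here is a minimal sketch of that approach (the helper name get_domain is my own; on Python 2 the import would be from urlparse import urlparse instead):

from urllib.parse import urlparse

def get_domain(url):
    # netloc is the host part of the URL, e.g. "www.example.com:8080"
    host = urlparse(url).netloc.split(":")[0]
    # drop a leading "www." so "www.example.com" and "example.com" compare equal
    if host.startswith("www."):
        host = host[len("www."):]
    return host

# get_domain("https://www.example.com/about/") -> "example.com"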

And if you want to crawl only internal links, you can achieve this by:

  • removing the rules
  • renaming parse_links to parse
  • and manually creating new Requests to follow only the internal links:

    # requires: from scrapy import Request

    def parse(self, response):

        # your code ... removed for brevity ...

        links = [link for link in links if domain in link.url]

        # Filter duplicates and append to item['links']
        for link in links:
            if link.url not in item['links']:
                item['links'].append(link.url)
            # Request expects a URL string, so pass link.url, not the Link object
            yield Request(link.url)

        yield item

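On the "first 10-20 internal links" part of your question (not covered above): one simple option, sketched here with 20 as an arbitrary cap, is to slice the filtered list before looping over it:

        # the 20 below is just an example cap; tune it to your needs
        for link in links[:20]:
            if link.url not in item['links']:
                item['links'].append(link.url)
            yield Request(link.url)
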