I need to scrape only the first 10-20 internal links from each site during a broad crawl so I don't hammer the web servers, but there are too many domains to list in allowed_domains. I'm asking here because the Scrapy documentation doesn't cover this case and I can't find an answer via Google.
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors.lxmlhtml import LxmlLinkExtractor
from scrapy.item import Item, Field
from urllib.parse import urlparse
class DomainLinks(Item):
    links = Field()


class ScapyProject(CrawlSpider):
    name = 'scapyproject'

    #allowed_domains = []
    start_urls = ['big domains list loaded from database']

    rules = (Rule(LxmlLinkExtractor(allow=()), callback='parse_links', follow=True),)
    def parse_start_url(self, response):
        return self.parse_links(response)

    def parse_links(self, response):
        item = DomainLinks()
        item['links'] = []

        # Reduce the response URL to a bare domain, e.g. "example.com"
        domain = urlparse(response.url).netloc
        if domain.startswith(("www.", "ww2.")):
            domain = domain.split(".", 1)[1]

        links = LxmlLinkExtractor(allow=(), deny=()).extract_links(response)

        # Keep only links that point back to the same domain
        links = [link for link in links if domain in link.url]

        # Filter duplicates and append to item['links']
        for link in links:
            if link.url not in item['links']:
                item['links'].append(link.url)

        return item
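One thing the code above doesn't enforce yet is the 10-20 link cap I mentioned at the top; for now I'm assuming I can just truncate the list right before the return in parse_links (20 is an arbitrary number I picked):

        # Keep only the first 20 unique internal links from each page (arbitrary cap)
        item['links'] = item['links'][:20]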
Is the following list comprehension the best way to filter links without using the allowed_domains list or the LxmlLinkExtractor allow filter? Both of those appear to use regex, which will hurt performance and limit the size of the allowed domains list if every scraped link is regex-matched against every domain in the list.
links = [link for link in links if domain in link.url]
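What I'm considering instead is parsing each link once and comparing hostnames directly, so no regex is involved. A rough, untested sketch of what I mean (the helper names are just placeholders I made up):

from urllib.parse import urlparse

def bare_domain(url):
    # Drop a leading "www." / "ww2." so those variants still count as internal
    netloc = urlparse(url).netloc.lower()
    if netloc.startswith(("www.", "ww2.")):
        netloc = netloc.split(".", 1)[1]
    return netloc

def internal_links(response, links):
    # Plain string equality on hostnames instead of regex matching
    domain = bare_domain(response.url)
    return [link for link in links if bare_domain(link.url) == domain]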
Another problem I am struggling to solve: how do I get the spider to follow only internal links without using the allowed_domains list? Custom middleware?
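To make the custom-middleware idea concrete, this is the kind of spider middleware I have in mind, dropping any extracted request whose host differs from the page it was extracted from (the class name is mine and this is untested):

from urllib.parse import urlparse
from scrapy.http import Request

class InternalLinksOnlyMiddleware:
    # Spider middleware: drop requests whose host differs from the host
    # of the response they were extracted from.
    def process_spider_output(self, response, result, spider):
        page_host = urlparse(response.url).netloc
        for request_or_item in result:
            if isinstance(request_or_item, Request) and \
                    urlparse(request_or_item.url).netloc != page_host:
                continue  # external link, don't follow it
            yield request_or_item

As far as I understand it would need to be enabled via SPIDER_MIDDLEWARES in settings.py, but I don't know whether this is the idiomatic approach or whether a process_links callback on the Rule would be better.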
Thanks