
I want to get all external links from a given website using Scrapy. With the following code the spider crawls the external links as well:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor
from myproject.items import someItem

class someSpider(CrawlSpider):
    name = 'crawltest'
    allowed_domains = ['someurl.com']
    start_urls = ['http://www.someurl.com/']

    rules = (Rule(LinkExtractor(), callback="parse_obj", follow=True),)

    def parse_obj(self, response):
        item = someItem()
        item['url'] = response.url
        return item

What am I missing? Doesn't "allowed_domains" prevent the external links from being crawled? If I set "allow_domains" for the LinkExtractor it does not extract the external links. Just to clarify: I want to crawl internal links but extract external links. Any help appreciated!

sboss
  • If I enable the OffsiteMiddleware the links are not crawled but also not extracted. At least then I can see "Filtered offsite request to 'www.externaldomain'. Surely I'm missing something trivial here? – sboss Jan 15 '15 at 13:37
  • just to understand: do you want to have the list of all external links for a given website ? – aberna Jan 15 '15 at 14:18
  • Yes that is correct! – sboss Jan 15 '15 at 14:19

3 Answers


You can also use a link extractor inside the callback to pull all the links from each page as it is parsed.

The link extractor will filter the links for you. In this example the link extractor is told to deny links in the allowed domain, so it only extracts outside links.

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.lxmlhtml import LxmlLinkExtractor
from myproject.items import someItem

class someSpider(CrawlSpider):
    name = 'crawltest'
    allowed_domains = ['someurl.com']
    start_urls = ['http://www.someurl.com/']

    rules = (Rule(LxmlLinkExtractor(allow=()), callback='parse_obj', follow=True),)

    def parse_obj(self, response):
        # Yield one item per link whose URL falls outside the allowed domain
        for link in LxmlLinkExtractor(allow=(), deny=self.allowed_domains).extract_links(response):
            item = someItem()
            item['url'] = link.url
            yield item
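
The someItem class imported above isn't shown in the question or the answer; presumably it is just a one-field Item. A minimal sketch of what that items module might look like (this definition is an assumption, not part of the original answer):

# Hypothetical myproject/items.py matching the import above:
# a single-field Item that only stores the extracted URL.
from scrapy.item import Item, Field

class someItem(Item):
    url = Field()
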
12Ryan12

Updated code based on 12Ryan12's answer:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors.lxmlhtml import LxmlLinkExtractor
from scrapy.item import Item, Field

class MyItem(Item):
    url = Field()


class someSpider(CrawlSpider):
    name = 'crawltest'
    allowed_domains = ['someurl.com']
    start_urls = ['http://www.someurl.com/']

    rules = (Rule(LxmlLinkExtractor(allow=()), callback='parse_obj', follow=True),)

    def parse_obj(self, response):
        item = MyItem()
        item['url'] = []
        for link in LxmlLinkExtractor(allow=(), deny=self.allowed_domains).extract_links(response):
            item['url'].append(link.url)
        return item
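
For reference, a minimal sketch (not part of the answer) of one way to run this spider from a script and write the collected external links to a JSON feed; the output filename is illustrative and the feed settings are the FEED_FORMAT/FEED_URI style used in Scrapy 1.x:

from scrapy.crawler import CrawlerProcess

# Assumes the someSpider class defined above is importable from this script.
process = CrawlerProcess(settings={
    'FEED_FORMAT': 'json',              # export collected items as JSON
    'FEED_URI': 'external_links.json',  # illustrative output path
})
process.crawl(someSpider)
process.start()  # blocks until the crawl has finished
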
Ohad Zadok

A solution would be to make use of a process_links function in the Rule together with the SgmlLinkExtractor. Documentation is here: http://doc.scrapy.org/en/latest/topics/link-extractors.html

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

class testSpider(CrawlSpider):
    name = "test"
    bot_name = 'test'
    allowed_domains = ["news.google.com"]
    start_urls = ["https://news.google.com/"]
    rules = (
        Rule(SgmlLinkExtractor(allow_domains=()), callback='parse_items', process_links="filter_links", follow=True),
    )

    def filter_links(self, links):
        # Print every extracted link that points outside the allowed domain,
        # then hand the full list back so the internal links keep being crawled.
        for link in links:
            if self.allowed_domains[0] not in link.url:
                print link.url

        return links

    def parse_items(self, response):
        ### ...
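
If you need the external links as data rather than console output, one possible variation (not part of the answer above; the external_links attribute name is just illustrative) is to accumulate them on the spider and report them when the crawl closes:

    def __init__(self, *args, **kwargs):
        super(testSpider, self).__init__(*args, **kwargs)
        # set of offsite URLs seen while crawling the allowed domain
        self.external_links = set()

    def filter_links(self, links):
        for link in links:
            if self.allowed_domains[0] not in link.url:
                self.external_links.add(link.url)
        return links

    def closed(self, reason):
        # closed() is called by Scrapy when the spider finishes
        for url in sorted(self.external_links):
            self.log("external link: %s" % url)
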
aberna
  • @sboss I noticed you accepted and afterwards downgraded my proposed solution. The code is working fine, did you notice any other issue? – aberna Jan 19 '15 at 12:55
  • Hi aberna, sorry for the downgrade. I found 12Ryan12's reply more elegant as it enables me to use the built-in duplicate filters etc. I appreciate the reply though! – sboss Jan 21 '15 at 10:56