
I am writing a Scrapy spider that crawls a set of URLs once per day. However, some of these websites are very big, so I cannot crawl the full site daily, nor would I want to generate the massive traffic necessary to do so.

An old question (here) asked something similar. However, the upvoted response simply points to a code snippet (here), which seems to require something of the request instance, though that is not explained in the response, nor on the page containing the code snippet.

I'm trying to make sense of this but find middleware a bit confusing. A complete example of a scraper that can be run multiple times without re-scraping URLs would be very useful, whether or not it uses the linked middleware.

I've posted code below to get the ball rolling, but I don't necessarily need to use this middleware. Any Scrapy spider that can crawl daily and extract only new URLs will do. Obviously one solution is to just write out a dictionary of scraped URLs and then check whether each new URL is or isn't in it, but that seems very slow/inefficient; a rough sketch of that approach follows.
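
For concreteness, here is the kind of thing I mean by writing out the scraped URLs — a rough, untested sketch that keeps a set of seen URLs in a JSON file between runs (the file name and spider name are just placeholders):

import json
import os

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

SEEN_FILE = "seen_urls.json"  # placeholder; any persistent store would do


class DailyNewspaperSpider(CrawlSpider):
    name = "newspaper_daily"
    allowed_domains = ["cnn.com"]
    start_urls = ["http://www.cnn.com/"]

    rules = (
        Rule(LinkExtractor(), callback="parse_item", follow=True),
    )

    def __init__(self, *args, **kwargs):
        super(DailyNewspaperSpider, self).__init__(*args, **kwargs)
        # Load the URLs seen on previous runs, if any.
        if os.path.exists(SEEN_FILE):
            with open(SEEN_FILE) as f:
                self.seen = set(json.load(f))
        else:
            self.seen = set()

    def parse_item(self, response):
        if response.url in self.seen:
            return  # already scraped on an earlier run
        self.seen.add(response.url)
        yield {"url": response.url}

    def closed(self, reason):
        # Scrapy calls closed() when the spider finishes; persist the set
        # for the next daily run.
        with open(SEEN_FILE, "w") as f:
            json.dump(sorted(self.seen), f)

(The set lookup itself is cheap; my worry is more about the bookkeeping than the speed.)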

Spider

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from cnn_scrapy.items import NewspaperItem



class NewspaperSpider(CrawlSpider):
    name = "newspaper"
    allowed_domains = ["cnn.com"]
    start_urls = [
        "http://www.cnn.com/"
    ]

    rules = (
        Rule(LinkExtractor(), callback="parse_item", follow=True),
    )

    def parse_item(self, response):
        self.log("Scraping: " + response.url)
        item = NewspaperItem()
        item["url"] = response.url
        yield item

Items

import scrapy


class NewspaperItem(scrapy.Item):
    url = scrapy.Field()
    visit_id = scrapy.Field()
    visit_status = scrapy.Field()

Middlewares (ignore.py)

from scrapy import log
from scrapy.http import Request
from scrapy.item import BaseItem
from scrapy.utils.request import request_fingerprint

from cnn_scrapy.items import NewspaperItem

class IgnoreVisitedItems(object):
    """Middleware to ignore re-visiting item pages if they were already visited
    before. The requests to be filtered by have a meta['filter_visited'] flag
    enabled and optionally define an id to use for identifying them, which
    defaults the request fingerprint, although you'd want to use the item id,
    if you already have it beforehand to make it more robust.
    """

    FILTER_VISITED = 'filter_visited'
    VISITED_ID = 'visited_id'
    CONTEXT_KEY = 'visited_ids'

    def process_spider_output(self, response, result, spider):
        # Visited ids live in spider.context; note that if the spider does not
        # define a `context` dict, the default {} created here is discarded on
        # every call, so nothing is actually remembered, let alone persisted
        # across daily runs.
        context = getattr(spider, 'context', {})
        visited_ids = context.setdefault(self.CONTEXT_KEY, {})
        ret = []
        for x in result:
            visited = False
            if isinstance(x, Request):
                # Only requests explicitly tagged with meta['filter_visited']
                # are candidates for filtering.
                if self.FILTER_VISITED in x.meta:
                    visit_id = self._visited_id(x)
                    if visit_id in visited_ids:
                        log.msg("Ignoring already visited: %s" % x.url,
                                level=log.INFO, spider=spider)
                        visited = True
            elif isinstance(x, BaseItem):
                # An item coming out of a page marks that page as visited.
                visit_id = self._visited_id(response.request)
                if visit_id:
                    visited_ids[visit_id] = True
                    x['visit_id'] = visit_id
                    x['visit_status'] = 'new'
            if visited:
                # Drop the request but still emit an item recording the skip.
                ret.append(NewspaperItem(visit_id=visit_id, visit_status='old'))
            else:
                ret.append(x)
        return ret

    def _visited_id(self, request):
        # Use the explicit visited_id from meta if given, otherwise fall back
        # to Scrapy's request fingerprint.
        return request.meta.get(self.VISITED_ID) or request_fingerprint(request)
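
As far as I can tell, the "something required of the request instance" is the meta['filter_visited'] flag, plus registering the class as a spider middleware. A rough, untested sketch of the wiring (the dotted module path and the priority number are guesses for this project layout):

# settings.py -- enable the spider middleware
SPIDER_MIDDLEWARES = {
    # placeholder dotted path; adjust to wherever ignore.py actually lives
    "cnn_scrapy.middlewares.ignore.IgnoreVisitedItems": 543,
}

# spider -- tag the requests the middleware should consider
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

from cnn_scrapy.items import NewspaperItem


def tag_for_filtering(request, response=None):
    """Set the flag IgnoreVisitedItems looks for on each extracted request."""
    request.meta["filter_visited"] = True
    return request


class NewspaperSpider(CrawlSpider):
    name = "newspaper"
    allowed_domains = ["cnn.com"]
    start_urls = ["http://www.cnn.com/"]

    rules = (
        Rule(LinkExtractor(), callback="parse_item", follow=True,
             process_request=tag_for_filtering),
    )

    def parse_item(self, response):
        item = NewspaperItem()
        item["url"] = response.url
        yield item

Even with this wired up, visited_ids only live in memory (spider.context), so something still has to persist them between the daily runs — which is really the crux of the question.
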
  • and what about urls that need to be found within other responses? – eLRuLL Jun 10 '16 at 02:42
  • I'm assuming that after I've been to a URL, no new URLs will be found on that page (with the exception of the start_urls). Or have I misunderstood your question? – Henry David Thorough Jun 10 '16 at 02:43
  • no, that is ok, then I think your approach (or a similar one) is fine; the idea is to save the ones that were already done, and if there are a lot of them I would recommend a separate database. Scrapy also stores requests as fingerprints, which is what its own deduplication component uses. – eLRuLL Jun 10 '16 at 02:48
  • Ah so you mean write out all the URLs to a database and for each new URL, skip it if it's in the database? With a lookup of some sort? – Henry David Thorough Jun 10 '16 at 02:49
  • yes, that's the only way. Of course, saving only URLs will work for GET requests; if you have POST requests, Scrapy's request fingerprint can help. – eLRuLL Jun 10 '16 at 02:51
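
For reference, one low-effort way to get the persistent, fingerprint-based deduplication the comments describe is Scrapy's JOBDIR setting: with a job directory set, the default RFPDupeFilter writes its request fingerprints to a requests.seen file and reloads them on the next run, so pages crawled yesterday are filtered out today (start_urls are requested with dont_filter, so the crawl still begins at the front page). A minimal sketch, with a placeholder directory name:

# settings.py (or on the command line: scrapy crawl newspaper -s JOBDIR=crawls/newspaper)
JOBDIR = "crawls/newspaper"  # keep the same directory across daily runs

This assumes each run finishes cleanly; JOBDIR also persists the scheduler queue, so an interrupted run will resume where it left off rather than starting fresh.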

1 Answer


Here's the thing: what you really want is a single database that your scheduled/cron'd crawl runs against. Duplicate-filtering middleware or not, you are still crawling the entire site on every run, and (even allowing that the code you posted can't be the whole project) it is more code than the problem needs.

I'm not exactly sure what you are scraping, but I'll assume for now that you're pointing the project at CNN and extracting articles.

What I would do is use CNN's RSS feeds, or even its sitemap, since those provide the publication date in the article metadata (with a little help from the os module), and then:

  • define a date for each crawl instance;
  • use a regex to restrict the items to articles posted on or after the crawl's defined date;
  • deploy and schedule the crawl on Scrapinghub;
  • use Scrapinghub's Python API client to iterate through the items.

You would still be crawling the entire site's content, but an XML feed/RSS spider class is perfect for parsing all that data more quickly. And now that the database lives in the "cloud", I feel the project can be more modular and scalable, and much easier to port across environments.
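
Roughly, for the feed part, something like this (untested; the feed URL and the one-day cutoff are assumptions, and I'm parsing the pubDate with email.utils rather than a regex):

from datetime import datetime, timedelta
from email.utils import parsedate

from scrapy.spiders import XMLFeedSpider

from cnn_scrapy.items import NewspaperItem


class CnnRssSpider(XMLFeedSpider):
    name = "cnn_rss"
    allowed_domains = ["cnn.com"]
    # Placeholder feed; add whichever CNN section feeds you care about.
    start_urls = ["http://rss.cnn.com/rss/cnn_topstories.rss"]
    iterator = "iternodes"
    itertag = "item"

    def parse_node(self, response, node):
        url = node.xpath("link/text()").extract_first()
        pub_date = node.xpath("pubDate/text()").extract_first()
        if not url or not pub_date:
            return
        # RSS dates look like "Fri, 10 Jun 2016 02:42:00 GMT".
        parsed = parsedate(pub_date)
        if parsed is None:
            return
        published = datetime(*parsed[:6])
        # Only keep articles posted since the previous daily run.
        if published < datetime.utcnow() - timedelta(days=1):
            return
        item = NewspaperItem()
        item["url"] = url
        yield item

Scheduling that spider on Scrapinghub and reading the items back with the python-scrapinghub client is then mostly configuration.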

I'm sure the flow I'm describing would need some tinkering, but the idea is straightforward.

scriptso