
I would like to analyse the link structure and text content of a number of interconnected websites (e.g. websites about science fiction). I have a list of authorised websites that I would like to scrape, about 300 of them. Once I have the crawled pages in a db, I will analyse the data with other tools.

It seems that Scrapy is one of the best tools out there for this kind of task, but I am struggling to define a spider that does what I need. I need the following features:

  • scrape only certain domains (the list is defined in an external text file that might change)
  • limit the depth of recursion to a given value (e.g. 3)
  • for each page, save the title, HTML content, and links in an SQLite db
  • use a cache to avoid hammering the websites by downloading the same pages again. The cache should have an expiry date (e.g. 1 week); after the expiry date, the page should be scraped again (see the settings sketch just after this list)
  • run the spider manually (for the moment I don't need scheduling)
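
For the cache requirement, Scrapy's built-in HTTP cache looks like it might be enough; here is a minimal settings.py sketch of what I have in mind (the one-week value is just my example):

# settings.py
HTTPCACHE_ENABLED = True                       # serve repeated requests from the on-disk cache
HTTPCACHE_EXPIRATION_SECS = 60 * 60 * 24 * 7   # expire cached pages after one week
HTTPCACHE_DIR = 'httpcache'                    # default location, under the project's .scrapy directory

I plan to run the spider manually with scrapy crawl page.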

To achieve this goal, I have started to define a spider in this way:

# http://doc.scrapy.org/en/latest/intro/tutorial.html

from scrapy import log
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import Selector
from ..items import PageItem

class PageSpider(CrawlSpider):
    name = "page"

    # follow every extracted link and hand each downloaded page to parse_item
    rules = (Rule(SgmlLinkExtractor(allow=()), callback='parse_item', follow=True),)
    # restrict_xpaths=('//body',) could be passed to the link extractor to limit where links are taken from

    def parse_item(self, response):
        log.msg("PageSpider.parse_item")
        log.msg(response.url)
        # build a selector for the response so the xpath below has something to work on
        sel = Selector(response)
        links = sel.xpath('//body//a/@href').extract()
        item = PageItem()
        item['url'] = response.url
        item['content'] = response.body
        item['links'] = "\n".join(links)
        return item

How can I load the list of allowed sites into the spider (the allow argument above)? To store the items, I am using a pipeline which seems to work OK (it has no temporal logic yet, but it stores data in a local db):

# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
from scrapy import log
import sqlite3
import time

class MyProjectPipeline(object):

    def open_spider(self, spider):
        # https://docs.python.org/2/library/sqlite3.html
        log.msg("open SQLite DB")
        self._db_conn = sqlite3.connect('consp_crawl_pages.db')
        c = self._db_conn.cursor()
        # create table
        c.execute('''CREATE TABLE IF NOT EXISTS pages_dump ( p_url PRIMARY KEY, p_content, p_links, p_ts )''')
        self._db_conn.commit()

    def process_item(self, item, spider):
        log.msg("process item")
        if not self.url_exists(item['url']):
            # insert element
            c = self._db_conn.cursor()
            rows = [(item['url'], item['content'], item['links'], time.time())]
            c.executemany('INSERT INTO pages_dump VALUES (?,?,?,?)', rows)
            self._db_conn.commit()
        return item

    def close_spider(self, spider):
        log.msg("closing SQLite DB")
        self._db_conn.close()

    def url_exists(self, url):
        c = self._db_conn.cursor()
        c.execute("SELECT p_url FROM pages_dump WHERE p_url = ?", (url,))
        return c.fetchone() is not None

How can I stop the spider from requesting a URL if it is already present in the db?

Am I adopting a sensible approach or are there more natural ways of doing these things in Scrapy? My Python isn't great, so coding suggestions are also welcome :-)

Thanks for any comments, Mulone

  • I believe that it is often better to use downloader middleware to filter out duplicate URLs before a request is actually sent out. Have a look at this question for more details of how this can be achieved: http://stackoverflow.com/questions/22963585/scrapy-middleware-tutorial – Talvalin Apr 10 '14 at 13:33
  • For the cache, there is a setting in Scrapy: https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#std:setting-HTTPCACHE_ENABLED – Raheel Jul 21 '17 at 05:56

1 Answer


So, I realize this is a very late answer, but here is my attempt to answer your questions:

1) For scraping domains listed in a txt file, you just need to populate the spider attribute allowed_domains in the __init__ method:

class PageSpider(CrawlSpider):
    name = "page"

    def __init__(self, *args, **kwargs):
        # one domain per line in the text file
        with open('YOUR_FILE') as f:
            self.allowed_domains = f.read().splitlines()
        super(PageSpider, self).__init__(*args, **kwargs)
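
A related convenience (just a sketch; the domains_file argument name is only an example, not part of the original code): arguments passed on the command line with -a arrive in __init__ as keyword arguments, so the file name does not have to be hard-coded:

    def __init__(self, domains_file='domains.txt', *args, **kwargs):
        # -a domains_file=... on the command line ends up here
        with open(domains_file) as f:
            self.allowed_domains = f.read().splitlines()
        super(PageSpider, self).__init__(*args, **kwargs)

The spider can then be run with, e.g., scrapy crawl page -a domains_file=sci_fi_sites.txt.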

2) To limit the depth, just set the DEPTH_LIMIT setting.
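
For example, in the project's settings.py (3 being the example depth from the question):

DEPTH_LIMIT = 3  # do not follow links more than 3 hops away from the start URLs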

3) To save the data in a db, a pipeline is the way to go -- you're doing it right. =)
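
For the missing temporal logic, one option (just a sketch against the pages_dump schema shown in the question, using the same one-week expiry) is to overwrite a row whenever its timestamp is older than a week, relying on SQLite's INSERT OR REPLACE and the PRIMARY KEY on p_url:

WEEK_SECONDS = 7 * 24 * 60 * 60  # one-week expiry, as in the question

def process_item(self, item, spider):
    c = self._db_conn.cursor()
    c.execute("SELECT p_ts FROM pages_dump WHERE p_url = ?", (item['url'],))
    row = c.fetchone()
    if row is None or time.time() - row[0] > WEEK_SECONDS:
        # new page, or the stored copy is older than a week: (re)write it
        c.execute('INSERT OR REPLACE INTO pages_dump VALUES (?,?,?,?)',
                  (item['url'], item['content'], item['links'], time.time()))
        self._db_conn.commit()
    return item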

4) Scrapy already avoids duplicate requests within the same crawl by default, but to avoid duplicating requests made in previous crawls, you'll have to store the requests from previous crawls somewhere external and filter them out in a middleware, like the one linked by Talvalin in the comments: https://stackoverflow.com/a/22968884/149872
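
Here is a minimal sketch of such a filter as a downloader middleware; it assumes the consp_crawl_pages.db / pages_dump schema from the question and the same one-week expiry, and the class name is only a placeholder:

# middlewares.py (sketch)
import sqlite3
import time

from scrapy.exceptions import IgnoreRequest

WEEK_SECONDS = 7 * 24 * 60 * 60  # one-week expiry, matching the question

class SkipRecentlyCrawledMiddleware(object):

    def __init__(self):
        self._db_conn = sqlite3.connect('consp_crawl_pages.db')
        # make sure the table exists even on the very first run
        self._db_conn.execute('''CREATE TABLE IF NOT EXISTS pages_dump ( p_url PRIMARY KEY, p_content, p_links, p_ts )''')
        self._db_conn.commit()

    def process_request(self, request, spider):
        c = self._db_conn.cursor()
        c.execute("SELECT p_ts FROM pages_dump WHERE p_url = ?", (request.url,))
        row = c.fetchone()
        if row is not None and time.time() - row[0] < WEEK_SECONDS:
            # stored less than a week ago: drop the request before it is downloaded
            raise IgnoreRequest("recently crawled: %s" % request.url)
        return None  # not in the db (or expired): let Scrapy download it

It would be enabled with something like DOWNLOADER_MIDDLEWARES = {'myproject.middlewares.SkipRecentlyCrawledMiddleware': 543} in settings.py (the module path and the 543 priority are placeholders).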

Elias Dorneles