I would like to analyse the link structure and text content of a number of interconnected websites (e.g. websites about science fiction). I have a list of about 300 authorised websites that I would like to scrape. Once I have the crawled pages in a DB, I will analyse the data with other tools.
It seems that Scrapy is one of the best tools out there for this kind of task, but I am struggling to define a spider that does what I need. I need the following features:
- scrape only certain domains (list defined in an external text file that might change)
- limit the depth of recursion to a given value (e.g. 3) (see the settings sketch below)
- for each page, save the title, HTML content, and links in an SQLite DB
- use a cache to avoid hammering the websites by downloading the same pages repeatedly; the cache should have an expiry period (e.g. one week), after which a page should be scraped again (also covered in the settings sketch below)
- run the spider manually (for the moment I don't need scheduling)
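For the depth limit and the cache expiry I am assuming Scrapy's built-in settings are enough; this is the kind of thing I would put in settings.py (the values are just examples):

# settings.py (sketch): depth limit and HTTP cache with a one-week expiry
DEPTH_LIMIT = 3                             # stop following links after 3 hops

HTTPCACHE_ENABLED = True                    # cache downloaded responses on disk
HTTPCACHE_EXPIRATION_SECS = 7 * 24 * 3600   # refetch pages older than one week
HTTPCACHE_DIR = 'httpcache'                 # stored under the project's .scrapy dir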
To achieve this goal, I have started to define a spider in this way:
# http://doc.scrapy.org/en/latest/intro/tutorial.html
from scrapy import log
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import Selector

from ..items import PageItem


class PageSpider(CrawlSpider):
    name = "page"

    rules = (
        Rule(SgmlLinkExtractor(allow=()),  # restrict_xpaths=('//body',)
             callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        log.msg("PageSpider.parse_item")
        log.msg(response.url)

        # extract all the hrefs in the body
        sel = Selector(response)
        links = sel.xpath('//body//a/@href').extract()

        item = PageItem()
        item['url'] = response.url
        item['content'] = response.body
        item['links'] = "\n".join(links)
        return item
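For completeness, my PageItem in items.py is just a plain item with the three fields used above, roughly:

# items.py: the item filled in by the spider and stored by the pipeline
from scrapy.item import Item, Field

class PageItem(Item):
    url = Field()
    content = Field()
    links = Field()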
How can I load the list of allowed sites from the external text file into the spider's allow?
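The only approach I have come up with is to read the file in the spider's __init__ and build the rule there, something like the sketch below (domains.txt is just a placeholder name for my list, one domain per line); I don't know whether this is the idiomatic way:

# Sketch: build the crawl rule from an external file of allowed domains.
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

class PageSpider(CrawlSpider):
    name = "page"
    # parse_item() stays as defined above

    def __init__(self, *args, **kwargs):
        with open('domains.txt') as f:
            domains = [line.strip() for line in f if line.strip()]
        self.allowed_domains = domains          # used by the offsite middleware
        # the rules must be set before CrawlSpider.__init__, which compiles them
        self.rules = (
            Rule(SgmlLinkExtractor(allow_domains=domains),
                 callback='parse_item', follow=True),
        )
        super(PageSpider, self).__init__(*args, **kwargs)

Would setting allowed_domains in __init__ like this also make the offsite middleware filter out links to other sites, or is there a better place to do it?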
To store the items, I am using a pipeline which seems to work OK (it has no temporal logic yet, but it stores the data in a local SQLite DB; after the code I sketch what I think the temporal logic could look like):
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
from scrapy import log
import sqlite3
import time


class MyProjectPipeline(object):

    def process_item(self, item, spider):
        log.msg("process item")
        if not self.url_exists(item['url']):
            # insert the page only if its URL is not already stored
            c = self._db_conn.cursor()
            c.execute('INSERT INTO pages_dump VALUES (?,?,?,?)',
                      (item['url'], item['content'], item['links'], time.time()))
            self._db_conn.commit()
        return item

    def open_spider(self, spider):
        # https://docs.python.org/2/library/sqlite3.html
        log.msg("opening SQLite DB")
        self._db_conn = sqlite3.connect('consp_crawl_pages.db')
        c = self._db_conn.cursor()
        # create the table on the first run
        c.execute('''CREATE TABLE IF NOT EXISTS pages_dump
                     (p_url PRIMARY KEY, p_content, p_links, p_ts)''')
        self._db_conn.commit()

    def close_spider(self, spider):
        log.msg("closing SQLite DB")
        self._db_conn.close()

    def url_exists(self, url):
        c = self._db_conn.cursor()
        c.execute("SELECT p_url FROM pages_dump WHERE p_url = ?", (url,))
        return c.fetchone() is not None
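This is what I think the missing temporal logic could look like (a sketch; EXPIRY_SECS is just a name I made up):

EXPIRY_SECS = 7 * 24 * 3600   # one week, module-level constant in pipelines.py

# drop-in replacement for MyProjectPipeline.process_item:
def process_item(self, item, spider):
    c = self._db_conn.cursor()
    c.execute("SELECT p_ts FROM pages_dump WHERE p_url = ?", (item['url'],))
    row = c.fetchone()
    if row is None or time.time() - row[0] > EXPIRY_SECS:
        # new page, or the stored copy has expired: (re)write the row
        c.execute('INSERT OR REPLACE INTO pages_dump VALUES (?,?,?,?)',
                  (item['url'], item['content'], item['links'], time.time()))
        self._db_conn.commit()
    return item

INSERT OR REPLACE should work because p_url is the primary key, so an expired row is simply overwritten with the fresh content and a new timestamp.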
How can I stop the spider from requesting a URL if it is already present in the DB and has not expired yet?
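The only idea I have had is a custom downloader middleware that looks the request URL up in the DB and raises IgnoreRequest when the stored copy is still fresh. The class name and EXPIRY_SECS below are names I have made up; the table and columns are the ones from my pipeline:

# Sketch: drop requests whose URL is already in the SQLite DB and still fresh.
import time
import sqlite3

from scrapy.exceptions import IgnoreRequest

EXPIRY_SECS = 7 * 24 * 3600  # one week

class SkipKnownUrlsMiddleware(object):

    def __init__(self):
        self._db_conn = sqlite3.connect('consp_crawl_pages.db')

    def process_request(self, request, spider):
        c = self._db_conn.cursor()
        c.execute("SELECT p_ts FROM pages_dump WHERE p_url = ?", (request.url,))
        row = c.fetchone()
        if row is not None and time.time() - row[0] < EXPIRY_SECS:
            # we already have a fresh copy: do not download this page again
            raise IgnoreRequest()
        return None  # otherwise let Scrapy download the request as usual

I suppose I would then have to enable it via DOWNLOADER_MIDDLEWARES in settings.py, but I am not sure whether this just duplicates what the built-in HTTP cache already gives me.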
Am I adopting a sensible approach or are there more natural ways of doing these things in Scrapy? My Python isn't great, so coding suggestions are also welcome :-)
Thanks for any comments, Mulone