
I'm a beginner in Python, and I'm using Scrapy for a personal web project.

I use Scrapy to extract data from several websites repeatedly, so on every crawl I need to check whether a link is already in the database before adding it. I did this in a pipelines.py class:

from scrapy.exceptions import DropItem
import memcache

# memcached client holding the links already stored in the database
memc2 = memcache.Client(['127.0.0.1:11211'], debug=1)

class DuplicatesPipeline(object):
    def process_item(self, item, spider):
        # keep the item only if its link is not already in memcache
        if memc2.get(item['link']) is None:
            return item
        raise DropItem('Duplicate link %s' % item['link'])
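
For completeness, a pipeline like this only takes effect if it is enabled in settings.py, roughly as below (myproject stands in for the actual project package name, and the priority value is arbitrary):

ITEM_PIPELINES = {
    'myproject.pipelines.DuplicatesPipeline': 300,
}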

But I've heard that using middleware is better for this task.

I found it a little hard to use middleware in Scrapy; can anyone please point me to a good tutorial?

Any advice is welcome.

Thanks,

Edit:

I'm using MySQL and memcache.

Here is my attempt, based on @Talvalin's answer:

# -*- coding: utf-8 -*-

from scrapy.exceptions import IgnoreRequest
import MySQLdb as mdb
import memcache

connexion = mdb.connect('localhost','dev','passe','mydb')
memc2 = memcache.Client(['127.0.0.1:11211'], debug=1)

class IgnoreDuplicates():

    def __init__(self):
        #clear memcache object
        memc2.flush_all()

        #update memc2
        with connexion:
            cur = connexion.cursor()
            cur.execute('SELECT link, title FROM items')
            for item in cur.fetchall():
                memc2.set(item[0], item[1])

    def precess_request(self, request, spider):
        #if the url is not in memc2 keys, it returns None.
        if memc2.get(request.url) is None:
            return None
        else:
            raise IgnoreRequest()

DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.IgnoreDuplicates': 543,
    'scrapy.contrib.downloadermiddleware.httpproxy.HttpProxyMiddleware': 500,
}

But it seems that the process_request method is ignored when crawling.

Thanks in advance,

elhoucine
  • Essentially, you need to create a downloader middleware class that implements a `process_response` method and loads your crawled URLs and checks the URL of the incoming response to see if there is a match. http://doc.scrapy.org/en/latest/topics/downloader-middleware.html – Talvalin Apr 09 '14 at 15:14
  • What DB are you using by the way? – Talvalin Apr 09 '14 at 15:17
  • I'm using MySql and memcache. Thanks for the response. – elhoucine Apr 09 '14 at 16:06
  • The code you've posted above refers to `precess_request` rather than `process_request`. If the code above was copied from the code that you're using, then that might explain why it is not working. – Talvalin Apr 10 '14 at 13:30

1 Answer


Here's some example middleware code that loads URLs from a sqlite3 table (Id INT, url TEXT) into a set, and then checks request URLs against the set to determine whether the URL should be ignored. It should be reasonably straightforward to adapt this code to use MySQL and memcache (see the sketch at the end of this answer), but please let me know if you have any issues or questions. :)

import sqlite3
from scrapy.exceptions import IgnoreRequest

class IgnoreDuplicates():

    def __init__(self):
        # load all previously crawled URLs into an in-memory set
        self.crawled_urls = set()

        with sqlite3.connect(r'C:\dev\scrapy.db') as conn:
            cur = conn.cursor()
            cur.execute("""SELECT url FROM CrawledURLs""")
            self.crawled_urls.update(x[0] for x in cur.fetchall())

        # debug: show what was loaded
        print self.crawled_urls

    def process_request(self, request, spider):
        # drop requests whose URL has already been crawled
        if request.url in self.crawled_urls:
            raise IgnoreRequest()
        else:
            return None
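
(If the CrawledURLs table doesn't exist yet, it can be created with something like the snippet below; the schema matches the (Id INT, url TEXT) description above, and the database path is the same one used in the middleware.)

import sqlite3

with sqlite3.connect(r'C:\dev\scrapy.db') as conn:
    conn.execute('CREATE TABLE IF NOT EXISTS CrawledURLs (Id INT, url TEXT)')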

On the off-chance you have import issues like me and are about to punch your monitor: the middleware code above was in a middlewares.py file placed in the top-level project folder, with the following DOWNLOADER_MIDDLEWARES setting:

DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.IgnoreDuplicates': 543,
    'scrapy.contrib.downloadermiddleware.httpproxy.HttpProxyMiddleware': 500,
}
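
And for reference, here is a minimal, untested sketch of the MySQL/memcache adaptation mentioned at the top of this answer, reusing the connexion/memc2 setup and the items(link, title) table from the question's attempt:

import MySQLdb as mdb
import memcache
from scrapy.exceptions import IgnoreRequest

connexion = mdb.connect('localhost', 'dev', 'passe', 'mydb')
memc2 = memcache.Client(['127.0.0.1:11211'], debug=1)

class IgnoreDuplicates():

    def __init__(self):
        # rebuild the memcache view of already-stored links on startup
        memc2.flush_all()
        with connexion:
            cur = connexion.cursor()
            cur.execute('SELECT link, title FROM items')
            for link, title in cur.fetchall():
                memc2.set(link, title)

    def process_request(self, request, spider):
        # note the spelling: process_request, not precess_request
        if memc2.get(request.url) is None:
            return None
        raise IgnoreRequest()
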
Talvalin
  • Hello Talvalin, I tested your solution, but it seems that Scrapy just ignores the process_request method when crawling, so it's not ignoring duplicate links. I checked the documentation and found only methods like process_spider_input and process_spider_output, but no process_request. Thanks – elhoucine Apr 09 '14 at 18:44
  • When you run your spider, does IgnoreDuplicates show up in the logs under enabled middlewares? – Talvalin Apr 09 '14 at 20:24
  • I think yes. I run "scrapy crawl project_name" ---------- 2014-04-09 22:36:07+0100 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, IgnoreDuplicates, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats 2014-04-09 22:36:07+0100 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware – elhoucine Apr 09 '14 at 21:41
  • Try inserting a `print request.url` in `process_request` just to test if that code is reached? I had a few problems initially, but eventually realised that it was because my urls table was wrong... – Talvalin Apr 09 '14 at 22:43
  • Thanks it works. --- 'downloader/exception_type_count/scrapy.exceptions.IgnoreRequest': 310,--- – elhoucine Apr 10 '14 at 15:06
  • Thanks, I was able to use this in combination with my MongoDB pipeline. – vreen Aug 30 '16 at 07:29