I've been trying to write a custom downloader middleware in Scrapy that flags URLs containing certain patterns, using regex. In short, there is a list of exceptions, and each URL is checked against it. However, the middleware fails to identify the exceptions: re.match() always returns None, even for URLs that contain one of the exception words.
I've tried the same regex in a separate script, and it works there. I'd really appreciate any ideas as to why this may be happening.
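A minimal sketch of such a standalone check, assuming the URL is available as a plain string. The helper name `flag_exceptions` is hypothetical; the pattern, the exception words, and the URL are the ones that appear in the middleware and log output further down.

```python
import re


def flag_exceptions(url, exceptions):
    """Return the exception words whose pattern matches from the start of `url`."""
    flagged = []
    for word in exceptions:
        # Same pattern as in the middleware: the word must appear as a full path segment.
        pattern = re.compile(r'[a-z0-9.:/-]+/{}/[0-9a-z.:/-]+'.format(re.escape(word)))
        if pattern.match(url):
            flagged.append(word)
    return flagged


url = ('https://www.amazon.co.uk/gp/help/customer/display.html/'
       'ref=footer_cookies_notice?ie=UTF8&nodeId=201890250')
print(flag_exceptions(url, ['Audible-Audiobook-Downloads', 'help']))
```

On a bare URL string like this one, the `help` segment is flagged, which is the behaviour expected from the middleware as well.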
Here is the example situation:
1) Spider:
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class AmazonSpider(CrawlSpider):
    name = 'amazon'
    allowed_domains = ['amazon.co.uk']
    start_urls = ['http://amazon.co.uk/']
    rules = (
        Rule(LinkExtractor(allow=''), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        i = {}
        i['url'] = response.url
        return i
2) Settings:
BOT_NAME = 'foo'
SPIDER_MODULES = ['foo.spiders']
NEWSPIDER_MODULE = 'foo.spiders'
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:63.0) Gecko/20100101 Firefox/63.0'
ROBOTSTXT_OBEY = False
DOWNLOADER_MIDDLEWARES = {
    'foo.middlewares.FooDownloaderMiddleware': 543,
    'foo.middlewares.TryMiddleware': 500,
}
3) My middleware (i.e. a new class in middlewares.py):
import logging
import re
. . .
class TryMiddleware(object):

    def __init__(self):
        self.items_scraped = 0
        self.target = ''
        self.exceptions = []

    @classmethod
    def from_crawler(cls, crawler):
        s = cls()
        return s

    def process_request(self, request, spider):
        self.target = str(request)
        # Just an example; at a later stage, these will be dynamically generated.
        self.exceptions = ['Audible-Audiobook-Downloads', 'help']
        for i in self.exceptions:
            pattern = re.compile(r'[a-z0-9.:/-]+/{}/[0-9a-z.:/-]+'.format(re.escape(i)))
            if i in self.target:
                m = pattern.match(self.target)
                # This is how I tried checking whether the word is contained in the URL,
                # and whether the regex caught it.
                logger.info(f'\n*\nFound {m} in {self.target}\n*\n')
        return None
4) This is an example of what my logger outputs:
* Found None in https://www.amazon.co.uk/gp/help/customer/display.html/ref=footer_cookies_notice?ie=UTF8&nodeId=201890250> *
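The trailing `>` in the logged target suggests that the string being matched is not the bare URL. The sketch below uses plain strings only (no Scrapy) and assumes that `str(request)` renders as `<GET url>`. Under that assumption, `re.match()`, which only matches at the start of the string, fails on the leading `<`, while the same pattern matches the bare URL. The variable names are illustrative.

```python
import re

pattern = re.compile(r'[a-z0-9.:/-]+/{}/[0-9a-z.:/-]+'.format(re.escape('help')))

bare_url = ('https://www.amazon.co.uk/gp/help/customer/display.html/'
            'ref=footer_cookies_notice?ie=UTF8&nodeId=201890250')
# Assumption: this mimics what str(request) produces for a GET request.
wrapped = '<GET {}>'.format(bare_url)

# match() is anchored at position 0, so it succeeds on the bare URL ...
match_bare = pattern.match(bare_url)
# ... but fails on the wrapped form, because '<' is not in the character class.
match_wrapped = pattern.match(wrapped)
# search() scans forward and can start matching at the 'h' of 'https'.
search_wrapped = pattern.search(wrapped)

print(match_bare is not None, match_wrapped is None, search_wrapped is not None)
```

If the assumption holds, matching against `request.url` instead of `str(request)` in `process_request` would be the direct equivalent of the standalone case.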