I've been trying to write a custom middleware in Scrapy that flags URLs containing certain patterns using regex. In short, there is a list of exceptions, and each URL is checked against it. However, the middleware fails to identify the exceptions: re.match() always returns None.

I've tried implementing regex in a separate script, and it works. I'd really appreciate any ideas as to why this may be happening.

Here is the example situation:

1) Spider:

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class AmazonSpider(CrawlSpider):
    name = 'amazon'
    allowed_domains = ['amazon.co.uk']
    start_urls = ['http://amazon.co.uk/']

    rules = (
        Rule(LinkExtractor(allow=''), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        i = {}
        i['url'] = response.url
        return i

2) Settings:

BOT_NAME = 'foo'

SPIDER_MODULES = ['foo.spiders']
NEWSPIDER_MODULE = 'foo.spiders'

USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:63.0) Gecko/20100101 Firefox/63.0'

ROBOTSTXT_OBEY = False
DOWNLOADER_MIDDLEWARES = {
    'foo.middlewares.FooDownloaderMiddleware': 543,
    'foo.middlewares.TryMiddleware': 500,
}

3) My middleware (i.e. a new class in middlewares.py):

import logging
import re

. . .

class TryMiddleware(object):

    def __init__(self):
        self.items_scraped = 0
        self.target = ''
        self.exceptions = []

    @classmethod
    def from_crawler(cls, crawler):
        s = cls()

        return s

    def process_request(self, request, spider):
        self.target = str(request)

        # Just an example, at a later stage, these will be dynamically generated.
        self.exceptions = ['Audible-Audiobook-Downloads','help']

        for i in self.exceptions:
            pattern = re.compile(r'[a-z0-9.:/-]+/{}/[0-9a-z.:/-]+'.format(re.escape(i)))

            if i in self.target:
                m = pattern.match(self.target)
                # This is how I tried checking if the word is contained in the url,
                # and see if regex caught it.
                logger.info(f'\n*\nFound {m} in {self.target}\n*\n')

        return None

4) This is an example of what my logger outputs:

* Found None in https://www.amazon.co.uk/gp/help/customer/display.html/ref=footer_cookies_notice?ie=UTF8&nodeId=201890250> *

T the shirt

1 Answer

Your code does work. The loop first tries to match Audible-Audiobook-Downloads, which returns None for the URL in your question because that string does not appear in it; that is the None you are seeing. It then checks whether help appears in the URL, which it does, and that match is printed as well.

In the code below I check if m is not None and then print the full match.

import logging
import re

exceptions = ['Audible-Audiobook-Downloads','help']

for i in exceptions:
    pattern = re.compile(r'[a-z0-9.:/-]+/{}/[0-9a-z.:/-]+'.format(re.escape(i)))

    m = pattern.match("https://www.amazon.co.uk/gp/help/customer/display.html/ref=footer_cookies_notice?ie=UTF8&nodeId=201890250")
    if m:
        print(m.group(0))
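
One thing the standalone loop above does not reproduce is the exact string the middleware matches against. The middleware builds its target with str(request) rather than from the URL itself, and the logged target in the question ends with a stray `>`. The sketch below is only an illustration under an assumption: that str(request) wraps the URL in something like `<GET ...>` (the `wrapped` string is built by hand here, it does not come from Scrapy). Since `re.match` only matches at the start of the string, a leading `<` or an uppercase letter, neither of which is in the pattern's first character class, would be enough to produce None.

```python
import re

url = ("https://www.amazon.co.uk/gp/help/customer/display.html/"
       "ref=footer_cookies_notice?ie=UTF8&nodeId=201890250")

# Assumption: a hand-built stand-in for what str(request) might look like.
wrapped = "<GET " + url + ">"

pattern = re.compile(r'[a-z0-9.:/-]+/{}/[0-9a-z.:/-]+'.format(re.escape('help')))

bare_match = pattern.match(url)           # anchored at the first character of the URL
wrapped_match = pattern.match(wrapped)    # anchored at the leading '<'
wrapped_search = pattern.search(wrapped)  # scans the whole string for a match

print(bare_match is not None)      # True
print(wrapped_match is not None)   # False
print(wrapped_search is not None)  # True
```

If that assumption holds, either matching against the bare URL or using `pattern.search` should make the middleware behave like the standalone script.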
Mark
  • Hi Mark, thanks for the advice. That's the thing, when I check regex in a separate script, like the one you provided, all works well. But when I implement it in middleware is when the problem occurs. Am I maybe making a mistake in how I refer to the request url? Also, I am using Python 3.7, but have also tried running the code in 3.5.3 (changed f'' to .format()), but the results are the same. – T the shirt Nov 09 '18 at 08:26
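
On the question of how the request URL is referenced: a Scrapy Request exposes the address as `request.url`, so the middleware could read that attribute instead of formatting the whole request object as a string. The following is a minimal sketch along those lines, not a tested drop-in. `UrlFlagMiddleware`, `flagged` and the `SimpleNamespace` stand-in request are hypothetical names used only for illustration, and the patterns are compiled once in `__init__` rather than on every request.

```python
import logging
import re
from types import SimpleNamespace

logger = logging.getLogger(__name__)


class UrlFlagMiddleware:
    """Hypothetical downloader middleware that flags URLs containing exception words."""

    def __init__(self, exceptions):
        # Compile each pattern once instead of on every request.
        self.patterns = [
            re.compile(r'[a-z0-9.:/-]+/{}/[0-9a-z.:/-]+'.format(re.escape(word)))
            for word in exceptions
        ]
        self.flagged = []  # hypothetical: URLs that matched an exception

    def process_request(self, request, spider):
        url = request.url  # the bare URL string, not str(request)
        for pattern in self.patterns:
            m = pattern.match(url)
            if m:
                logger.info('Found %s in %s', m.group(0), url)
                self.flagged.append(url)
                break
        return None  # returning None lets Scrapy keep processing the request


# Stand-in for a Scrapy Request, so the sketch runs without Scrapy installed.
fake_request = SimpleNamespace(
    url="https://www.amazon.co.uk/gp/help/customer/display.html/ref=footer_cookies_notice"
)
mw = UrlFlagMiddleware(['Audible-Audiobook-Downloads', 'help'])
result = mw.process_request(fake_request, spider=None)
```

In a real project the exception list would still have to be passed in through from_crawler or the settings; that wiring is left out here.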