
The initial problem

I am writing a CrawlSpider class (using the scrapy library) and relying on a lot of scrapy's asynchronous magic to make it work. Here it is, stripped down:

from bs4 import BeautifulSoup
from scrapy import Item  # placeholder for my actual item class
from scrapy.linkextractors import LinkExtractor
from scrapy.loader import ItemLoader
from scrapy.spiders import CrawlSpider, Rule


class MySpider(CrawlSpider):
    rules = [Rule(LinkExtractor(allow='myregex'), callback='parse_page')]
    # some other class attributes

    def __init__(self, *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)
        self.response = None
        self.loader = None

    def parse_page_section(self):
        soup = BeautifulSoup(self.response.body, 'lxml')
        # Complicated scraping logic using BeautifulSoup
        self.loader.add_value(mykey, myvalue)

    # more methods parsing other sections of the page
    # also using self.response and self.loader

    def parse_page(self, response):
        self.response = response
        self.loader = ItemLoader(item=Item(), response=response)
        self.parse_page_section()
        # call other methods to collect more stuff
        return self.loader.load_item()

The rules class attribute tells my spider to follow certain links and jump to a callback function once the web pages are downloaded. My goal is to test the parsing method parse_page_section without running the crawler or even making real HTTP requests.

What I tried

Instinctively, I turned to the mock library. I understand how to mock a function to test whether it has been called (with which arguments, and whether there were any side effects...), but that's not what I want. I want to instantiate a fake MySpider object and assign just enough attributes to it to be able to call the parse_page_section method.
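For reference, this is the kind of mocking I mean, a minimal sketch using the standard mock API (the fetch name and URL are placeholders, not my actual code):

from unittest.mock import Mock

# Mock a function and assert on how it was called.
fetch = Mock(return_value='<html>...</html>')
fetch('http://example.com')
fetch.assert_called_once_with('http://example.com')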

In the above example, I need a response object to instantiate my ItemLoader and specifically a self.response.body attribute to instantiate my BeautifulSoup. In principle, I could make fake objects like this:

from argparse import Namespace

my_spider = MySpider()
my_spider.response = Namespace(body='<html>...</html>')

That works well for the BeautifulSoup class, but I would need to add more attributes to create an ItemLoader object. In more complex situations, it would become ugly and unmanageable.
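One alternative, which came up in the comments below, is to build a real scrapy response instead of piling up fake attributes. A minimal sketch, assuming an inline HTML string and a placeholder URL:

from scrapy.http import HtmlResponse

# A real response exposes .body for BeautifulSoup and can be passed
# directly to ItemLoader, so no hand-crafted attributes are needed.
fake_response = HtmlResponse(url='http://example.com',
                             body=b'<html>...</html>',
                             encoding='utf-8')

my_spider = MySpider()
item = my_spider.parse_page(fake_response)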

My question

Is this the right approach altogether? I can't find similar examples on the web, so I think my approach may be wrong at a more fundamental level. Any insight would be greatly appreciated.

cyberbikepunk
  • @ChrisP thanks for your edit. I did not put the `scrapy` label in the first place because I thought the question had to do with unit-testing in general. – cyberbikepunk Apr 28 '16 at 14:37
  • It's definitely unit testing in general, but people who do lots of scraping might have some unique insights for unit testing scrapers. – ChrisP Apr 28 '16 at 14:45
  • In this particular `CrawlSpider` case, I could get away with faking a response object. Doing it by hand is difficult, but could this help? http://requests-mock.readthedocs.io/en/latest/overview.html. Would this be a good approach? – cyberbikepunk Apr 28 '16 at 15:18

1 Answer


Have you seen Spiders Contracts?

This allows you to test each callback of your spider without requiring a lot of code. For example:

def parse(self, response):
    """ This function parses a sample response. Some contracts are mingled
    with this docstring.

    @url http://www.amazon.com/s?field-keywords=selfish+gene
    @returns items 1 16
    @returns requests 0 0
    @scrapes Title Author Year Price
    """

Use the scrapy check command to run the contract checks.
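For example, assuming your spider's name attribute is my_spider:

$ scrapy check my_spider

This fetches the @url given in each docstring, runs the callback on it, and verifies the @returns and @scrapes constraints.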

If you want something more comprehensive, have a look at this answer.
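The core idea of that linked answer is to save sample pages to disk and wrap them in real scrapy response objects. A minimal sketch of the idea (the helper name and fixture path are placeholders):

from scrapy.http import HtmlResponse, Request

def fake_response_from_file(file_name, url='http://www.example.com'):
    # Hypothetical helper: wrap a saved HTML fixture in a real scrapy
    # response, so spider callbacks can be unit-tested offline.
    with open(file_name, 'rb') as f:
        body = f.read()
    return HtmlResponse(url=url, request=Request(url=url), body=body,
                        encoding='utf-8')

A test can then call spider.parse_page(fake_response_from_file('fixtures/sample.html')) and assert on the returned item.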

Danil
  • I guess it makes sense to go for *real life* (integration) tests instead of unit testing since the website itself can change. In essence, the fact that your unit tests work doesn't ensure that your scraping works. Thanks for your suggestion. – cyberbikepunk Apr 28 '16 at 16:12
  • There is still value in unit-tests though, as sanity checks whilst coding at the very least. The other answer you provide (http://stackoverflow.com/questions/6456304/scrapy-unit-testing/12741030#12741030) shows how to fake a response object in a better way, by actually using `scrapy` `Request` and `Response` objects. Nice tip. – cyberbikepunk Apr 28 '16 at 16:51