
The initial problem

I am writing a CrawlSpider class (using the scrapy library) and relying on a lot of scrapy's asynchronous magic to make it work. Here it is, stripped down:

from bs4 import BeautifulSoup
from scrapy import Item  # placeholder for my actual item class
from scrapy.linkextractors import LinkExtractor
from scrapy.loader import ItemLoader
from scrapy.spiders import CrawlSpider, Rule


class MySpider(CrawlSpider):
    rules = [Rule(LinkExtractor(allow='myregex'), callback='parse_page')]
    # some other class attributes

    def __init__(self, *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)
        self.response = None
        self.loader = None

    def parse_page_section(self):
        soup = BeautifulSoup(self.response.body, 'lxml')
        # Complicated scraping logic using BeautifulSoup
        self.loader.add_value(mykey, myvalue)

    # more methods parsing other sections of the page
    # also using self.response and self.loader

    def parse_page(self, response):
        self.response = response
        self.loader = ItemLoader(item=Item(), response=response)
        self.parse_page_section()
        # call other methods to collect more stuff
        return self.loader.load_item()

The rules class attribute tells my spider to follow certain links and jump to a callback function once the web pages are downloaded. My goal is to test the parsing method parse_page_section without running the crawler or even making real HTTP requests.

What I tried

Instinctively, I turned to the mock library. I understand how to mock a function to test whether it has been called (with which arguments, and whether there were any side effects...), but that's not what I want. I want to instantiate a fake MySpider object and assign just enough attributes to it to be able to call the parse_page_section method.
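For reference, this is the kind of mocking I mean, a minimal sketch using the standard mock API (the fetch name and URL are placeholders, not my actual code):

from unittest.mock import Mock

# Mock a function and assert on how it was called.
fetch = Mock(return_value='<html>...</html>')
fetch('http://example.com')
fetch.assert_called_once_with('http://example.com')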

In the above example, I need a response object to instantiate my ItemLoader and specifically a self.response.body attribute to instantiate my BeautifulSoup. In principle, I could make fake objects like this:

from argparse import Namespace

my_spider = MySpider()
my_spider.response = Namespace(body='<html>...</html>')

That works well for the BeautifulSoup class, but I would need to add more attributes to create an ItemLoader object. In more complex situations, it would become ugly and unmanageable.
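One alternative, which came up in the comments below, is to build a real scrapy response instead of piling up fake attributes. A minimal sketch, assuming an inline HTML string and a placeholder URL:

from scrapy.http import HtmlResponse

# A real response exposes .body for BeautifulSoup and can be passed
# directly to ItemLoader, so no hand-crafted attributes are needed.
fake_response = HtmlResponse(url='http://example.com',
                             body=b'<html>...</html>',
                             encoding='utf-8')

my_spider = MySpider()
item = my_spider.parse_page(fake_response)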

My question

Is this the right approach altogether? I can't find similar examples on the web, so I think my approach may be wrong at a more fundamental level. Any insight would be greatly appreciated.

cyberbikepunk
  • @ChrisP thanks for your edit. I did not put the `scrapy` label in the first place because I thought the question had to do with unit-testing in general. – cyberbikepunk Apr 28 '16 at 14:37
  • It's definitely unit testing in general, but people who do lots of scraping might have some unique insights for unit testing scrapers. – ChrisP Apr 28 '16 at 14:45
  • In this particular `CrawlSpider` case, I could get away with faking a response object. Doing it by hand is difficult, but could this help? http://requests-mock.readthedocs.io/en/latest/overview.html. Would this be a good approach? – cyberbikepunk Apr 28 '16 at 15:18

1 Answer


Have you seen Spiders Contracts?

This allows you to test each callback of your spider without requiring a lot of code. For example:

def parse(self, response):
    """ This function parses a sample response. Some contracts are mingled
    with this docstring.

    @url http://www.amazon.com/s?field-keywords=selfish+gene
    @returns items 1 16
    @returns requests 0 0
    @scrapes Title Author Year Price
    """

Use the scrapy check command to run the contract checks.
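For example, assuming your spider's name attribute is my_spider:

$ scrapy check my_spider

This fetches the @url given in each docstring, runs the callback on it, and verifies the @returns and @scrapes constraints.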

If you want something more comprehensive, have a look at this answer.
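The core idea of that linked answer is to save sample pages to disk and wrap them in real scrapy response objects. A minimal sketch of the idea (the helper name and fixture path are placeholders):

from scrapy.http import HtmlResponse, Request

def fake_response_from_file(file_name, url='http://www.example.com'):
    # Hypothetical helper: wrap a saved HTML fixture in a real scrapy
    # response, so spider callbacks can be unit-tested offline.
    with open(file_name, 'rb') as f:
        body = f.read()
    return HtmlResponse(url=url, request=Request(url=url), body=body,
                        encoding='utf-8')

A test can then call spider.parse_page(fake_response_from_file('fixtures/sample.html')) and assert on the returned item.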

Danil
  • I guess it makes sense to go for *real life* (integration) tests instead of unit testing since the website itself can change. In essence, the fact that your unit tests work doesn't ensure that your scraping works. Thanks for your suggestion. – cyberbikepunk Apr 28 '16 at 16:12
  • There is still value in unit-tests though, as sanity checks whilst coding at the very least. The other answer you provide (http://stackoverflow.com/questions/6456304/scrapy-unit-testing/12741030#12741030) shows how to fake a response object in a better way, by actually using `scrapy` `Request` and `Response` objects. Nice tip. – cyberbikepunk Apr 28 '16 at 16:51