
I'd like to implement some unit tests in a Scrapy project (screen scraper/web crawler). Since a project is run through the "scrapy crawl" command, I can run it through something like nose. Since Scrapy is built on top of Twisted, can I use its unit testing framework, Trial? If so, how? Otherwise I'd like to get nose working.

Update:

I've been talking on Scrapy-Users and I guess I am supposed to "build the Response in the test code, and then call the method with the response and assert that [I] get the expected items/requests in the output". I can't seem to get this to work though.

I can build a unit test class, and in a test:

  • create a response object
  • try to call the parse method of my spider with the response object

However, it ends up generating this traceback. Any insight as to why?
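For reference, the approach I'm describing looks roughly like this (a minimal sketch; MySpider and the HTML body are placeholders, not my real code):

import unittest
from scrapy.http import HtmlResponse, Request
from myproject.spiders.my_spider import MySpider  # placeholder spider


class MySpiderTest(unittest.TestCase):
    def test_parse(self):
        url = 'http://www.example.com'
        # build a fake response from an HTML snippet instead of hitting the network
        body = b'<html><body><a href="/item/1">item</a></body></html>'
        response = HtmlResponse(url=url, request=Request(url=url), body=body, encoding='utf-8')
        # call the callback directly and materialise the generator it returns
        results = list(MySpider().parse(response))
        self.assertTrue(results)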

ciferkey

10 Answers


The way I've done it is to create fake responses; this way you can test the parse function offline. But you get closer to the real situation by using real HTML.

A problem with this approach is that your local HTML file may not reflect the latest state of the site. If the HTML changes online you may have a big bug, but your test cases will still pass, so it may not be the best way to test.

My current workflow is: whenever there is an error I send an email to the admin with the URL. Then for that specific error I create an HTML file with the content that is causing the error, and I create a unit test for it.

This is the code I use to create sample Scrapy HTTP responses for testing from a local HTML file:

# scrapyproject/tests/responses/__init__.py

import os

from scrapy.http import Response, Request

def fake_response_from_file(file_name, url=None):
    """
    Create a fake Scrapy HTTP response from an HTML file.

    @param file_name: The relative filename from the responses directory,
                      but absolute paths are also accepted.
    @param url: The URL of the response.
    returns: A Scrapy HTTP response which can be used for unit testing.
    """
    if not url:
        url = 'http://www.example.com'

    request = Request(url=url)
    if not file_name[0] == '/':
        # resolve relative names against this responses package directory
        responses_dir = os.path.dirname(os.path.realpath(__file__))
        file_path = os.path.join(responses_dir, file_name)
    else:
        file_path = file_name

    # read as bytes: the base Response class expects a bytes body
    with open(file_path, 'rb') as f:
        file_content = f.read()

    # note: on recent Scrapy versions use TextResponse/HtmlResponse here instead,
    # so that selectors and encoding work (see the answer further down)
    response = Response(url=url,
                        request=request,
                        body=file_content)
    response.encoding = 'utf-8'
    return response

The sample HTML file is located at scrapyproject/tests/responses/osdir/sample.html.

Then the test case, located at scrapyproject/tests/test_osdir.py, could look as follows:

import unittest
from scrapyproject.spiders import osdir_spider
from responses import fake_response_from_file

class OsdirSpiderTest(unittest.TestCase):

    def setUp(self):
        self.spider = osdir_spider.DirectorySpider()

    def _test_item_results(self, results, expected_length):
        count = 0
        for item in results:
            count += 1
            self.assertIsNotNone(item['content'])
            self.assertIsNotNone(item['title'])
        self.assertEqual(count, expected_length)

    def test_parse(self):
        results = self.spider.parse(fake_response_from_file('osdir/sample.html'))
        self._test_item_results(results, 10)

That's basically how I test my parsing methods, but it's not only for parsing methods. If it gets more complex I suggest looking at Mox.
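For illustration, the same idea with the standard library's unittest.mock instead of Mox (a sketch; the lookup_price helper and its module path are made up for illustration, not part of my project):

import unittest
from unittest import mock

from scrapyproject.spiders import osdir_spider
from responses import fake_response_from_file


class OsdirSpiderMockTest(unittest.TestCase):

    # patch a helper the spider calls during parsing, so the test stays offline
    @mock.patch('scrapyproject.spiders.osdir_spider.lookup_price')  # hypothetical helper
    def test_parse_with_mocked_lookup(self, mock_lookup):
        mock_lookup.return_value = 0
        spider = osdir_spider.DirectorySpider()
        results = list(spider.parse(fake_response_from_file('osdir/sample.html')))
        self.assertTrue(results)
        self.assertTrue(mock_lookup.called)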

Sam Stoelinga
  • Nice approach for offline testing. What about running offline tests to make sure you don't have code flaws, and then running online tests to make sure the site changes don't break your program? – Igor Medeiros Sep 25 '13 at 19:12
  • @Medeiros that's the way I'm doing it in another project right now. I tag tests with @integration=1 so that I don't have to always run all tests. I'm doing this with the nosetests tagging plugin. – Sam Stoelinga Feb 04 '14 at 14:26
  • @SamStoelinga Can I also test against real data? If so, how can I "fetch" the response using Scrapy inside the unit test? I would love to check if my spider still gathers all information from a changed site. – lony Feb 07 '16 at 00:49
  • I made a separate question out of it [here](http://stackoverflow.com/questions/35256334/test-scrapy-spider-still-working-find-page-changes). – lony Feb 07 '16 at 16:59
  • I strongly suggest using Betamax to achieve that: http://stackoverflow.com/questions/6456304/scrapy-unit-testing/38214137#38214137 – Hadrien Jul 06 '16 at 19:26
  • This `from scrapyproject.spiders import osdir_spider` is not valid. How can i import the spider? – ji-ruh Aug 18 '17 at 08:20
  • @ji-ruh well that depends where your spider is. I assume you wrote your own spider so you need to change the path to your own spider... I suggest re-reading the answer and reading the scrapy docs. You can't just copy paste this answer you need to understand it. Also this answer is from 5 years ago there may be better ways to do it today with scrapy. – Sam Stoelinga Aug 18 '17 at 19:11
  • I imported it using this way. `from articles.spiders.spidername import SpiderName`. thanks – ji-ruh Aug 21 '17 at 15:05
  • You can eliminate `def fake_response_from_file` by doing `from scrapy.selector import Selector`, then in your `setUp` function have: `self.fake_response = Selector(text=open(file, 'r').read())` – b_dev Feb 08 '18 at 23:48
  • In Python 3, the relative import won't work if the test file is located in a child directory. If you want to do this, build your project into a package using `pip install -e my_pck_name` – Woody1193 Aug 08 '18 at 05:23
  • If you want to add ```meta```, add it to the ```Request``` instance creation: ```request = Request(url=url, meta=meta)``` – Nikolay Shindarov Apr 30 '19 at 13:43
  • This is the correct solution, but I personally have always seen testing when it comes to scraping as a waste of time. Maybe I'm severely misguided here but most of the time writing and doing the tests takes just as long as making the script and is dependent on a known working version of the website...so I have a hard time seeing how testing for changes against a known working version is a test. – eusid May 13 '19 at 19:09
  • @eusid I felt it improved my productivity by having a super small and simple test to quickly test my scraping code. – Sam Stoelinga Dec 30 '19 at 22:35
  • I know this is an old thread, but how do I manage to have the items sent to the pipelines? – Tobias Mayr Jan 22 '21 at 17:16
  • To test the items that go through the entire Scrapy process (including pipelines) from a given URL, I added an extension that writes spider info (e.g. items collected, urls visited, etc.) to a file using jsonpickle. Then, in the testing suite, I submit test urls using scrapyd (together with a setting enabling the extension), checking when each is complete, and running the tests on what was written to the jsonpickle'd file. It's super convoluted but couldn't find anything better after days of looking. – platelminto Sep 06 '21 at 16:25
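On the pipeline question in the comments above: pipelines are plain Python classes, so one lightweight option (a sketch, with a hypothetical DropEmptyPipeline) is to instantiate them and call process_item directly, rather than running a full crawl:

import unittest
from scrapy.exceptions import DropItem

from scrapyproject.pipelines import DropEmptyPipeline  # hypothetical pipeline


class DropEmptyPipelineTest(unittest.TestCase):

    def test_drops_items_without_title(self):
        pipeline = DropEmptyPipeline()
        # process_item(item, spider) is the standard pipeline entry point
        with self.assertRaises(DropItem):
            pipeline.process_item({'title': ''}, spider=None)

    def test_keeps_complete_items(self):
        pipeline = DropEmptyPipeline()
        item = {'title': 'hello'}
        self.assertEqual(pipeline.process_item(item, spider=None), item)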

I use Betamax to run tests against the real site the first time and keep the HTTP responses locally, so that subsequent tests run super fast:

Betamax intercepts every request you make and attempts to find a matching request that has already been intercepted and recorded.

When you need to get the latest version of the site, just remove what Betamax has recorded and re-run the test.

Example:

from scrapy import Spider, Request
from scrapy.http import HtmlResponse


class Example(Spider):
    name = 'example'

    url = 'http://doc.scrapy.org/en/latest/_static/selectors-sample1.html'

    def start_requests(self):
        yield Request(self.url, self.parse)

    def parse(self, response):
        for href in response.xpath('//a/@href').extract():
            yield {'image_href': href}


# Test part
from betamax import Betamax
from betamax.fixtures.unittest import BetamaxTestCase


with Betamax.configure() as config:
    # where betamax will store cassettes (http responses):
    config.cassette_library_dir = 'cassettes'
    config.preserve_exact_body_bytes = True


class TestExample(BetamaxTestCase):  # superclass provides self.session

    def test_parse(self):
        example = Example()

        # http response is recorded in a betamax cassette:
        response = self.session.get(example.url)

        # forge a scrapy response to test
        scrapy_response = HtmlResponse(body=response.content, url=example.url)

        result = example.parse(scrapy_response)

        self.assertEqual({'image_href': u'image1.html'}, next(result))
        self.assertEqual({'image_href': u'image2.html'}, next(result))
        self.assertEqual({'image_href': u'image3.html'}, next(result))
        self.assertEqual({'image_href': u'image4.html'}, next(result))
        self.assertEqual({'image_href': u'image5.html'}, next(result))

        with self.assertRaises(StopIteration):
            next(result)

FYI, I discovered Betamax at PyCon 2015 thanks to Ian Cordasco's talk.

Hadrien

The newly added Spider Contracts are worth trying. They give you a simple way to add tests without requiring a lot of code.
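For reference, contracts live in the callback's docstring and are run with the scrapy check command; a minimal sketch (the URL and field names are illustrative) looks like this:

def parse(self, response):
    """This callback is checked by the `scrapy check` command.

    @url http://www.example.com/some-category.html
    @returns items 1 25
    @returns requests 0
    @scrapes title content
    """
    ...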

Shane Evans
  • It is very poor at the moment. You have to write your own contracts to check anything more complicated than *parsing this page returns N items with fields `foo` and `bar` filled with any data*. – Anton Egorov Oct 21 '13 at 09:30
  • It doesn't serve the purpose. I tried changing my selectors to force empty responses and it still passed all the contracts. – Raheel Feb 24 '18 at 12:47

This is a very late answer, but I've been annoyed with Scrapy testing, so I wrote scrapy-test, a framework for testing Scrapy crawlers against defined specifications.

It works by defining test specifications rather than static output. For example, if we are crawling this sort of item:

{
    "name": "Alex",
    "age": 21,
    "gender": "Female",
}

We can define a scrapy-test ItemSpec:

from scrapytest.tests import Match, MoreThan, LessThan, Type
from scrapytest.spec import ItemSpec

class MySpec(ItemSpec):
    name_test = Match('.{3,}')  # name should be at least 3 characters long
    age_test = Type(int), MoreThan(18), LessThan(99)
    gender_test = Match('Female|Male')

The same idea works for Scrapy stats, via a StatsSpec:

from scrapytest.spec import StatsSpec
from scrapytest.tests import MoreThan

class MyStatsSpec(StatsSpec):
    validate = {
        "item_scraped_count": MoreThan(0),
    }

Afterwards it can be run against live or cached results:

$ scrapy-test 
# or
$ scrapy-test --cache

I've been running cached runs while developing and daily cron jobs for detecting website changes.

Granitosaurus

I'm using Twisted's trial to run tests, similar to Scrapy's own tests. It already starts a reactor, so I make use of the CrawlerRunner without worrying about starting and stopping one in the tests.

Stealing some ideas from the check and parse Scrapy commands, I ended up with the following base TestCase class to run assertions against live sites:

from twisted.trial import unittest

from scrapy.crawler import CrawlerRunner
from scrapy.http import Request
from scrapy.item import BaseItem
from scrapy.utils.spider import iterate_spider_output

class SpiderTestCase(unittest.TestCase):
    def setUp(self):
        self.runner = CrawlerRunner()

    def make_test_class(self, cls, url):
        """
        Make a class that proxies to the original class,
        sets up a URL to be called, and gathers the items
        and requests returned by the parse function.
        """
        class TestSpider(cls):
            # This is a once used class, so writing into
            # the class variables is fine. The framework
            # will instantiate it, not us.
            items = []
            requests = []

            def start_requests(self):
                req = super(TestSpider, self).make_requests_from_url(url)
                req.meta["_callback"] = req.callback or self.parse
                req.callback = self.collect_output
                yield req

            def collect_output(self, response):
                try:
                    cb = response.request.meta["_callback"]
                    for x in iterate_spider_output(cb(response)):
                        if isinstance(x, (BaseItem, dict)):
                            self.items.append(x)
                        elif isinstance(x, Request):
                            self.requests.append(x)
                except Exception as ex:
                    print("ERROR", "Could not execute callback: ", ex)
                    raise ex

                # Returning any requests here would make the crawler follow them.
                return None

        return TestSpider

Example:

# needs: from twisted.internet import defer
@defer.inlineCallbacks
def test_foo(self):
    tester = self.make_test_class(FooSpider, 'https://foo.com')
    yield self.runner.crawl(tester)
    self.assertEqual(len(tester.items), 1)
    self.assertEqual(len(tester.requests), 2)

or perform one request in the setup and run multiple tests against the results:

@defer.inlineCallbacks
def setUp(self):
    super(FooTestCase, self).setUp()
    # assumes a `tester = None` class attribute on FooTestCase,
    # so the (slow) crawl runs only once for the whole test case
    if FooTestCase.tester is None:
        FooTestCase.tester = self.make_test_class(FooSpider, 'https://foo.com')
        yield self.runner.crawl(self.tester)

def test_foo(self):
    self.assertEqual(len(self.tester.items), 1)
Aa'Koshh

Slightly simpler: drop the fake_response_from_file helper from the accepted answer and use a Selector directly:

import unittest
from spiders.my_spider import MySpider
from scrapy.selector import Selector


class TestParsers(unittest.TestCase):


    def setUp(self):
        self.spider = MySpider(limit=1)
        self.html = Selector(text=open("some.htm", 'r').read())


    def test_some_parse(self):
        expected = "some-text"
        result = self.spider.some_parse(self.html)
        self.assertEqual(result, expected)


if __name__ == '__main__':
    unittest.main()
b_dev
  • This works for me; however, if my parse function has a check on `response.url`, it throws an error saying `'Selector' object has no attribute 'url'` – addicted Aug 08 '20 at 09:41
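A possible workaround for that response.url issue (a sketch, not from the answer above): build an HtmlResponse in setUp instead of a bare Selector, so the parse method sees a real response object with a .url attribute:

import unittest
from scrapy.http import HtmlResponse
from spiders.my_spider import MySpider


class TestParsers(unittest.TestCase):

    def setUp(self):
        self.spider = MySpider(limit=1)
        with open("some.htm", 'rb') as f:
            body = f.read()
        # HtmlResponse supports .xpath()/.css() and carries the URL
        self.response = HtmlResponse(url='http://www.example.com', body=body, encoding='utf-8')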

I'm using Scrapy 1.3.0, and the fake_response_from_file function raises an error on this line:

response = Response(url=url, request=request, body=file_content)

I get:

raise AttributeError("Response content isn't text")

The solution is to use TextResponse instead, and it works OK. For example:

response = TextResponse(url=url, request=request, body=file_content)     

Thanks a lot.
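For completeness, a minimal sketch of the adjusted helper (same layout as the accepted answer's fake_response_from_file, just swapping in TextResponse):

from scrapy.http import TextResponse, Request

def fake_response_from_file(file_name, url='http://www.example.com'):
    request = Request(url=url)
    with open(file_name, 'rb') as f:
        body = f.read()
    # TextResponse carries an encoding, so .css()/.xpath() selectors work
    return TextResponse(url=url, request=request, body=body, encoding='utf-8')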

Kfeina

Similar to Hadrien's answer but for pytest: pytest-vcr.

import requests
import pytest
from scrapy.http import HtmlResponse

# `url` and `target` are expected to come from pytest fixtures;
# `Spider` stands in for your own spider class.
@pytest.mark.vcr()
def test_parse(url, target):
    response = requests.get(url)  # recorded to / replayed from a VCR cassette
    scrapy_response = HtmlResponse(url, body=response.content)
    assert Spider().parse(scrapy_response) == target


You can follow the snippet from the Scrapy site for running Scrapy from a script. Then you can make any kind of assertions you'd like on the returned items.
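A minimal sketch of that approach with the current API (the spider name is illustrative, and collecting items via the item_scraped signal is my addition, not part of the original snippet):

from scrapy import signals
from scrapy.crawler import CrawlerProcess

from scrapyproject.spiders.osdir_spider import DirectorySpider  # your spider here


def test_crawl_collects_items():
    items = []

    def collect(item, response, spider):
        items.append(item)

    process = CrawlerProcess(settings={'LOG_ENABLED': False})
    crawler = process.create_crawler(DirectorySpider)
    # collect every scraped item as it passes through the engine
    crawler.signals.connect(collect, signal=signals.item_scraped)
    process.crawl(crawler)
    process.start()  # blocks until the crawl finishes; the reactor cannot be restarted in the same process

    assert len(items) > 0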

ciferkey

https://github.com/ThomasAitken/Scrapy-Testmaster

This is a package I wrote that significantly extends the functionality of the Scrapy Autounit library and takes it in a different direction (allowing for easy dynamic updating of test cases and merging the processes of debugging and test-case generation). It also includes a modified version of the Scrapy parse command (https://docs.scrapy.org/en/latest/topics/commands.html#std-command-parse).

Noam Hudson
  • Can you explain a bit more? – Dieter Meemken May 22 '20 at 11:49
  • In short, the idea is that you can devise custom rules for validating your output, then you can run either one-off requests to specific urls or run your full spider, and it will automatically check the results of these requests against your custom rules. If the results pass your custom rules, then testcases are generated which can in future be run statically to check that modifications to your code have not broken anything. Furthermore, if you want to check that the website has changed, then you also have the option to re-create the original requests to generate fresh testcases. – Noam Hudson May 26 '20 at 01:59