
How can I test a Scrapy spider against online data?

I know from this post that it is possible to test a spider against offline data.

My goal is to check whether my spider still extracts the right data from a page, or whether the page has changed. I extract the data via XPath, and sometimes the page receives an update and my scraper no longer works. I would love to have the test as close to my code as possible, e.g. using the spider and the Scrapy setup and just hooking into the parse method.

  • Have you tried Spider Contracts? http://doc.scrapy.org/en/latest/topics/contracts.html – Valdir Stumm Junior Feb 07 '16 at 18:50
  • Yes, thank you. It is my plan B; still, I would like a real "test" because I want to do more after the check. – lony Feb 07 '16 at 20:08
  • Write a scrapy pipeline for the data values that you expect. If your scraper doesn't scrape the expected value for the field then you should raise the scrapy [DropItem](http://scrapy.readthedocs.org/en/latest/topics/exceptions.html#dropitem) exception –  Feb 16 '16 at 03:59
  • Does that address your issue? --> http://stackoverflow.com/questions/6456304/scrapy-unit-testing/38214137#38214137 – Hadrien Jul 05 '16 at 23:14
  • Sounds like it, I will look into it! Thanks – lony Jul 06 '16 at 11:31
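
For reference, the two approaches suggested in the comments above look roughly like this. A minimal sketch based on the Scrapy documentation; the spider name myspider, the example URL, and the title field are illustrative placeholders, not from the original thread:

import scrapy
from scrapy.exceptions import DropItem

class MySpider(scrapy.Spider):
    name = 'myspider'

    def parse(self, response):
        """Parse a sample page; the @-annotations below are Spider
        Contracts, checked by running `scrapy check myspider`.

        @url http://www.example.com
        @returns items 1
        @scrapes title
        """
        yield {'title': response.xpath('//title/text()').extract_first()}

class ValidateFieldsPipeline:
    """Pipeline variant: drop items whose expected fields were not scraped."""

    def process_item(self, item, spider):
        if not item.get('title'):
            raise DropItem('Missing title in %s' % item)
        return item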

1 Answer


Referring to the link you provided, you could try this method for online testing; I used it for a problem similar to yours. Instead of reading the request from a file, use the Requests library to fetch the live web page and compose a Scrapy response from the response Requests returns, like below:

import requests

from scrapy.http import Request, TextResponse

def online_response_from_url(url=None):
    """Fetch a live page with Requests and wrap it in a Scrapy response."""
    if not url:
        url = 'http://www.example.com'

    # The Scrapy request the composed response will be tied to.
    request = Request(url=url)

    # Fetch the live page with the Requests library.
    oresp = requests.get(url)

    # Compose a Scrapy TextResponse (not the base Response, which has
    # no encoding argument) from the fetched page.
    response = TextResponse(url=url, request=request,
                            body=oresp.text, encoding='utf-8')

    return response
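
With that helper you can feed a live response straight into the spider's parse method, as the question asks. A minimal sketch, assuming a spider class MySpider whose parse method yields items with a title field (both placeholders):

def test_parse_live_page():
    spider = MySpider()
    response = online_response_from_url('http://www.example.com')
    items = list(spider.parse(response))
    # If the page layout changed, the XPath queries come back empty.
    assert items and items[0].get('title'), 'extraction failed - page may have changed'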