
How can I test a Scrapy spider against online data?

I know from this post that it is possible to test a spider against offline data.

My goal is to check whether my spider still extracts the right data from a page, or whether the page has changed. I extract the data via XPath, and sometimes the page receives an update and my scraper no longer works. I would love to have the test as close to my code as possible, e.g. using the spider and the Scrapy setup and just hooking into the parse method.

  • Have you tried Spider Contracts? http://doc.scrapy.org/en/latest/topics/contracts.html – Valdir Stumm Junior Feb 07 '16 at 18:50
  • Yes, thank you. It is my plan B; still, I would like a real "test" because I want to do more after the check. – lony Feb 07 '16 at 20:08
  • Write a scrapy pipeline for the data values that you expect. If your scraper doesn't scrape the expected value for the field then you should raise the scrapy [DropItem](http://scrapy.readthedocs.org/en/latest/topics/exceptions.html#dropitem) exception –  Feb 16 '16 at 03:59
  • Does that address your issue? --> http://stackoverflow.com/questions/6456304/scrapy-unit-testing/38214137#38214137 – Hadrien Jul 05 '16 at 23:14
  • Sounds like it, I will look into it! Thanks – lony Jul 06 '16 at 11:31
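
For reference, the two approaches suggested in the comments above look roughly like this. A minimal sketch based on the Scrapy documentation; the spider name myspider, the example URL, and the title field are illustrative placeholders, not from the original thread:

import scrapy
from scrapy.exceptions import DropItem

class MySpider(scrapy.Spider):
    name = 'myspider'

    def parse(self, response):
        """Parse a sample page; the @-annotations below are Spider
        Contracts, checked by running `scrapy check myspider`.

        @url http://www.example.com
        @returns items 1
        @scrapes title
        """
        yield {'title': response.xpath('//title/text()').extract_first()}

class ValidateFieldsPipeline:
    """Pipeline variant: drop items whose expected fields were not scraped."""

    def process_item(self, item, spider):
        if not item.get('title'):
            raise DropItem('Missing title in %s' % item)
        return item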

1 Answer


Referring to the link you provided, you could try this method for online testing; I used it for a problem similar to yours. Instead of reading the request from a file, use the Requests library to fetch the live web page and compose a Scrapy response from the response Requests returns, like below:

import requests

from scrapy.http import Request, TextResponse

def online_response_from_url(url=None):
    """Fetch a live page with Requests and wrap it in a Scrapy response."""
    if not url:
        url = 'http://www.example.com'

    # The Scrapy request the composed response will be tied to.
    request = Request(url=url)

    # Fetch the live page with the Requests library.
    oresp = requests.get(url)

    # Compose a Scrapy TextResponse (not the base Response, which has
    # no encoding argument) from the fetched page.
    response = TextResponse(url=url, request=request,
                            body=oresp.text, encoding='utf-8')

    return response
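
With that helper you can feed a live response straight into the spider's parse method, as the question asks. A minimal sketch, assuming a spider class MySpider whose parse method yields items with a title field (both placeholders):

def test_parse_live_page():
    spider = MySpider()
    response = online_response_from_url('http://www.example.com')
    items = list(spider.parse(response))
    # If the page layout changed, the XPath queries come back empty.
    assert items and items[0].get('title'), 'extraction failed - page may have changed'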