
I am writing web spiders to scrape products from websites using the Scrapy framework in Python. I was wondering what the best practices are for calculating the coverage and missing items of the written spiders.

What I'm using right now is logging the cases that I was unable to parse or that raised exceptions. For example: when I expect a specific format for a product's price or a place's address and find that my regular expressions don't match the scraped strings, or when my XPath selectors for specific data return nothing.

Sometimes, when products are listed on one page or spread across several, I also use curl and grep to roughly count the number of products, but I was wondering if there are better practices to handle this.
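Roughly, I do something like the following sketch, which is just the curl + grep idea expressed in Python (the URL and the pattern are placeholders, not my real targets):

import re
import urllib2  # Python 2 style, matching the rest of my setup

# Count how many times a per-product marker appears on a listing page;
# the marker and the URL are made up for illustration only.
html = urllib2.urlopen('http://example.com/products?page=1').read()
print(len(re.findall(r'class="product-item"', html)))

I then compare that rough count against the number of items the spider actually yielded.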

Hady Elsahar

1 Answer


The common approach is, yes, to log the error and exit the callback by returning nothing.

Example (product price is required):

from scrapy import log

# Inside a spider callback: bail out (yield nothing) if the required
# price field could not be extracted.
loader = ProductLoader(ProductItem(), response=response)
loader.add_xpath('price', '//span[@class="price"]/text()')
if not loader.get_output_value('price'):
    log.msg("Error fetching product price", level=log.ERROR)
    return

You can also use signals to catch and log all kinds of exceptions that happen while crawling (see the `spider_error` signal).

This basically follows the "Easier to ask forgiveness than permission" (EAFP) principle: you let the spider fail, then catch and process the error in one particular place, a signal handler.
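A minimal sketch of what such a handler could look like (the spider name and the handler body are assumptions of mine; Scrapy only requires a callable connected to the spider_error signal):

import scrapy
from scrapy import signals


class ProductsSpider(scrapy.Spider):
    name = 'products'  # hypothetical spider name

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super(ProductsSpider, cls).from_crawler(crawler, *args, **kwargs)
        # spider_error fires for every exception raised in a spider callback.
        crawler.signals.connect(spider.handle_spider_error,
                                signal=signals.spider_error)
        return spider

    def handle_spider_error(self, failure, response, spider):
        # Single place to log (or persist) the failing URL and the traceback.
        spider.logger.error('Error on %s:\n%s',
                            response.url, failure.getTraceback())

This assumes a Scrapy version recent enough to have Spider.from_crawler and spider.logger; on older versions you would connect the signal differently and log via scrapy.log instead.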


Other thoughts:

  • you can even store the response URLs and error tracebacks in a database for later review; this is still "logging", but in a structured form that can be more convenient to go through later
  • a good idea might be to create custom exceptions representing different crawling errors, for instance MissingRequiredFieldError and InvalidFieldFormatError, which you can raise when crawled fields don't pass validation (see the sketch after this list).
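For illustration, the custom exceptions could be as simple as the following (the class names come from the list above; the validation helper is just a hypothetical example of where you'd raise them):

class CrawlingError(Exception):
    """Base class for errors detected while parsing a response."""


class MissingRequiredFieldError(CrawlingError):
    """A required field (e.g. price) could not be extracted at all."""


class InvalidFieldFormatError(CrawlingError):
    """A field was extracted but did not match the expected format."""


def validate_price(raw_price):
    # Hypothetical check: raising a specific exception tells the
    # spider_error handler (or a try/except in the callback) exactly
    # which validation failed for which field.
    if raw_price is None:
        raise MissingRequiredFieldError('price')
    if not raw_price.strip().startswith('$'):
        raise InvalidFieldFormatError('price: %r' % raw_price)
    return raw_price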
alecxe
  • I personally prefer the EAFP style, so this answer is a relief, but does that mean that if I scrape 20 details from one web page my parse function will be full of 20 try/except clauses? – Hady Elsahar Nov 14 '14 at 04:10
  • @HadyElsahar nope, if you have a `spider_error` signal handler, all the exceptions raised while crawling will be "available" in the handler. Raising custom exceptions should ease error analysis and processing. – alecxe Nov 14 '14 at 04:14
  • Yes, but I prefer handling each detail in its own try/except clause so the spider can keep scraping the rest of the details if it fails to get [optional item], and also to get dedicated log messages for the missed cases saying which part actually failed (for example getting the price text or parsing it). – Hady Elsahar Nov 15 '14 at 05:08
  • Given that, does it look ugly to have lots of try/except clauses with log messages all over your code, or is that common in EAFP? – Hady Elsahar Nov 15 '14 at 05:09
  • @HadyElsahar I think it's pretty much OK if you want more control inside the callback and handling exceptions at a more "global" level is not an option. – alecxe Nov 15 '14 at 05:11