I am writing web spiders to scrape products from websites using the Scrapy framework in Python, and I was wondering what the best practices are for calculating the coverage and missing items of the spiders I've written.
What I'm doing right now is logging the cases that I was unable to parse or that raised exceptions.
For example: when I expect a specific format for a product's price or a place's address and find that my regular expressions don't match the scraped strings, or when my XPath selectors for specific data return nothing.
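Concretely, that logging looks roughly like this (a minimal sketch: the URL, selectors, price regex, and stat names are placeholders, not from any real site):

```python
import re

import scrapy

# Placeholder for the price format I actually expect, e.g. "$12.99"
PRICE_RE = re.compile(r"^\$\d+(\.\d{2})?$")


class ProductSpider(scrapy.Spider):
    name = "products"
    start_urls = ["https://example.com/products"]  # placeholder URL

    def parse(self, response):
        for product in response.xpath("//div[@class='product']"):
            price = product.xpath(".//span[@class='price']/text()").get()
            if price is None:
                # XPath matched nothing -- count it so coverage shows up in the stats
                self.crawler.stats.inc_value("coverage/price_selector_miss")
                self.logger.warning("no price node on %s", response.url)
                continue
            if not PRICE_RE.match(price):
                # Regex didn't match the scraped string -- log and count it
                self.crawler.stats.inc_value("coverage/price_regex_miss")
                self.logger.warning("unparsed price %r on %s", price, response.url)
                continue
            yield {"price": price}
```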
Sometimes, when products are listed on one page or across several, I also use `curl` and `grep` to roughly count the number of products and compare that against what the spider collected, but I was wondering if there are better practices for handling this.
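For illustration, here is roughly the kind of check I mean, done inside the spider instead of with `curl` and `grep` (assuming the listing page prints a total like "123 products found"; the result-count selector, regex, and stat names are placeholders I'd adjust per site):

```python
import re

import scrapy


class ProductSpider(scrapy.Spider):
    name = "products"
    start_urls = ["https://example.com/products"]  # placeholder URL

    def parse(self, response):
        # Many listing pages print a total like "123 products found"; record it once
        total_text = response.xpath("//span[@class='result-count']/text()").get("")
        match = re.search(r"(\d+)", total_text)
        if match:
            self.crawler.stats.max_value("coverage/expected_items", int(match.group(1)))
        for product in response.xpath("//div[@class='product']"):
            yield {"name": product.xpath(".//h2/text()").get()}

    def closed(self, reason):
        # Compare the site's advertised total with Scrapy's built-in item counter
        stats = self.crawler.stats
        expected = stats.get_value("coverage/expected_items", 0)
        scraped = stats.get_value("item_scraped_count", 0)
        self.logger.info("scraped %d of %d expected items", scraped, expected)
```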