
I'm about to scrape about 50,000 records from a real estate website (with Scrapy). The programming has been done and tested, and the database is properly designed.

But I want to be prepared for unexpected events. How do I go about actually running the scrape flawlessly, with minimal risk of failure and lost time?

More specifically:

  • Should I carry it out in phases (scraping in smaller batches)?
  • What and how should I log?
  • Which other points of attention should I take into account before launching?
S Leon

1 Answer


First of all, study the following topics to get a general idea of how to be a good web-scraping citizen:


In general, first, you need to make sure you are legally allowed to scrape this particular website and follow its Terms of Use. Also, check the website's robots.txt and respect the rules listed there (for example, there may be a Crawl-delay directive set). A good idea would also be to contact the website's owners and let them know what you are going to do, or ask for permission.

Identify yourself by explicitly specifying a User-Agent header.
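
For example, a minimal settings.py sketch; the bot name, user-agent string, and contact URL are placeholders, not values from the original answer:

```python
# settings.py (sketch) -- identify the crawler and respect robots.txt.
# The names and the contact URL below are illustrative placeholders.
BOT_NAME = "realestate_crawler"

USER_AGENT = "realestate_crawler (+http://www.example.com/contact)"

# Let Scrapy honor the site's robots.txt automatically.
ROBOTSTXT_OBEY = True
```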

See also:


Should I carry it out in phases (scraping in smaller batches)?

This is what the `DOWNLOAD_DELAY` setting is about:

The amount of time (in secs) that the downloader should wait before downloading consecutive pages from the same website. This can be used to throttle the crawling speed to avoid hitting servers too hard.

`CONCURRENT_REQUESTS_PER_DOMAIN` and `CONCURRENT_REQUESTS_PER_IP` are also relevant.

Tweak these settings so that you don't hit the website's servers too often.
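
As a rough sketch (the values are illustrative and should be tuned against the site's robots.txt and your own tests), the relevant part of settings.py could look like this:

```python
# settings.py (sketch) -- throttle the crawl; the values are examples, not recommendations.
DOWNLOAD_DELAY = 2                  # wait ~2 seconds between requests to the same site
CONCURRENT_REQUESTS_PER_DOMAIN = 2  # at most 2 requests in flight per domain
CONCURRENT_REQUESTS_PER_IP = 0      # 0 = disabled; if non-zero, it overrides the per-domain limit

# Optionally let the AutoThrottle extension adjust the delay dynamically.
AUTOTHROTTLE_ENABLED = True
```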

What and how should I log?

The information that Scrapy puts on the console is quite extensive, but you may want to log all the errors and exceptions raised while crawling. I personally like the idea of listening for the spider_error signal to be fired, see:
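
As a minimal sketch of that idea (the extension class and logger name are made up for the example), an extension can subscribe to spider_error and write every failure to a dedicated logger:

```python
# errlog.py (sketch) -- log every spider_error to a dedicated logger.
import logging

from scrapy import signals

logger = logging.getLogger("spider_errors")


class SpiderErrorLogger(object):

    @classmethod
    def from_crawler(cls, crawler):
        ext = cls()
        crawler.signals.connect(ext.spider_error, signal=signals.spider_error)
        return ext

    def spider_error(self, failure, response, spider):
        # failure is a twisted.python.failure.Failure instance
        logger.error("Error on %s: %s", response.url, failure.getTraceback())
```

Enable it through the EXTENSIONS setting, e.g. `EXTENSIONS = {"myproject.errlog.SpiderErrorLogger": 500}` (the module path is, again, just an example).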

Which other points of attention should I take into account before launching?

You still have several things to think about.

At some point, you may get banned. There is always a reason for this; the most obvious would be that you are still crawling the site too hard and they don't like it. There are certain techniques/tricks to avoid getting banned, like rotating IP addresses, using proxies, web scraping in the cloud, etc., see:
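
Purely as an illustration of the proxy-rotation idea (the proxy addresses are placeholders), a tiny downloader middleware can set request.meta['proxy'], which the built-in HttpProxyMiddleware then uses:

```python
# middlewares.py (sketch) -- pick a random proxy for every outgoing request.
# The proxy addresses are placeholders for your own pool.
import random

PROXIES = [
    "http://10.0.0.1:8080",
    "http://10.0.0.2:8080",
]


class RandomProxyMiddleware(object):

    def process_request(self, request, spider):
        request.meta["proxy"] = random.choice(PROXIES)
```

Activate it via the DOWNLOADER_MIDDLEWARES setting in settings.py.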

Another thing to worry about might be crawling speed and scaling; at this point you may want to think about distributing your crawling process. This is where scrapyd would help, see:
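
If you end up going the scrapyd route, jobs are started through its HTTP JSON API; a sketch, assuming a default scrapyd instance on port 6800 and placeholder project/spider names:

```python
# schedule_crawl.py (sketch) -- start a spider on a scrapyd instance via its HTTP API.
# "realestate" and "listings" are placeholder project/spider names.
import requests

response = requests.post(
    "http://localhost:6800/schedule.json",
    data={"project": "realestate", "spider": "listings"},
)
print(response.json())  # e.g. {"status": "ok", "jobid": "..."}
```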

Still, make sure you are not crossing the line and are staying on the legal side.

alecxe
  • +1 summarized what I was trying to write. =) Other settings that may be useful are `CONCURRENT_REQUESTS_PER_{DOMAIN,IP}` and there is also the [Autothrottle extension](http://scrapy.readthedocs.org/en/latest/topics/autothrottle.html). – Elias Dorneles Nov 15 '14 at 17:25
  • @elias thank you, added a note about `CONCURRENT_REQUESTS_PER_{DOMAIN,IP}` settings. Thanks for the point about `autothrottle`. – alecxe Nov 15 '14 at 17:41
  • Answer comprehensive on the ethical side. Still having some practical doubts though. For example: how do I minimize negative effects from a software/hardware problem during the scrape, or an internet connection loss? – S Leon Nov 15 '14 at 18:48
  • And as to the batches, I wasn't exactly referring to DOWNLOAD_DELAY. I was asking myself if there's any advantage to scraping the first 10,000 records, then the next 10,000 (in a separate run), and so on... – S Leon Nov 15 '14 at 18:53
  • @SLeon as for the internet connection loss, `Scrapy` has [`RetryMiddleware`](http://doc.scrapy.org/en/latest/topics/downloader-middleware.html#module-scrapy.contrib.downloadermiddleware.retry) built in; by default, it will retry 2 times if a page wasn't downloaded. As for batches, I'm not really sure what the benefits of splitting the records into separate consecutive runs would be. – alecxe Nov 16 '14 at 04:40
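
For reference, the retry behaviour mentioned in the last comment is controlled by a few settings; a sketch with illustrative values (RETRY_TIMES defaults to 2, matching the "retry 2 times" above):

```python
# settings.py (sketch) -- tune the built-in RetryMiddleware; the values are illustrative.
RETRY_ENABLED = True
RETRY_TIMES = 2                               # retry each failed page up to 2 times
RETRY_HTTP_CODES = [500, 502, 503, 504, 408]  # HTTP codes that trigger a retry
```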