First of all, study the following topics to get a general idea of how to be a good web-scraping citizen:
In general, first make sure that you are legally allowed to scrape this particular website and that you follow its Terms of Use. Also, check the website's robots.txt and respect the rules listed there (for example, a Crawl-delay directive may be set). It is also a good idea to contact the website owners, let them know what you are going to do, or ask for permission.
Identify yourself by explicitly specifying a User-Agent
header.
See also:
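Both of the previous points map directly onto plain Scrapy settings. A minimal sketch of a settings.py (the User-Agent string is a placeholder; use one that identifies your crawler and gives the site owner a way to reach you):

```python
# settings.py -- sketch; the User-Agent string is a placeholder

# Check robots.txt and skip requests it disallows.
ROBOTSTXT_OBEY = True

# Identify yourself explicitly instead of using an anonymous default.
USER_AGENT = "my-crawler/1.0 (+https://example.com/contact)"
```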
Should I carry it out in phases (scraping in smaller batches)?
This is what the DOWNLOAD_DELAY setting is about; quoting the Scrapy documentation:
The amount of time (in secs) that the downloader should wait before
downloading consecutive pages from the same website. This can be used
to throttle the crawling speed to avoid hitting servers too hard.
CONCURRENT_REQUESTS_PER_DOMAIN and CONCURRENT_REQUESTS_PER_IP are also relevant. Tweak these settings to avoid hitting the website's servers too often.
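As a sketch, the relevant knobs in settings.py look like this; the numbers are illustrative, not recommendations, so pick values the target site can comfortably handle:

```python
# settings.py -- illustrative values only

# Seconds to wait between consecutive requests to the same website.
DOWNLOAD_DELAY = 2.0

# Keep per-domain / per-IP parallelism low so requests stay gentle.
CONCURRENT_REQUESTS_PER_DOMAIN = 2
CONCURRENT_REQUESTS_PER_IP = 2

# Optionally let Scrapy adapt the delay to the server's response times.
AUTOTHROTTLE_ENABLED = True
```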
What and how should I log?
The information that Scrapy prints to the console is quite extensive, but you may want to log all the errors and exceptions raised while crawling. I personally like the idea of listening for the spider_error signal to be fired, see:
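Here is a minimal sketch of that idea; the spider name and URL are placeholders. The from_crawler hook connects a handler that logs the failing URL together with the traceback:

```python
import scrapy
from scrapy import signals


class CarefulSpider(scrapy.Spider):
    name = "careful"  # placeholder name
    start_urls = ["https://example.com/"]  # placeholder URL

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super().from_crawler(crawler, *args, **kwargs)
        # spider_error fires whenever a callback raises an exception.
        crawler.signals.connect(spider.on_spider_error,
                                signal=signals.spider_error)
        return spider

    def on_spider_error(self, failure, response, spider):
        # Log the offending URL together with the full traceback.
        self.logger.error("Callback failed for %s:\n%s",
                          response.url, failure.getTraceback())

    def parse(self, response):
        pass  # normal parsing logic goes here
```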
Which other points of attention should I take into account before launching?
You still have several things to think about.
At some point, you may get banned. There is always a reason for this; the most obvious one is that you are still crawling the site too hard and the owners don't like it. There are certain techniques/tricks to avoid getting banned, like rotating IP addresses, using proxies, web-scraping in the cloud, etc., see:
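As one illustration of the rotating idea, here is a hypothetical downloader middleware that picks a random User-Agent per request; the class name and agent strings are made up for this sketch:

```python
import random


class RotateUserAgentMiddleware:
    # Placeholder identifiers; substitute a pool of real User-Agent strings.
    USER_AGENTS = [
        "agent-one/1.0",
        "agent-two/1.0",
        "agent-three/1.0",
    ]

    def process_request(self, request, spider):
        # Overwrite the header before the request goes out.
        request.headers["User-Agent"] = random.choice(self.USER_AGENTS)
        # Returning None lets Scrapy continue processing the request.
        return None
```

Enable it via the DOWNLOADER_MIDDLEWARES setting; the module path and priority number there depend on your project layout.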
Another thing to worry about might be crawling speed and scaling; at this point you may want to think about distributing your crawling process. This is where scrapyd would help, see:
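For instance, once scrapyd is running (it listens on port 6800 by default) and your project has been deployed to it, a crawl can be scheduled over its HTTP API; the project and spider names below are placeholders:

```python
import urllib.parse
import urllib.request

# POST to scrapyd's schedule.json endpoint to queue a spider run.
data = urllib.parse.urlencode({"project": "myproject",
                               "spider": "myspider"}).encode()
with urllib.request.urlopen("http://localhost:6800/schedule.json",
                            data=data) as response:
    print(response.read().decode())  # JSON containing the job id
```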
Still, make sure you are not crossing the line and that you stay on the legal side.