
I'm about to scrape about 50,000 records from a real estate website (with Scrapy). The programming has been done and tested, and the database is properly designed.

But I want to be prepared for unexpected events. How do I go about actually running the scrape flawlessly, with minimal risk of failure and lost time?

More specifically:

  • Should I carry it out in phases (scraping in smaller batches)?
  • What and how should I log?
  • Which other points of attention should I take into account before launching?
S Leon

1 Answer


First of all, study the following topics to get a general idea of how to be a good web-scraping citizen:


In general, first, you need to make sure you are legally allowed to scrape this particular website and follow its Terms of Use. Also, check the website's robots.txt and respect the rules listed there (for example, there may be a Crawl-delay directive set). A good idea would also be to contact the website's owners and let them know what you are going to do, or ask for permission.

Identify yourself by explicitly specifying a User-Agent header.
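
For example, a minimal settings.py sketch; the bot name, user-agent string, and contact URL are placeholders, not values from the original answer:

```python
# settings.py (sketch) -- identify the crawler and respect robots.txt.
# The names and the contact URL below are illustrative placeholders.
BOT_NAME = "realestate_crawler"

USER_AGENT = "realestate_crawler (+http://www.example.com/contact)"

# Let Scrapy honor the site's robots.txt automatically.
ROBOTSTXT_OBEY = True
```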

See also:


Should I carry it out in phases (scraping in smaller batches)?

This is what the `DOWNLOAD_DELAY` setting is about:

The amount of time (in secs) that the downloader should wait before downloading consecutive pages from the same website. This can be used to throttle the crawling speed to avoid hitting servers too hard.

`CONCURRENT_REQUESTS_PER_DOMAIN` and `CONCURRENT_REQUESTS_PER_IP` are also relevant.

Tweak these settings so that you don't hit the website's servers too often.
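
As a rough sketch (the values are illustrative and should be tuned against the site's robots.txt and your own tests), the relevant part of settings.py could look like this:

```python
# settings.py (sketch) -- throttle the crawl; the values are examples, not recommendations.
DOWNLOAD_DELAY = 2                  # wait ~2 seconds between requests to the same site
CONCURRENT_REQUESTS_PER_DOMAIN = 2  # at most 2 requests in flight per domain
CONCURRENT_REQUESTS_PER_IP = 0      # 0 = disabled; if non-zero, it overrides the per-domain limit

# Optionally let the AutoThrottle extension adjust the delay dynamically.
AUTOTHROTTLE_ENABLED = True
```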

What and how should I log?

The information that Scrapy puts on the console is quite extensive, but you may want to log all the errors and exceptions raised while crawling. I personally like the idea of listening for the spider_error signal to be fired, see:
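
As a minimal sketch of that idea (the extension class and logger name are made up for the example), an extension can subscribe to spider_error and write every failure to a dedicated logger:

```python
# errlog.py (sketch) -- log every spider_error to a dedicated logger.
import logging

from scrapy import signals

logger = logging.getLogger("spider_errors")


class SpiderErrorLogger(object):

    @classmethod
    def from_crawler(cls, crawler):
        ext = cls()
        crawler.signals.connect(ext.spider_error, signal=signals.spider_error)
        return ext

    def spider_error(self, failure, response, spider):
        # failure is a twisted.python.failure.Failure instance
        logger.error("Error on %s: %s", response.url, failure.getTraceback())
```

Enable it through the EXTENSIONS setting, e.g. `EXTENSIONS = {"myproject.errlog.SpiderErrorLogger": 500}` (the module path is, again, just an example).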

Which other points of attention should I take into account before launching?

You still have several things to think about.

At some point, you may get banned. There is always a reason for this; the most obvious would be that you are still crawling the site too hard and they don't like it. There are certain techniques/tricks to avoid getting banned, like rotating IP addresses, using proxies, web scraping in the cloud, etc., see:
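
Purely as an illustration of the proxy-rotation idea (the proxy addresses are placeholders), a tiny downloader middleware can set request.meta['proxy'], which the built-in HttpProxyMiddleware then uses:

```python
# middlewares.py (sketch) -- pick a random proxy for every outgoing request.
# The proxy addresses are placeholders for your own pool.
import random

PROXIES = [
    "http://10.0.0.1:8080",
    "http://10.0.0.2:8080",
]


class RandomProxyMiddleware(object):

    def process_request(self, request, spider):
        request.meta["proxy"] = random.choice(PROXIES)
```

Activate it via the DOWNLOADER_MIDDLEWARES setting in settings.py.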

Another thing to worry about might be crawling speed and scaling; at this point you may want to think about distributing your crawling process. This is where scrapyd would help, see:
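
If you end up going the scrapyd route, jobs are started through its HTTP JSON API; a sketch, assuming a default scrapyd instance on port 6800 and placeholder project/spider names:

```python
# schedule_crawl.py (sketch) -- start a spider on a scrapyd instance via its HTTP API.
# "realestate" and "listings" are placeholder project/spider names.
import requests

response = requests.post(
    "http://localhost:6800/schedule.json",
    data={"project": "realestate", "spider": "listings"},
)
print(response.json())  # e.g. {"status": "ok", "jobid": "..."}
```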

Still, make sure you are not crossing the line and are staying on the legal side.

alecxe
  • +1 summarized what I was trying to write. =) Other settings that may be useful are `CONCURRENT_REQUESTS_PER_{DOMAIN,IP}` and there is also the [Autothrottle extension](http://scrapy.readthedocs.org/en/latest/topics/autothrottle.html). – Elias Dorneles Nov 15 '14 at 17:25
  • @elias thank you, added a note about `CONCURRENT_REQUESTS_PER_{DOMAIN,IP}` settings. Thanks for the point about `autothrottle`. – alecxe Nov 15 '14 at 17:41
  • Answer comprehensive on the ethical side. Still having some practical doubts though. For example: how do I minimize negative effects from a software/hardware problem during the scrape, or an internet connection loss? – S Leon Nov 15 '14 at 18:48
  • And as to the batches, I wasn't exactly referring to DOWNLOAD_DELAY. I was asking myself if there's any advantage to scraping the first 10,000 records, then the next 10,000 (in a separate run), and so on... – S Leon Nov 15 '14 at 18:53
  • @SLeon as for the internet connection loss, `Scrapy` has [`RetryMiddleware`](http://doc.scrapy.org/en/latest/topics/downloader-middleware.html#module-scrapy.contrib.downloadermiddleware.retry) built in; by default, it will retry 2 times if a page wasn't downloaded. As for batches, I'm not really sure what the benefits of splitting the records into separate consecutive runs would be. – alecxe Nov 16 '14 at 04:40
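
For reference, the retry behaviour mentioned in the last comment is controlled by a few settings; a sketch with illustrative values (RETRY_TIMES defaults to 2, matching the "retry 2 times" above):

```python
# settings.py (sketch) -- tune the built-in RetryMiddleware; the values are illustrative.
RETRY_ENABLED = True
RETRY_TIMES = 2                               # retry each failed page up to 2 times
RETRY_HTTP_CODES = [500, 502, 503, 504, 408]  # HTTP codes that trigger a retry
```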