Responsible time delays - web crawling

Question

What is a responsible / ethical time delay to put in a web crawler that only crawls one root page?

I'm using time.sleep(#) between successive calls to requests.get(url).

I'm looking for a rough idea of which timescales are:
1. Way too conservative
2. Standard
3. Going to cause problems / get you noticed

I want to touch every page (at least 20,000, probably a lot more) meeting certain criteria. Is this feasible within a reasonable timeframe?

EDIT
This question is less about avoiding being blocked (though any relevant info would be appreciated) and more about which time delays do not cause issues for the host website / servers. I've tested with 10-second delays on around 50 pages. I just don't have a clue whether I'm being overcautious.

score 1 · Accepted Answer · answered Aug 22 '17 at 02:36

I'd check their robots.txt. If it lists a crawl-delay, use it! If not, try something reasonable (this depends on the size of the page). If it's a large page, try 2 requests per second. If it's a simple .txt file, 10 requests per second should be fine.
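As a rough illustration of the robots.txt check, here is a minimal sketch using Python's standard urllib.robotparser. The helper name choose_delay, the sample robots.txt text, and the 0.5-second fallback (taken from the "2 requests per second" suggestion) are assumptions for the example, not part of the original answer.

```python
from urllib.robotparser import RobotFileParser


def choose_delay(robots_txt, user_agent="*", default_delay=0.5):
    """Return the Crawl-delay (seconds) for user_agent, or default_delay if none is listed."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network access is needed here.
    parser.parse(robots_txt.splitlines())
    delay = parser.crawl_delay(user_agent)
    return default_delay if delay is None else delay


# Hypothetical robots.txt content, used instead of a live download.
sample = "User-agent: *\nCrawl-delay: 5\n"
print(choose_delay(sample))
```

For a live site, the same parser can fetch the file itself via set_url() followed by read(); robots.txt normally sits at the root of the host, i.e. at /robots.txt.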

If all else fails, contact the site owner to see what they're capable of handling nicely.

(I'm assuming this is an amateur server with minimal bandwidth.)
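On the feasibility part of the question, a back-of-the-envelope estimate may help. The sketch below counts only the time spent sleeping between requests, so it is a lower bound that ignores download and processing time; the function name crawl_hours is hypothetical, and the delays are the 10 seconds from the question plus the two rates suggested above.

```python
def crawl_hours(pages, delay_seconds):
    """Lower bound on crawl time in hours: sleep time only, ignoring download time."""
    return pages * delay_seconds / 3600


# 20,000 pages at the delays discussed in this thread.
for delay in (10, 0.5, 0.1):
    print(f"{delay:>4} s delay -> {crawl_hours(20_000, delay):.1f} h")
```

Under these assumptions, 20,000 pages take roughly 56 hours at a 10-second delay, under 3 hours at 2 requests per second, and a bit over half an hour at 10 requests per second.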

jhpratt

How do you find the robots.txt? I have looked in view source. – Andrew Allen Aug 22 '17 at 02:39
What if the site doesn't have one? I tried www.xxxxxx.co.uk/robots.txt – Andrew Allen Aug 22 '17 at 02:43
Then keep reading! Use what you would think reasonable. I know that isn't too helpful, but without much information that's the best I can say. – jhpratt Aug 22 '17 at 02:44
