I wrote a script that scrapes a particular website. However, because of the way this website is designed, I have to make a separate HTTP request for every page. Given that there are about 2,000 pages I need to scrape, I decided to test my script on just the first 100.
I added a random delay between requests so I don't overload the server, but after roughly the 70th request the host refuses the connection. If I restart the script, it works fine until it hits roughly the 70th request again.
I've tried adding a 10-minute pause before retrying when the host refuses to connect, but that doesn't seem to help. What would be the best way to get around this anti-scraping measure?
Below is an example of how my script looks.
import random
import time

import requests

URL = 'http://www.url/here/{page}'

for i in range(1, 101):
    try:
        r = requests.get(URL.format(page=i))
    except requests.exceptions.ConnectionError:
        # Wait 10 minutes, then retry the same page once
        time.sleep(600)
        r = requests.get(URL.format(page=i))
    finally:
        # Random pause between requests so I don't overload the server
        pause = random.randint(10, 20)
        time.sleep(pause)
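Would restructuring the loop along these lines be the right direction: reusing a single requests.Session so the connection is kept alive, and retrying each page with a growing pause instead of one fixed 10-minute wait? This is only a rough sketch of the idea, not something I've confirmed fixes the problem, and fetch_page and MAX_RETRIES are just placeholder names I made up.

import random
import time

import requests

URL = 'http://www.url/here/{page}'
MAX_RETRIES = 3  # placeholder: how many times to retry one page

session = requests.Session()  # reuse one connection instead of opening a new one per request

def fetch_page(page):
    # Retry with an increasing pause: 60s, 120s, 240s, ...
    for attempt in range(MAX_RETRIES):
        try:
            return session.get(URL.format(page=page), timeout=30)
        except requests.exceptions.ConnectionError:
            time.sleep(60 * 2 ** attempt)
    # Give up on this page after MAX_RETRIES failed attempts
    return None

for i in range(1, 101):
    r = fetch_page(i)
    time.sleep(random.randint(10, 20))  # random pause between pages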