
I wrote a script that scrapes a particular website. However, because of the way this website is designed, I have to make a separate HTTP request for every page. Given that there are about 2,000 pages I need to scrape, I decided to test my script on just the first 100.

I added some delay between requests so as not to overload the server, but after roughly the 70th request the host refuses the connection. If I restart the script, it works fine again until, once more, around the 70th request.

I've tried adding a 10-minute pause before retrying when the host refuses to connect, but that doesn't seem to help. What would be the best way to work around this anti-scraping measure?

Below is an example of how my script looks.

import random
import time

import requests
from urllib.error import URLError

URL = 'http://www.url/here/{page}'

for i in range(1, 101):
    try:
        r = requests.get(URL.format(page=i))
    except URLError:
        time.sleep(600)  # wait 10 minutes before retrying the same page
        r = requests.get(URL.format(page=i))
    finally:
        # random pause between requests so the server isn't hammered
        pause = random.randint(10, 20)
        time.sleep(pause)
spicypumpkin
  • Does it display any error? – Adriano Martins Mar 31 '18 at 03:50
  • @AdrianoMartins It's just a `urllib.request.URLError` due to the host refusing the connection. I can't post the exact error because I'm not at my PC, but it's a simple error that I don't think I need to post. I assume the error is caused by me making multiple requests in a short span of time, but seeing that adding delays to my script doesn't help, I'm not exactly sure what's wrong. Maybe I need to try switching the IP every X number of calls? – spicypumpkin Mar 31 '18 at 04:21
  • The error is way too important - without it there is no way to be certain about what is causing it. If a simple restart right after a failure fixes the issue, it seems unlikely to be a host scraping limit. – Adriano Martins Mar 31 '18 at 04:30
  • @AdrianoMartins `URLError: ` Looked up `WinError 10061`; according to [this question](https://stackoverflow.com/questions/20437701/winerror-10061-no-connection-could-be-made?utm_medium=organic&utm_source=google_rich_qa&utm_campaign=google_rich_qa) it might have something to do with the browser settings. I am using PhantomJS because I cannot get the necessary data from the website without JS, meaning that a simple HTTP request does not return the full HTML. I am not running a proxy either. – spicypumpkin Mar 31 '18 at 05:42
  • @AdrianoMartins When I restart the script immediately after the exception, it works just fine and runs until it reaches the 70-something-th request. I wait 10 minutes before retrying when the error happens, but at that point the connection is still refused. I believe I clearly explained the error in the question, and the error itself does not give much information. Based on [this question](https://stackoverflow.com/questions/28869168/connectionrefusederror-winerror-10061-no-connection-could-be-made-because-the), the issue may be related to the number of requests, which is exactly the case I have described. – spicypumpkin Mar 31 '18 at 05:44
  • @AdrianoMartins So basically: `URLError is caught -> waits 10 minutes -> retry the same URL -> same URLError is thrown, which is not caught -> restart the script after termination -> rinse and repeat` – spicypumpkin Mar 31 '18 at 05:45
  • It seems that this is a server-side protection then; there isn't much you can do to avoid it. You could try using `session = requests.Session(); session.get()` instead of `requests.get()`, reusing the same session so the block happens later, but most likely it will still happen. – Adriano Martins Mar 31 '18 at 17:02
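
A minimal sketch of that last suggestion, assuming the plain `requests` case shown in the question (the real script drives PhantomJS, so this would need adapting). The shared `Session`, the `timeout`, and catching `requests.exceptions.ConnectionError` are illustrative choices, not a confirmed fix:

import random
import time

import requests

URL = 'http://www.url/here/{page}'

# Reuse one Session so the same TCP connection (and any cookies the site
# sets) persist across requests instead of reconnecting for every page.
session = requests.Session()

for i in range(1, 101):
    try:
        r = session.get(URL.format(page=i), timeout=30)
    except requests.exceptions.ConnectionError:
        time.sleep(600)  # back off for 10 minutes, then retry the page once
        r = session.get(URL.format(page=i), timeout=30)
    # random pause between pages, as in the original script
    time.sleep(random.randint(10, 20))

As the comment above notes, even with a shared session the server-side block will most likely still trigger eventually.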

0 Answers