20

I am using Python to scrape pages. Until now I hadn't had any complicated issues.

The site that I'm trying to scrape uses a lot of security checks and has some mechanism to prevent scraping.

Using Requests and lxml I was able to scrape about 100-150 pages before getting IP-banned. Sometimes I even get banned on the first request (a new IP, never used before, from a different C block). I have tried spoofing headers and randomizing the time between requests; still the same result.

I have tried Selenium and got much better results. With Selenium I was able to scrape about 600-650 pages before getting banned. Here I have also tried to randomize requests (a 3-5 second delay between them, plus a time.sleep(300) call on every 300th request). Despite that, I'm getting banned.
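
Roughly, the pattern looks like the sketch below (simplified; fetch_page() is just a placeholder for my actual Selenium page load):

import random
import time

def scrape_all(urls):
    for i, url in enumerate(urls, start=1):
        fetch_page(url)                   # placeholder for the actual Selenium page load
        time.sleep(random.uniform(3, 5))  # random 3-5 second pause between requests
        if i % 300 == 0:
            time.sleep(300)               # longer break on every 300th request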

From this I can conclude that the site has some mechanism that bans an IP if it requests more than X pages in one open browser session, or something like that.

Based on your experience, what else should I try? Would closing and reopening the browser in Selenium help (for example, closing and reopening the browser after every 100th request, roughly as sketched below)? I was thinking about trying proxies, but there are about a million pages and it would be very expensive.
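
For reference, the restart idea would look roughly like this (a sketch only; the batch size of 100 is just the value from my example above):

from selenium import webdriver

def scrape_with_restarts(urls, batch_size=100):
    driver = webdriver.Firefox()
    try:
        for i, url in enumerate(urls, start=1):
            driver.get(url)
            # ... extract data from driver.page_source here ...
            if i % batch_size == 0:
                driver.quit()                 # close the browser after every batch
                driver = webdriver.Firefox()  # and start a completely fresh session
    finally:
        driver.quit()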

– RhymeGuy (edited by theAlse)

3 Answers

17

If you switch to the Scrapy web-scraping framework, you will be able to reuse a number of things that were made to prevent and tackle banning:

  • the AutoThrottle extension:

    This is an extension for automatically throttling crawling speed based on the load of both the Scrapy server and the website you are crawling.

  • rotating User-Agent headers, e.g. via the scrapy-fake-useragent middleware:

    Use a random User-Agent provided by fake-useragent for every request.
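
For illustration, a minimal settings.py sketch combining the two could look like this (the delay values are placeholders, and the middleware path assumes the scrapy-fake-useragent package is installed):

# settings.py -- example values, tune them for the target site

# AutoThrottle: adapt the crawl speed to the server's response times
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 5
AUTOTHROTTLE_MAX_DELAY = 60
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0

# General politeness
DOWNLOAD_DELAY = 3
CONCURRENT_REQUESTS_PER_DOMAIN = 1

# Rotate User-Agent headers via scrapy-fake-useragent
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'scrapy_fake_useragent.middleware.RandomUserAgentMiddleware': 400,
}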

– alecxe
  • I'm not a fan of Scrapy, but I might give it a try, although I'm not sure it will help me. I have used all of the things you recommend and was not able to get past the limit. – RhymeGuy Feb 01 '16 at 15:16
  • @RhymeGuy it's just a general answer so that it may help others visiting the topic. In your case, I would say switching IPs via a proxy is the way to go. Thanks. – alecxe Feb 01 '16 at 15:18
12

I had this problem too. I used urllib with Tor in Python 3.

  1. Download and install the Tor Browser.
  2. Test Tor. Open a terminal and type:

curl --socks5-hostname localhost:9050 http://site-that-blocked-you.com

If you see a result, it worked.

  3. Now test it in Python by running this code:
import socks    # provided by the PySocks package
import socket
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup

# Route all sockets through the local Tor SOCKS5 proxy
socks.set_default_proxy(socks.SOCKS5, "localhost", 9050)
socket.socket = socks.socksocket

req = Request('http://check.torproject.org', headers={'User-Agent': 'Mozilla/5.0'})
html = urlopen(req).read()
soup = BeautifulSoup(html, 'html.parser')
print(soup('title')[0].get_text())

If you see

Congratulations. This browser is configured to use Tor.

then it worked in Python too, and you are using Tor for your web scraping.
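
If the exit node's IP itself gets banned, you can ask Tor for a fresh circuit (and usually a fresh exit IP). One common way, not covered by the steps above, is the stem library; this sketch assumes the ControlPort (9051) and a control password are enabled in your torrc, and 'my_password' is just a placeholder:

from stem import Signal
from stem.control import Controller

# assumes "ControlPort 9051" and a HashedControlPassword are set in torrc
with Controller.from_port(port=9051) as controller:
    controller.authenticate(password='my_password')  # placeholder password
    controller.signal(Signal.NEWNYM)                  # request a new Tor circuit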

– Mohammad Reza (edited by Gruber)
6

You could use proxies.

You can buy several hundred IPs very cheaply and use Selenium as you have been doing. I also suggest varying the browser you use and other User-Agent parameters.

You could iterate over your proxies, using each single IP address to load only x pages and stopping before getting banned (a usage sketch follows the function below).

from selenium import webdriver

def load_proxy(PROXY_HOST, PROXY_PORT):
    # Build a Firefox profile that sends traffic through the given HTTP proxy
    fp = webdriver.FirefoxProfile()
    fp.set_preference("network.proxy.type", 1)  # 1 = manual proxy configuration
    fp.set_preference("network.proxy.http", PROXY_HOST)
    fp.set_preference("network.proxy.http_port", int(PROXY_PORT))
    # Override the User-Agent as well (put a real User-Agent string here)
    fp.set_preference("general.useragent.override", "whatever_useragent")
    fp.update_preferences()
    return webdriver.Firefox(firefox_profile=fp)
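
A rough way to combine this with the "only x pages per IP" idea; proxy_list, PAGES_PER_PROXY and next_batch_of_urls() are placeholders to adapt to your own setup:

PAGES_PER_PROXY = 100  # stay well below the observed ban threshold

proxy_list = [('10.0.0.1', '8080'), ('10.0.0.2', '3128')]  # example proxies

for host, port in proxy_list:
    driver = load_proxy(host, port)
    for url in next_batch_of_urls(PAGES_PER_PROXY):  # hypothetical helper yielding URLs
        driver.get(url)
        # ... extract data from driver.page_source here ...
    driver.quit()  # drop the session before switching to the next IP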
– Parsa (edited by Rahul)
  • Can you recommend a proxy service which I might use? – RhymeGuy Feb 01 '16 at 15:17
  • Thanks, the service looks okay, but not that cheap. I'm not even sure the money I'd spend on proxies would be covered by the value of the information I'd gather. I will have to think again. – RhymeGuy Feb 01 '16 at 15:26
  • If the pages you are searching for are cached by Google, you could search for them on Google and access the static version cached by the Google crawler. – Parsa Feb 03 '16 at 12:53
  • Unfortunately, the site uses a login form and most of the pages cannot be accessed without logging in, so Google cannot cache them. It seems that using a proxy service is the only reasonable option in this case. – RhymeGuy Feb 03 '16 at 14:13
  • How can we change the IP using the Chrome webdriver with Selenium and Python? – Mobin Al Hassan May 03 '20 at 12:50