20

I am using Python to scrape pages. Until now I hadn't had any complicated issues.

The site that I'm trying to scrape uses a lot of security checks and has some mechanism to prevent scraping.

Using Requests and lxml I was able to scrape about 100-150 pages before getting IP-banned. Sometimes I even get banned on the first request (a new IP, never used before, from a different C block). I have tried spoofing headers and randomizing the time between requests; still the same result.

I have tried Selenium and got much better results. With Selenium I was able to scrape about 600-650 pages before getting banned. Here I have also tried to randomize requests (a 3-5 second delay between them, plus a time.sleep(300) call on every 300th request). Despite that, I'm getting banned.
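
Roughly, the pattern looks like the sketch below (simplified; fetch_page() is just a placeholder for my actual Selenium page load):

import random
import time

def scrape_all(urls):
    for i, url in enumerate(urls, start=1):
        fetch_page(url)                   # placeholder for the actual Selenium page load
        time.sleep(random.uniform(3, 5))  # random 3-5 second pause between requests
        if i % 300 == 0:
            time.sleep(300)               # longer break on every 300th request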

From this I can conclude that the site has some mechanism that bans an IP if it requests more than X pages in one open browser session, or something like that.

Based on your experience, what else should I try? Would closing and reopening the browser in Selenium help (for example, closing and reopening the browser after every 100th request, roughly as sketched below)? I was thinking about trying proxies, but there are about a million pages and it would be very expensive.
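
For reference, the restart idea would look roughly like this (a sketch only; the batch size of 100 is just the value from my example above):

from selenium import webdriver

def scrape_with_restarts(urls, batch_size=100):
    driver = webdriver.Firefox()
    try:
        for i, url in enumerate(urls, start=1):
            driver.get(url)
            # ... extract data from driver.page_source here ...
            if i % batch_size == 0:
                driver.quit()                 # close the browser after every batch
                driver = webdriver.Firefox()  # and start a completely fresh session
    finally:
        driver.quit()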

– RhymeGuy (edited by theAlse)

3 Answers

17

If you switch to the Scrapy web-scraping framework, you will be able to reuse a number of things that were made to prevent and tackle banning:

  • the AutoThrottle extension:

    This is an extension for automatically throttling crawling speed based on the load of both the Scrapy server and the website you are crawling.

  • rotating User-Agent headers, e.g. via the scrapy-fake-useragent middleware:

    Use a random User-Agent provided by fake-useragent for every request.
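
For illustration, a minimal settings.py sketch combining the two could look like this (the delay values are placeholders, and the middleware path assumes the scrapy-fake-useragent package is installed):

# settings.py -- example values, tune them for the target site

# AutoThrottle: adapt the crawl speed to the server's response times
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 5
AUTOTHROTTLE_MAX_DELAY = 60
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0

# General politeness
DOWNLOAD_DELAY = 3
CONCURRENT_REQUESTS_PER_DOMAIN = 1

# Rotate User-Agent headers via scrapy-fake-useragent
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'scrapy_fake_useragent.middleware.RandomUserAgentMiddleware': 400,
}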

– alecxe
  • I'm not a fan of Scrapy, but I might give it a try, although I'm not sure it will help me. I have used all of the things you recommend and was not able to get past the limit. – RhymeGuy Feb 01 '16 at 15:16
  • @RhymeGuy it's just a general answer so that it may help others visiting the topic. In your case, I would say switching IPs via a proxy is the way to go. Thanks. – alecxe Feb 01 '16 at 15:18
12

I had this problem too. I used urllib with Tor in Python 3.

  1. Download and install the Tor Browser.
  2. Test Tor. Open a terminal and type:

curl --socks5-hostname localhost:9050 http://site-that-blocked-you.com

If you see a result, it worked.

  3. Now test it in Python by running this code:
import socks    # provided by the PySocks package
import socket
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup

# Route all sockets through the local Tor SOCKS5 proxy
socks.set_default_proxy(socks.SOCKS5, "localhost", 9050)
socket.socket = socks.socksocket

req = Request('http://check.torproject.org', headers={'User-Agent': 'Mozilla/5.0'})
html = urlopen(req).read()
soup = BeautifulSoup(html, 'html.parser')
print(soup('title')[0].get_text())

If you see

Congratulations. This browser is configured to use Tor.

then it worked in Python too, and you are using Tor for your web scraping.
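
If the exit node's IP itself gets banned, you can ask Tor for a fresh circuit (and usually a fresh exit IP). One common way, not covered by the steps above, is the stem library; this sketch assumes the ControlPort (9051) and a control password are enabled in your torrc, and 'my_password' is just a placeholder:

from stem import Signal
from stem.control import Controller

# assumes "ControlPort 9051" and a HashedControlPassword are set in torrc
with Controller.from_port(port=9051) as controller:
    controller.authenticate(password='my_password')  # placeholder password
    controller.signal(Signal.NEWNYM)                  # request a new Tor circuit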

– Mohammad Reza (edited by Gruber)
6

You could use proxies.

You can buy several hundred IPs very cheaply and use Selenium as you have been doing. I also suggest varying the browser you use and other User-Agent parameters.

You could iterate over your proxies, using each single IP address to load only x pages and stopping before getting banned (a usage sketch follows the function below).

from selenium import webdriver

def load_proxy(PROXY_HOST, PROXY_PORT):
    # Build a Firefox profile that sends traffic through the given HTTP proxy
    fp = webdriver.FirefoxProfile()
    fp.set_preference("network.proxy.type", 1)  # 1 = manual proxy configuration
    fp.set_preference("network.proxy.http", PROXY_HOST)
    fp.set_preference("network.proxy.http_port", int(PROXY_PORT))
    # Override the User-Agent as well (put a real User-Agent string here)
    fp.set_preference("general.useragent.override", "whatever_useragent")
    fp.update_preferences()
    return webdriver.Firefox(firefox_profile=fp)
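
A rough way to combine this with the "only x pages per IP" idea; proxy_list, PAGES_PER_PROXY and next_batch_of_urls() are placeholders to adapt to your own setup:

PAGES_PER_PROXY = 100  # stay well below the observed ban threshold

proxy_list = [('10.0.0.1', '8080'), ('10.0.0.2', '3128')]  # example proxies

for host, port in proxy_list:
    driver = load_proxy(host, port)
    for url in next_batch_of_urls(PAGES_PER_PROXY):  # hypothetical helper yielding URLs
        driver.get(url)
        # ... extract data from driver.page_source here ...
    driver.quit()  # drop the session before switching to the next IP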
– Parsa (edited by Rahul)
  • Can you recommend a proxy service which I might use? – RhymeGuy Feb 01 '16 at 15:17
  • Thanks, the service looks okay, but not that cheap. I'm not even sure the money I'd spend on proxies would be covered by the value of the information I'd gather. I will have to think again. – RhymeGuy Feb 01 '16 at 15:26
  • If the pages you are searching for are cached by Google, you could search for them on Google and access the static version cached by the Google crawler. – Parsa Feb 03 '16 at 12:53
  • Unfortunately, the site uses a login form and most of the pages cannot be accessed without logging in, so Google cannot cache them. It seems that using a proxy service is the only reasonable option in this case. – RhymeGuy Feb 03 '16 at 14:13
  • How can we change the IP using the Chrome webdriver with Selenium and Python? – Mobin Al Hassan May 03 '20 at 12:50