
I have built a Python script that uses Selenium to web-scrape. This script needs to run for hours at a time. I am only scraping one website in particular, and so far I have been able to scrape peacefully by just rotating the browser User Agent from a pool of 1,000 agents.
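Roughly, the User Agent rotation looks like the sketch below (a simplified illustration rather than my actual code; the pool shown is a placeholder for the real 1,000-agent list):

import random
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ...",          # placeholder entries
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) ...",
]

def make_driver():
    options = Options()
    # pass a randomly chosen User Agent to Chrome for this browser instance
    options.add_argument(f"--user-agent={random.choice(USER_AGENTS)}")
    return webdriver.Chrome(options=options)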

However, I just scaled the script up using multi-threading, and suddenly all of my attempts to visit the website fail due to CAPTCHAs.

Apparently, rotating proxies is the answer. How can I rotate proxies with Selenium?

carpa_jo
  • Multi-threading plus Selenium? I don't really see how that can work. – pguardiario Apr 16 '20 at 00:46
  • Elaborate, it is working fine with concurrent.futures in Python (a sketch of this setup follows these comments). – Ludovico Verniani Apr 16 '20 at 02:21
  • But only one webdriver will be active at a time, so you won't get concurrency in terms of network IO – pguardiario Apr 16 '20 at 03:59
  • I didn't know that. Are you sure only one web-driver can be active at a time? Because when I run my program, since I have 10 workers, I see 10 chrome windows pop up and begin scraping. – Ludovico Verniani Apr 16 '20 at 04:40
  • I think you'll find that it doesn't go faster than the single webdriver, but you should post back with your experience. – pguardiario Apr 16 '20 at 05:23
  • And also FTR if you want true concurrency with full browsers you can try pyppeteer which AFAIK is the only option for Python. – pguardiario Apr 16 '20 at 07:18
  • Not sure how pyppeteer is different than the normal chrome webdriver but I plan to push my scraper to AWS soon so we'll see if their super computers can scrape quickly with multiple webdriver instances opening and closing. – Ludovico Verniani Apr 16 '20 at 16:48
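For readers following this comment thread, here is a minimal sketch of the setup being described: one independent webdriver per worker via concurrent.futures (the URLs and scrape logic are placeholders, not the asker's actual code):

from concurrent.futures import ThreadPoolExecutor
from selenium import webdriver

URLS = ["https://example.com/page1", "https://example.com/page2"]  # placeholder URLs

def scrape(url):
    driver = webdriver.Chrome()  # each worker drives its own browser instance
    try:
        driver.get(url)
        return driver.title
    finally:
        driver.quit()

with ThreadPoolExecutor(max_workers=10) as pool:
    results = list(pool.map(scrape, URLS))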

1 Answer

One way to do it is by using http_request_randomizer (explanation in the code comments). As you may know, free public proxies are highly unreliable, insecure, and prone to getting banned, so I wouldn't recommend this method for a serious project or in production.

from http_request_randomizer.requests.proxy.requestProxy import RequestProxy
from selenium import webdriver
req_proxy = RequestProxy()  # the number of proxies returned may differ each time you run this
proxies = req_proxy.get_proxy_list()  # fetch the list of free public proxies

PROXY = proxies[5].get_address()  # select the 6th proxy from the list; you could also pick one at random
print(proxies[5].country)

# set the proxy capabilities globally before creating the driver
webdriver.DesiredCapabilities.CHROME['proxy'] = {
    "httpProxy": PROXY,
    "ftpProxy": PROXY,
    "sslProxy": PROXY,
    "proxyType": "MANUAL",
}
driver = webdriver.Chrome()

driver.get('https://www.expressvpn.com/what-is-my-ip')
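To actually rotate rather than pin a single entry, you can pick a fresh proxy each time you create a driver. A rough sketch building on the snippet above (the helper name is made up, and it reuses the proxies list and webdriver import from before):

import random

def driver_with_random_proxy(proxies):
    # pick a proxy at random; with free lists, expect many dead entries and retries
    proxy = random.choice(proxies).get_address()
    webdriver.DesiredCapabilities.CHROME['proxy'] = {
        "httpProxy": proxy,
        "sslProxy": proxy,
        "proxyType": "MANUAL",
    }
    return webdriver.Chrome()

driver = driver_with_random_proxy(proxies)
driver.get('https://www.expressvpn.com/what-is-my-ip')
driver.quit()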

The best way to do this involves a paid proxy service. I'm currently using https://luminati.io/ in a production environment and their service is very reliable; it also rotates your IP automatically and frequently (almost every request).

See:

Luminati

how to set proxy with authentication in selenium chromedriver python?
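Note that Chrome won't accept username/password proxy credentials through DesiredCapabilities alone; the linked question above covers the extension-based workaround. Another option (not part of the original snippet, shown here only as a sketch with placeholder credentials) is the third-party selenium-wire package, which accepts an authenticated proxy URL directly:

from seleniumwire import webdriver  # pip install selenium-wire

options = {
    'proxy': {
        # placeholder credentials/host: substitute your provider's values
        'http': 'http://USERNAME:PASSWORD@proxy.example.com:8080',
        'https': 'https://USERNAME:PASSWORD@proxy.example.com:8080',
        'no_proxy': 'localhost,127.0.0.1',
    }
}
driver = webdriver.Chrome(seleniumwire_options=options)
driver.get('https://www.expressvpn.com/what-is-my-ip')

selenium-wire routes the browser's traffic through a local proxy it manages, which is how it can handle the authentication handshake for you.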

Thaer A
  • You're right, free proxies are not the way to go. This is a project I am pursuing for fun, but if successful I am willing to put more money into it. Is it better to use proxy providers that give you a list of proxies (e.g. 1,000) to rotate through, or to use a proxy provider like SmartProxy that gives you unlimited proxies but charges you based on GB usage per month? – Ludovico Verniani Apr 15 '20 at 21:14
  • @LudovicoVerniani I never used a proxy provider that gives a list of proxies. The service I'm using is similar to SmartProxy; it charges per GB. I guess it all depends on your project. With my project I'm using requests/BS4 and pulling up mainly text pages. It's costing me less than $5/mo. But if you're pulling pages with lots of images or video, a list of reliable proxies is probably the better choice. – Thaer A Apr 16 '20 at 06:38
  • The site I am scraping from, in theory, is all text. However, they have many advertisements that are in the form of videos. You think it's possible to tell Selenium to load the 'text' only? – Ludovico Verniani Apr 16 '20 at 16:50