
I've written a script in scrapy in combination with selenium to make proxied requests using proxies newly generated by the get_proxies() method. I used the requests module to fetch the proxies in order to reuse them in the script. What I'm trying to do is parse all of the post links from the landing page and then fetch the title of each post from its target page.

The following script works inconsistently: when the get_random_proxy function happens to produce a usable proxy, the script works; otherwise it fails miserably.

How can I make my script keep trying with different proxies until it runs successfully?

This is what I've written so far:

import scrapy
import random
import requests
from itertools import cycle
from bs4 import BeautifulSoup
from selenium import webdriver
from scrapy.crawler import CrawlerProcess
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.support import expected_conditions as EC

def get_proxies():   
    response = requests.get("https://www.sslproxies.org/")
    soup = BeautifulSoup(response.text,"lxml")
    proxies = [':'.join([item.select_one("td").text,item.select_one("td:nth-of-type(2)").text]) for item in soup.select("table.table tr") if "yes" in item.text]
    return proxies

def get_random_proxy(proxy_vault):
    random.shuffle(proxy_vault)
    proxy_url = next(cycle(proxy_vault))
    return proxy_url

def start_script():
    proxy = get_proxies()
    chrome_options = webdriver.ChromeOptions()
    chrome_options.add_argument(f'--proxy-server={get_random_proxy(proxy)}')
    driver = webdriver.Chrome(options=chrome_options)
    return driver

class StackBotSpider(scrapy.Spider):
    name = "stackoverflow"

    start_urls = [
        'https://stackoverflow.com/questions/tagged/web-scraping'
    ]

    def __init__(self):
        self.driver = start_script()
        self.wait = WebDriverWait(self.driver, 10)

    def parse(self,response):
        self.driver.get(response.url)
        for elem in self.wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, ".summary .question-hyperlink"))):
            yield scrapy.Request(elem.get_attribute("href"),callback=self.parse_details)

    def parse_details(self,response):
        self.driver.get(response.url)
        for elem in self.wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, "h1[itemprop='name'] > a"))):
            yield {"post_title":elem.text}

c = CrawlerProcess({
    'USER_AGENT': 'Mozilla/5.0',   
})
c.crawl(StackBotSpider)
c.start()
  • The problem lies in your `CrawlerProcess`/`crawl()` method; you should wrap some of its internal calls with `try` and `except`. With that part of the code missing from your question, it is hard to help you out. – CodeSamurai777 May 13 '19 at 22:06
  • Your code uses scrapy and selenium, but scrapy does absolutely nothing here. You should either stick to selenium or scrapy, or research selenium integrations for scrapy. For scrapy proxy-management add-ons take a look at [scrapy-rotating-proxies](https://github.com/TeamHG-Memex/scrapy-rotating-proxies); for selenium, switching proxies at run time is a bit more complicated. – Granitosaurus May 14 '19 at 07:37
  • In the case of rotating proxies within scrapy, I find [this answer](https://stackoverflow.com/a/56008077/10568531) second to none; it doesn't depend on any add-ons and can be executed just the way it is @Granitosaurus. – robots.txt May 15 '19 at 17:01

3 Answers


You can make use of the requests library when selecting a random proxy to check whether or not the proxy is working. Loop through the proxies:

  1. Shuffle and pick (pop) a random proxy
  2. Check it with requests; if the request succeeds, return the proxy, otherwise go back to step 1

Change your get_random_proxy to something like this:

def get_random_proxy(proxy_vault):
    while proxy_vault:
        # Pick a random proxy and remove it from the vault so it is not retried.
        random.shuffle(proxy_vault)
        proxy_url = proxy_vault.pop()
        proxy_dict = {
            'http': proxy_url,
            'https': proxy_url
        }
        try:
            # Verify the proxy actually works before handing it over to Selenium.
            res = requests.get("http://example.com", proxies=proxy_dict, timeout=10)
            res.raise_for_status()
            return proxy_url
        except requests.RequestException:
            continue
    return None  # vault exhausted without finding a working proxy

If get_random_proxy returns None, that means none of the proxies are working. In that case omit the --proxy-server argument.

def start_script():
    proxy = get_proxies()
    chrome_options = webdriver.ChromeOptions()
    random_proxy = get_random_proxy(proxy)
    if random_proxy: # only when we successfully find a working proxy
        chrome_options.add_argument(f'--proxy-server={random_proxy}')
    driver = webdriver.Chrome(options=chrome_options)
    return driver
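
Separately, if the vetted proxy dies later in the crawl, the spider's explicit waits will simply time out. Below is a rough sketch of retrying a page with a freshly built driver, reusing start_script() above and the imports already present in the question; this is an assumption-laden sketch, not part of the code above:

def parse(self, response):
    for attempt in range(3):  # hypothetical retry budget
        try:
            self.driver.get(response.url)
            elems = self.wait.until(EC.presence_of_all_elements_located(
                (By.CSS_SELECTOR, ".summary .question-hyperlink")))
            break
        except TimeoutException:
            # A dead proxy usually shows up as a timed-out wait: restart Chrome
            # with another random proxy and try the same page again.
            self.driver.quit()
            self.driver = start_script()
            self.wait = WebDriverWait(self.driver, 10)
    else:
        return  # every attempt timed out; give up on this page
    for elem in elems:
        yield scrapy.Request(elem.get_attribute("href"), callback=self.parse_details)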
Mezbaul Haque
  • `get_random_proxy()` will never fail to produce some proxy unless there is a connection error. I've tested it several times. You need to work on another angle to fix the issue @Mezba メ. Thanks. – robots.txt Jun 09 '19 at 17:38
  • @robots.txt You want to make sure that you get a working proxy, isn't that the goal? `get_random_proxy` takes care of that by testing random proxies. As long as `requests.get` is working using a proxy, it should not fail. I'm not really sure what you really want here. – Mezbaul Haque Jun 09 '19 at 17:49
  • I got you wrong @Mezba メ. Your solution seems to be working. Get back to you once I'm done checking on it. Thanks. – robots.txt Jun 09 '19 at 18:01

As you have tagged selenium, using only Selenium you can make proxied requests with the newly active proxies listed in the Free Proxy List, using the following solution:

Note: This program will invoke the proxies from the Proxy List one by one until a successful proxied connection is established and verified through the Proxy Check page of https://www.whatismyip.com/

  • Code Block:

    from selenium import webdriver
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.common.exceptions import TimeoutException
    
    options = webdriver.ChromeOptions()
    options.add_argument('start-maximized')
    options.add_argument('disable-infobars')
    options.add_argument('--disable-extensions')
    driver = webdriver.Chrome(options=options, executable_path=r'C:\WebDrivers\chromedriver.exe')
    driver.get("https://sslproxies.org/")
    driver.execute_script("return arguments[0].scrollIntoView(true);", WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//table[@class='table table-striped table-bordered dataTable']//th[contains(., 'IP Address')]"))))
    ips = [my_elem.get_attribute("innerHTML") for my_elem in WebDriverWait(driver, 5).until(EC.visibility_of_all_elements_located((By.XPATH, "//table[@class='table table-striped table-bordered dataTable']//tbody//tr[@role='row']/td[position() = 1]")))]
    ports = [my_elem.get_attribute("innerHTML") for my_elem in WebDriverWait(driver, 5).until(EC.visibility_of_all_elements_located((By.XPATH, "//table[@class='table table-striped table-bordered dataTable']//tbody//tr[@role='row']/td[position() = 2]")))]
    driver.quit()
    proxies = [ip + ':' + port for ip, port in zip(ips, ports)]
    print(proxies)
    for i in range(0, len(proxies)):
        try:
            print("Proxy selected: {}".format(proxies[i]))
            options = webdriver.ChromeOptions()
            options.add_argument('--proxy-server={}'.format(proxies[i]))
            driver = webdriver.Chrome(options=options, executable_path=r'C:\WebDrivers\chromedriver.exe')
            driver.get("https://www.whatismyip.com/proxy-check/?iref=home")
            if "Proxy Type" in WebDriverWait(driver, 5).until(EC.visibility_of_element_located((By.CSS_SELECTOR, "p.card-text"))):
                break
        except Exception:
            driver.quit()
    print("Proxy Invoked")
    
  • Console Output:

    ['190.7.158.58:39871', '175.139.179.65:54980', '186.225.45.146:45672', '185.41.99.100:41258', '43.230.157.153:52986', '182.23.32.66:30898', '36.37.160.253:31450', '93.170.15.214:56305', '36.67.223.67:43628', '78.26.172.44:52490', '36.83.135.183:3128', '34.74.180.144:3128', '206.189.122.177:3128', '103.194.192.42:55546', '70.102.86.204:8080', '117.254.216.97:23500', '171.100.221.137:8080', '125.166.176.153:8080', '185.146.112.24:8080', '35.237.104.97:3128']
    
    Proxy selected: 190.7.158.58:39871
    
    Proxy selected: 175.139.179.65:54980
    
    Proxy selected: 186.225.45.146:45672
    
    Proxy selected: 185.41.99.100:41258
    
undetected Selenium
  • The question is not about how I can rotate proxies with selenium (I already know that); rather, it is about how I can do the same when combined with scrapy. Hope you got me @DebanjanB. Thanks. – robots.txt Jun 09 '19 at 06:47
  • @robots.txt As you have tagged [tag:selenium] hence I have tried to construct a canonical solution to help you out purely based on [Selenium](https://docs.seleniumhq.org/). However, _Scrapy_ is based on _requests_ and can perform the same job in a similar fashion. – undetected Selenium Jun 09 '19 at 15:14

You can try using scrapy-rotating-proxies.

Here is another reference that could be helpful to you: https://www.scrapehero.com/how-to-rotate-proxies-and-ip-addresses-using-python-3/

Check the part:

DOWNLOADER_MIDDLEWARES = {
    'rotating_proxies.middlewares.RotatingProxyMiddleware': 610,
}

ROTATING_PROXY_LIST = [
    'proxy1.com:8000',
    'proxy2.com:8031',
    # ...
]
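
For instance, a minimal sketch of combining this with the question's get_proxies() helper might look like the following (assuming the scrapy-rotating-proxies package is installed; the BanDetectionMiddleware entry follows that package's documented setup, and the import path is hypothetical):

# settings.py -- sketch only
from myproject.proxies import get_proxies  # hypothetical location of the question's helper

DOWNLOADER_MIDDLEWARES = {
    'rotating_proxies.middlewares.RotatingProxyMiddleware': 610,
    'rotating_proxies.middlewares.BanDetectionMiddleware': 620,
}

# Build the proxy list once at startup from the same free-proxy source used in the question.
ROTATING_PROXY_LIST = get_proxies()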

Try this in your settings and you will surely get what you want. Hope this helps.