
I've created a Python script that rotates proxies to fetch valid responses from some links. The function get_proxy_list() produces proxies from a source; however, I've hardcoded 5 proxies within that function for brevity.

There are two more functions, validate_proxies() and fetch_response(). The function validate_proxies() filters the working proxies out of the list of raw proxies generated by get_proxy_list().

Finally, fetch_response() uses those working proxies to fetch valid responses from the list of URLs I have.

I don't know whether validate_proxies() is of any use at all, because I could use the raw proxies directly within fetch_response(). Moreover, most free proxies are short-lived, so by the time I've filtered the raw proxies, the working ones may already be dead. On top of that, the script runs very slowly even when it finds and uses working proxies.

I've tried with:

import random
import requests
from bs4 import BeautifulSoup

validation_link = 'https://icanhazip.com/'

target_links = [
    'https://stackoverflow.com/questions/tagged/web-scraping',
    'https://stackoverflow.com/questions/tagged/vba',
    'https://stackoverflow.com/questions/tagged/java'
]

working_proxies = []

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.150 Safari/537.36'
}

def get_proxy_list():
    proxy_list = ['198.24.171.26:8001','187.130.139.197:8080','159.197.128.8:3128','119.28.56.116:808','85.15.152.39:3128']
    return proxy_list


def validate_proxies(proxies,link):
    proxy_url = proxies.pop(random.randrange(len(proxies)))
    while True:
        proxy = {'https': f'http://{proxy_url}'}
        try:
            res = requests.get(link,proxies=proxy,headers=headers,timeout=5)
            assert res.status_code==200
            working_proxies.append(proxy_url)
            if not proxies: break
            proxy_url = proxies.pop(random.randrange(len(proxies)))
        except Exception as e:
            print("error raised as:",str(e))
            if not proxies: break
            proxy_url = proxies.pop(random.randrange(len(proxies)))

    return working_proxies


def fetch_response(proxies,url):
    proxy_url = proxies.pop(random.randrange(len(proxies)))

    while True:
        proxy = {'https': f'http://{proxy_url}'}
        try:
            resp = requests.get(url, proxies=proxy, headers=headers, timeout=7)
            assert resp.status_code==200
            return resp
        except Exception as e:
            print("error thrown as:",str(e))
            if not proxies: return 
            proxy_url = proxies.pop(random.randrange(len(proxies)))


if __name__ == '__main__':
    proxies = get_proxy_list()
    working_proxy_list = validate_proxies(proxies,validation_link)

    print("working proxy list:",working_proxy_list)

    for target_link in target_links:
        print(fetch_response(working_proxy_list,target_link))

Question: what is the right way to use rotation of proxies within a script in order to make the execution faster?

SMTH
  • *ideal way* is highly opinionated. What exactly is the issue you're having? – baduker Jun 15 '21 at 12:07
  • I'm not having any issue. I wish to know which way I should stick with. Replaced `ideal way` with `right way` by the way. Thanks. – SMTH Jun 15 '21 at 13:33
  • So, what's *not right* with the way you're using? – baduker Jun 15 '21 at 13:40
  • @baduker The fact that every unavailable proxy can make it wait up to 5 seconds seems not right. They can (and should) be checked in parallel. – Will Da Silva Jun 19 '21 at 17:03

1 Answer


I've made a few changes to your code that will hopefully help you:

  • Since you mentioned that the proxies are short-lived, the code now fetches new proxies and checks if they work on every request.
  • Checking whether proxies work is now done in parallel using a concurrent.futures.ThreadPoolExecutor. This means that instead of waiting up to 5 seconds for each proxy check to time out, you will wait at most 5 seconds for all of them to time out.
  • Instead of randomly choosing a proxy, the first proxy that is found to be working is used.
  • Type hints have been added.
import itertools as it
from concurrent.futures import ThreadPoolExecutor, TimeoutError
from typing import Dict

from bs4 import BeautifulSoup
import requests


Proxy = Dict[str, str]

executor = ThreadPoolExecutor()

validation_link = 'https://icanhazip.com/'

target_links = [
    'https://stackoverflow.com/questions/tagged/web-scraping',
    'https://stackoverflow.com/questions/tagged/vba',
    'https://stackoverflow.com/questions/tagged/java'
]

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.150 Safari/537.36'
}


def get_proxy_list():
    # Scrape free HTTPS 'elite proxy' entries from sslproxies.org for requests.
    response = requests.get('https://www.sslproxies.org/')
    soup = BeautifulSoup(response.text, "html.parser")
    proxies = [
        ':'.join([item.select_one('td').text, item.select_one('td:nth-of-type(2)').text])
        for item in soup.select('table.table tr')
        if 'yes' in item.text and 'elite proxy' in item.text
    ]
    return [{'https': f'http://{x}'} for x in proxies]


def validate_proxy(proxy: Proxy) -> Proxy:
    res = requests.get(validation_link, proxies=proxy, headers=headers, timeout=5)
    assert 200 == res.status_code
    return proxy


def get_working_proxy() -> Proxy:
    # Submit one validation job per proxy and poll the futures round-robin,
    # returning the first proxy whose check succeeds.
    futures = [executor.submit(validate_proxy, x) for x in get_proxy_list()]
    for i in it.count():
        future = futures[i % len(futures)]
        try:
            working_proxy = future.result(timeout=0.01)
            # A working proxy was found; cancel any checks that haven't started yet.
            for f in futures:
                f.cancel()
            return working_proxy
        except TimeoutError:
            # This check hasn't finished yet; move on to the next one.
            continue
        except Exception:
            # The check failed, so this proxy is dead; drop it from the pool.
            futures.remove(future)
            if not len(futures):
                raise Exception('No working proxies found') from None


def fetch_response(url: str) -> requests.Response:
    res = requests.get(url, proxies=get_working_proxy(), headers=headers, timeout=7)
    assert res.status_code == 200
    return res

Usage:

>>> get_working_proxy()
{'https': 'http://119.81.189.194:80'}
>>> get_working_proxy()
{'https': 'http://198.50.163.192:3129'}
>>> get_working_proxy()
{'https': 'http://191.241.145.22:6666'}
>>> get_working_proxy()
{'https': 'http://169.57.1.84:8123'}
>>> get_working_proxy()
{'https': 'http://182.253.171.31:8080'}

In each case, one of the proxies with the lowest latency is returned.

If you want to make the code even more efficient, and you can be almost certain that a working proxy will still be working in some short amount of time (e.g. 30 seconds), then you can upgrade this by putting the proxies into a TTL cache, and repopulating it as necessary, rather than finding a working proxy every time you call fetch_response. See https://stackoverflow.com/a/52128389/5946921 for how to implement a TTL cache in Python.
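
As an illustration of that idea, here is a minimal sketch built on the third-party cachetools package (one possible approach, not necessarily the one in the linked answer); the cached_working_proxy wrapper and the 30-second TTL are assumptions, and it reuses get_working_proxy, Proxy, and headers from the code above:

from cachetools import TTLCache, cached


# Hypothetical wrapper: reuse the same working proxy for up to 30 seconds
# before searching for a new one (assumes get_working_proxy from above).
@cached(TTLCache(maxsize=1, ttl=30))
def cached_working_proxy() -> Proxy:
    return get_working_proxy()


def fetch_response(url: str) -> requests.Response:
    # Same as above, but reuses the cached proxy instead of validating a
    # fresh proxy list on every call.
    res = requests.get(url, proxies=cached_working_proxy(), headers=headers, timeout=7)
    assert res.status_code == 200
    return res

Note that if the cached proxy dies before the TTL expires, fetch_response will keep failing until the cache entry expires, so the TTL should stay short.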

Will Da Silva
  • It appears to be a nice idea to go with @Will Da Silva. The problem I noticed is that when I use a few more proxies in the list, the function `get_working_proxy()` returns the first one successfully. However, it then gets stuck (not returning any result) even when I can see that the `validate_proxy()` function produces working ones (you can see a 200 status right next to the working proxies in the image). This [image](https://imgur.com/hWqRqES) represents what I meant and this is [the script](https://pastebin.com/PTbK8b1M) I tested with. Let me know if I got it all wrong. Thanks. – SMTH Jun 19 '21 at 09:43
  • @SMTH I've updated the code. It now uses a `ThreadPoolExecutor` instead of a `ThreadPool` so that the remaining jobs can be cancelled after a working proxy has been found. As you can see from the usage section I added, it works, and can be called multiple times, each time returning a working proxy among those with the lowest latency. – Will Da Silva Jun 19 '21 at 17:33
  • Thanks for adding another alternative. It seems I found success using a `with` block in your earlier implementation. [This is](https://pastebin.com/LY1MNM7H) how I did it. Let me know if I did it wrong. Thanks again. – SMTH Jun 19 '21 at 17:39
  • @SMTH Happy to hear you've got it working. What you did there was the other way I was thinking of solving the issue you mentioned. I decided against it because creating/destroying the thread pool is somewhat expensive compared to using the same one across all calls. In any case, profiling would need to be done to determine what the fastest method is, and how big a difference there is. – Will Da Silva Jun 19 '21 at 17:43
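
For reference, here is a minimal sketch of the per-call with-block variant discussed in the last two comments (an assumption based on the comments, not a reproduction of the linked pastebin), using concurrent.futures.as_completed instead of polling and reusing validate_proxy, get_proxy_list, and Proxy from the answer above:

from concurrent.futures import ThreadPoolExecutor, as_completed


def get_working_proxy() -> Proxy:
    # Per-call executor variant: the pool is created and torn down on every call,
    # which is simpler but adds thread start-up overhead compared to the shared
    # module-level executor above, and leaving the with block waits for any
    # still-running checks to finish.
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(validate_proxy, x) for x in get_proxy_list()]
        for future in as_completed(futures):
            try:
                working_proxy = future.result()
                # A working proxy was found; cancel checks that haven't started.
                for f in futures:
                    f.cancel()
                return working_proxy
            except Exception:
                # This proxy failed its check; try the next completed future.
                continue
    raise Exception('No working proxies found')

As the last comment notes, whether this is faster than keeping one shared executor would need to be settled by profiling.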