I've created a Python script that rotates proxies in order to fetch valid responses from some links. The function get_proxy_list() produces proxies from a source; however, I've hardcoded 5 proxies within that function for brevity.
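
For context, the real get_proxy_list() scrapes a public proxy site (which is why BeautifulSoup is imported in the script below). A simplified sketch of that version follows; the source URL and the table layout it assumes are illustrative only, not the exact site I use:

import requests
from bs4 import BeautifulSoup

def get_proxy_list():
    # illustrative source URL; the real site and its markup may differ
    res = requests.get('https://free-proxy-list.net/', timeout=10)
    soup = BeautifulSoup(res.text, 'html.parser')
    proxy_list = []
    # assumes each table row holds an IP in the first cell and a port in the second
    for row in soup.select('table tbody tr'):
        cells = row.find_all('td')
        if len(cells) >= 2:
            proxy_list.append(f'{cells[0].get_text(strip=True)}:{cells[1].get_text(strip=True)}')
    return proxy_list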
There are two more functions: validate_proxies() and fetch_response(). validate_proxies() filters the working proxies out of the raw list generated by get_proxy_list(). Finally, fetch_response() uses those working proxies to fetch valid responses from the list of URLs I have.
I don't know whether validate_proxies() should be of any use at all, because I could use the raw proxies directly within fetch_response(). Moreover, most free proxies are short-lived, so by the time I've finished filtering the raw list, the working proxies may already be dead. Either way, the script runs very slowly even when it finds and uses working proxies.
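
One idea I've had to speed things up is to validate all the proxies concurrently rather than one at a time. A minimal, untested sketch of what I have in mind, using concurrent.futures and a hypothetical check_proxy() helper (the worker count is arbitrary):

import requests
from concurrent.futures import ThreadPoolExecutor

def check_proxy(proxy_url, link):
    # returns the proxy address if it answers with a 200, otherwise None
    proxy = {'https': f'http://{proxy_url}'}
    try:
        res = requests.get(link, proxies=proxy, timeout=5)
        if res.status_code == 200:
            return proxy_url
    except requests.RequestException:
        pass
    return None

def validate_proxies_concurrently(proxies, link):
    # check every proxy in parallel instead of sequentially
    with ThreadPoolExecutor(max_workers=10) as executor:
        results = executor.map(lambda p: check_proxy(p, link), proxies)
    return [p for p in results if p]

That way one slow or dead proxy would no longer block the checks behind it, so validation should take roughly as long as the slowest single check rather than the sum of all of them. I'm not sure whether this is the right approach, though.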
Here's what I've tried:
import random
import requests
from bs4 import BeautifulSoup  # used by the real get_proxy_list(), which scrapes its source

validation_link = 'https://icanhazip.com/'

target_links = [
    'https://stackoverflow.com/questions/tagged/web-scraping',
    'https://stackoverflow.com/questions/tagged/vba',
    'https://stackoverflow.com/questions/tagged/java'
]

working_proxies = []

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.150 Safari/537.36'
}


def get_proxy_list():
    # hardcoded for brevity; the real function fetches these from a source
    proxy_list = ['198.24.171.26:8001', '187.130.139.197:8080', '159.197.128.8:3128', '119.28.56.116:808', '85.15.152.39:3128']
    return proxy_list


def validate_proxies(proxies, link):
    # pop proxies at random and keep whichever ones return a 200
    proxy_url = proxies.pop(random.randrange(len(proxies)))
    while True:
        proxy = {'https': f'http://{proxy_url}'}
        try:
            res = requests.get(link, proxies=proxy, headers=headers, timeout=5)
            assert res.status_code == 200
            working_proxies.append(proxy_url)
            if not proxies: break
            proxy_url = proxies.pop(random.randrange(len(proxies)))
        except Exception as e:
            print("error raised as:", str(e))
            if not proxies: break
            proxy_url = proxies.pop(random.randrange(len(proxies)))
    return working_proxies


def fetch_response(proxies, url):
    # try proxies at random until one returns a 200 or the list runs out
    proxy_url = proxies.pop(random.randrange(len(proxies)))
    while True:
        proxy = {'https': f'http://{proxy_url}'}
        try:
            resp = requests.get(url, proxies=proxy, headers=headers, timeout=7)
            assert resp.status_code == 200
            return resp
        except Exception as e:
            print("error thrown as:", str(e))
            if not proxies: return
            proxy_url = proxies.pop(random.randrange(len(proxies)))


if __name__ == '__main__':
    proxies = get_proxy_list()
    working_proxy_list = validate_proxies(proxies, validation_link)
    print("working proxy list:", working_proxy_list)
    for target_link in target_links:
        print(fetch_response(working_proxy_list, target_link))
Question: what is the right way to rotate proxies within a script so that execution is faster?