Here's the complete program:
import requests
from bs4 import BeautifulSoup as BS
from concurrent.futures import ThreadPoolExecutor
import time

TIMEOUT = 0.2
URL = 'http://ip-api.com/json'
FPURL = 'https://free-proxy-list.net'
CLASS = 'table table-striped table-bordered'

def try_proxy(td):
    try:
        host = td[0].text
        proxy = f'{host}:{td[1].text}'
        (r := requests.get(URL, proxies={'http': proxy}, timeout=TIMEOUT)).raise_for_status()
        if r.json().get('query', '') == host:
            # the API saw the request arriving from the proxy's IP, so it works
            return proxy
    except Exception:
        # unreachable proxy, timeout, HTTP error, or a header row with no <td>
        pass

(r := requests.get(FPURL)).raise_for_status()
proxies = [tr('td') for tr in BS(r.text, 'lxml').find('table', class_=CLASS)('tr')]
print(f'Checking {len(proxies)} proxies')
s = time.perf_counter()
with ThreadPoolExecutor() as executor:
    for f in executor.map(try_proxy, proxies):
        if f is not None:
            print(f)
    print(f'max_workers = {executor._max_workers}')
e = time.perf_counter()
print(f'{e-s:.2f}s')
The objective is to build a list of publicly available (free) proxies that are both reachable and reasonably fast.
The candidate list is obtained by scraping free-proxy-list.net, which currently yields 301 potential proxy addresses. Then, using the requests module, each candidate is used to call an API (ip-api.com/json) that reports the origin of the request. That lets us prove (or disprove) that the specified proxy was actually used.
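For reference, that verification hinges on ip-api.com/json echoing the requesting IP back in its 'query' field. The check can be sketched offline like this (proxy_confirmed is a helper I've invented for illustration, and the sample payload uses made-up documentation addresses):

```python
import json

# Illustrative subset of the JSON that ip-api.com/json returns;
# the IP address here is a made-up documentation value.
sample = json.loads('{"status": "success", "query": "203.0.113.45"}')

def proxy_confirmed(response_json, proxy_host):
    # The proxy really carried the request only if the API saw it
    # arriving from the proxy's own IP address.
    return response_json.get('query', '') == proxy_host

print(proxy_confirmed(sample, '203.0.113.45'))  # True: origin matches the proxy
print(proxy_confirmed(sample, '198.51.100.7'))  # False: the request leaked our own IP
```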
Since the work is almost entirely I/O-bound, multithreading seems the optimum approach to get the job done as fast as possible.
The program works and executes (as written) in ~2.9 seconds.
Now to my point...
Since Python 3.8 the default strategy for max_workers has been min(32, os.cpu_count() + 4) which, in my case, equates to 24 because os.cpu_count() == 20 on a 10-core Xeon processor.
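That default can be reproduced directly from the formula (default_max_workers is just a throwaway helper mirroring CPython's sizing rule, not part of any API):

```python
def default_max_workers(cpu_count):
    # The sizing rule ThreadPoolExecutor has used since Python 3.8.
    return min(32, cpu_count + 4)

print(default_max_workers(20))  # 24 -- the value on this 20-logical-core Xeon
print(default_max_workers(64))  # 32 -- the cap kicks in on larger machines
```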
However, I have determined empirically that giving an explicit max_workers value of 100 reduces the execution time to <1s. Clearly, that is very significant.
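A self-contained illustration of why the explicit value helps: with work that is pure waiting, throughput scales with the number of threads until the pool covers the workload. Here fake_probe is a stand-in I invented for try_proxy; it sleeps instead of hitting the network, so no live sites are involved.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fake_probe(n):
    # Stand-in for try_proxy: mostly waiting on I/O, not burning CPU.
    time.sleep(0.05)
    return n

tasks = range(300)  # roughly the size of the scraped proxy list
start = time.perf_counter()
with ThreadPoolExecutor(max_workers=100) as executor:  # explicit, not the default
    results = list(executor.map(fake_probe, tasks))
elapsed = time.perf_counter() - start
# 300 sleeps of 0.05 s spread over 100 threads run in ~3 waves,
# so the wall-clock time is a small fraction of the 15 s serial cost.
print(f'{len(results)} tasks in {elapsed:.2f}s')
```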
For reasons of portability I have long argued that, unless there is a specific reason to set max_workers explicitly (e.g. to a deliberately low value), it should be left at its default. But that policy evidently costs significant performance in scenarios like this one.
Is there a better (portable) way to optimise performance for this and similar applications?