
Here's the complete program:

import requests
from bs4 import BeautifulSoup as BS
from concurrent.futures import ThreadPoolExecutor
import time

TIMEOUT = 0.2
URL = 'http://ip-api.com/json'
FPURL = 'https://free-proxy-list.net'
CLASS = 'table table-striped table-bordered'

def try_proxy(td):
    """Return 'host:port' if the proxy answers within TIMEOUT, else None."""
    try:
        host = td[0].text
        proxy = f'{host}:{td[1].text}'
        (r := requests.get(URL, proxies={'http': proxy}, timeout=TIMEOUT)).raise_for_status()
        # ip-api.com reports the requesting IP in 'query'; a match proves
        # the request really went through the proxy
        if r.json().get('query', '') == host:
            return proxy
    except Exception:
        pass  # unreachable, too slow, or a malformed row (e.g. the header row)

(r := requests.get(FPURL)).raise_for_status()

# Each table row becomes a list of <td> cells: [IP, port, country, ...]
proxies = [tr('td') for tr in BS(r.text, 'lxml').find('table', class_=CLASS)('tr')]

print(f'Checking {len(proxies)} proxies')
s = time.perf_counter()
with ThreadPoolExecutor() as executor:
    for f in executor.map(try_proxy, proxies):
        if f is not None:
            print(f)
    print(f'max_workers = {executor._max_workers}')
    
e = time.perf_counter()
print(f'{e-s:.2f}s')

The objective here is to build a list of publicly available (free) proxies that are actually reachable and respond in a timely manner.

That list is obtained by scraping free-proxy-list.net, which currently reveals 301 potential proxy addresses. Then, using the requests module, we call an API (ip-api.com/json) that responds with information about the origin of the request. That way we can confirm (or refute) that the specified proxy was actually used.

Multithreading seems to be the optimal approach to get the job done as quickly as possible.

The program works and executes (as written) in ~2.9 seconds.

Now to my point...

Since Python 3.8 the default strategy for max_workers has been min(32, os.cpu_count() + 4) which, in my case, equates to 24 because os.cpu_count() == 20 on a 10-core Xeon processor.
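For reference, that default can be computed directly (a small sketch; note that recent CPython versions use os.process_cpu_count() rather than os.cpu_count() for this, so the value may differ there):

```python
import os

# Default ThreadPoolExecutor sizing since Python 3.8:
# min(32, os.cpu_count() + 4). Guard against cpu_count() returning None.
cpus = os.cpu_count() or 1
default_workers = min(32, cpus + 4)
print(f'{cpus} logical CPUs -> default max_workers = {default_workers}')

# On a 20-logical-CPU machine this gives min(32, 24) == 24.
```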

However, I have determined empirically that giving an explicit max_workers value of 100 reduces the execution time to <1s. Clearly, that is very significant.

For reasons of portability I have long argued that, unless there's a specific reason to set max_workers explicitly, it should be left at its default. But that approach appears to cost performance in scenarios like this one.

Is there a better (portable) way to optimise performance for this and similar applications?

DarkKnight
  • This answer might help: https://stackoverflow.com/questions/68226294/thread-pool-executor-using-concurrent-no-improvement-for-various-number-of-work – Prats Feb 10 '22 at 16:41

1 Answer


No, there isn't.

Generally, ThreadPoolExecutor is a good fit for tasks that involve a lot of I/O, while ProcessPoolExecutor does better when a lot of computational power is needed, because ThreadPoolExecutor is limited by the Python Global Interpreter Lock (GIL) (among other resources, check this).
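A minimal sketch of why threads still pay off for I/O-bound work despite the GIL: blocking calls such as time.sleep here (or a real socket read) release the GIL, so the waits overlap:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def io_task(_):
    time.sleep(0.05)  # stand-in for a network wait; releases the GIL

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=10) as ex:
    list(ex.map(io_task, range(10)))
threaded = time.perf_counter() - start

# Ten overlapping 50 ms waits finish in roughly 50 ms, not 500 ms
print(f'{threaded:.2f}s')
```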

If execution time really matters, devise your own benchmarks and/or use a profiler, and customise them according to your needs and data.

For really large data sets the time savings can be worth creating your own benchmarks. It might also be the case, for instance, that for data sorted in a certain way one number of workers is optimal at the beginning of the data set, while towards the end a different number is optimal.

For instance, you could do:

import concurrent.futures
import glob
import time

def manipulate_data_function(data):
    result = torture_data(data)  # torture_data: your actual (CPU-heavy) processing
    return result

timings_1 = []
for workers in range(32, 1, -1):
    start_1 = time.perf_counter()
    with concurrent.futures.ProcessPoolExecutor(max_workers=workers) as executor:
        futures = [executor.submit(manipulate_data_function, file)
                   for file in glob.glob('*.txt')]
    # leaving the with block waits for every submitted future to complete
    timings_1.append(time.perf_counter() - start_1)

timings_1 then holds the duration of the run for each worker count, from 32 down to 2 (note that range(32, 1, -1) stops at 2, not 1).
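Once such a sweep has run, the timings can be reduced to a single suggested value by pairing each timing with its worker count and keeping the fastest. An illustrative, self-contained version with a dummy task:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def dummy_task(n):
    time.sleep(0.001)  # stand-in for real work
    return n

worker_range = list(range(8, 0, -1))
timings = []
for workers in worker_range:
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as ex:
        list(ex.map(dummy_task, range(50)))
    timings.append(time.perf_counter() - start)

# Pair each timing with its worker count and keep the fastest
best_workers = min(zip(timings, worker_range))[1]
print(f'fastest sweep used max_workers = {best_workers}')
```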

Konstantinos
  • There's no doubt that this kind of benchmarking can help when the bounds of data are well known (if only approximately). Indeed, that's essentially how I concluded that max_workers of ~100 was best for me. However, part of my question was about portability. Just because 100 workers is good for me doesn't mean that it will be equally successful elsewhere. Not everyone has high-core CPUs to work with. I have concluded that for this particular case it's best to accept that performance might be less than ideal by not specifying max_workers at all. At least it will be portable – DarkKnight Feb 11 '22 at 16:09
  • @OlvinRoght Hmm... You are right. Perhaps create a mini benchmarking function (`calibrating_optimal_workers`) which sets the optimal workers before it runs the huge task. – Konstantinos Feb 11 '22 at 16:15
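The mini-benchmark suggested in the last comment could look something like this (a hypothetical sketch; the function name, candidate sizes, and sample-based approach are invented here, not part of any library):

```python
import time
from concurrent.futures import ThreadPoolExecutor

def calibrating_optimal_workers(task, sample, candidates=(4, 8, 16, 32, 64)):
    """Time `task` over a small sample for each candidate pool size
    and return the size that finished fastest."""
    best, best_time = candidates[0], float('inf')
    for workers in candidates:
        start = time.perf_counter()
        with ThreadPoolExecutor(max_workers=workers) as ex:
            list(ex.map(task, sample))
        elapsed = time.perf_counter() - start
        if elapsed < best_time:
            best, best_time = workers, elapsed
    return best

# e.g. calibrate on the first 20 proxies, then run the full list:
# n = calibrating_optimal_workers(try_proxy, proxies[:20])
# with ThreadPoolExecutor(max_workers=n) as executor: ...
```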