I am trying to debug a Flask/Gunicorn application running under Python 3.6.3 on RHEL 7.7. Part of the application sends queries (GET requests) to a backend API that runs on a number of different servers/URLs. Since we are I/O-bound, each of these queries runs in parallel on its own thread. The problem I'm dealing with is that these requests occasionally take excessively long (orders of magnitude longer than normal) to complete, and are sometimes terminated unexpectedly. Here is a greatly simplified version of the relevant code:
from concurrent.futures import ThreadPoolExecutor, as_completed

import requests  # Version 2.22.0, OpenSSL 1.0.2k-fips 26 Jan 2017

def query(url):
    response = requests.get(url)
    return response.json()

url_list = [url1, url2, ...]
executor = ThreadPoolExecutor(max_workers=len(url_list))
request_objects = list()

# Schedule queries to run in separate threads
for url in url_list:
    request_objects.append(executor.submit(query, url=url))

# Join all threads
list(as_completed(request_objects))
Usually, this works as expected and everything is fine. But the requests.get call is sporadically raising various errors, all of which have to do with the connection being suddenly and unexpectedly terminated. I am unable to reproduce these errors in any sort of deterministic way. They include:
- requests.exceptions.ConnectionError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response',))
- requests.exceptions.SSLError: EOF occurred in violation of protocol (_ssl.c:777)
- requests.exceptions.ChunkedEncodingError: ('Connection broken: IncompleteRead(0 bytes read, 2 more expected)', IncompleteRead(0 bytes read, 2 more expected))
- requests.exceptions.SSLError: TLS/SSL connection has been closed (EOF) (_ssl.c:777)
Some detective work in the error logs made the problem more interesting. These termination errors tend to occur in short bursts, usually all within a single Gunicorn worker process, when a number of automated requests to our API come in (from Telegraf) at roughly the same time. But the request durations (the length of time the call to requests.get blocks) are especially interesting. For one group of errors, the durations, in seconds, of the queries that failed were:
- 205.52187418937683
- 205.52265071868896
- 205.6381540298462
And the durations of successful queries were:
- 2.622708320617676
- 2.64787220954895
- 2.939096212387085
- 2.9614670276641846
- 2.970625400543213
- 3.083200693130493
- 3.0918056964874268
- 204.15788435935974
- 204.15829873085022
- 204.17250561714172
- 204.1836473941803
- 204.19148349761963
- 204.19750094413757
- 204.70406007766724
- 205.1262927055359
- 205.12787008285522
- 205.51073217391968
- 205.5255045890808
- 205.52853202819824
- 205.53887248039246
- 205.57158827781677
- 205.58390355110168
- 205.78088784217834
- 205.80147290229797
- 205.8019723892212
- 205.80248093605042
- 207.37307357788086
- 207.37479543685913
- 207.3749725818634
- 207.38070940971375
- 207.39173579216003
Notice the clustering of the failed durations around 205 seconds (meaning all three connections were terminated at about the same time), and the sharp jump from GET requests blocking for roughly 3 seconds to 200+ seconds. From the access logs of the backend API, I've confirmed that these queries are not taking anywhere near 200 seconds to process--at least, not nearly this many of them. The durations of a few seconds roughly reflect how long the backend actually takes to process a query and return data, but the longer durations (failed and successful alike) do not. Something is delaying our reception of the response and, in some cases, severing the connection even after the backend API has returned valid data.
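(For clarity, the durations above are the wall-clock time that the requests.get call blocks inside query, captured with a thin wrapper roughly like the sketch below. The logger setup and message format here are illustrative simplifications, not the exact production code.)

import logging
import time

import requests

logger = logging.getLogger(__name__)

def query(url):
    start = time.perf_counter()
    try:
        response = requests.get(url)
    finally:
        # Log how long the call blocked, whether it returned or raised
        logger.info("GET %s blocked for %.2f s", url, time.perf_counter() - start)
    return response.json()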
Having ruled out the possibility that the backend API is overloaded and genuinely taking that long to respond (which would produce a different, known error), I suspect that something like the OS, a proxy, SSL, or something else in the network chain is causing these hang-ups. But I don't know enough to offer any specific guesses, much less confirm them.
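One detail that may be relevant: requests.get has no default timeout, and as the code above shows we do not currently set one, so each call can block indefinitely. A bounded version of query would look roughly like the sketch below (the 5-second connect / 30-second read values are arbitrary). I mention it only to show where a bound would go; even with a timeout in place, I would still want to understand what is actually terminating these connections.

import requests

def query(url):
    # timeout=(connect, read): arbitrary values that only bound how long
    # requests.get can block; they do nothing to address the root cause
    response = requests.get(url, timeout=(5, 30))
    return response.json()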