I am trying to debug a Flask/Gunicorn application running under Python 3.6.3 on RHEL 7.7. Part of the application sends queries (GET requests) to a backend API that runs on a number of different servers/URLs. Since we are I/O-bound, each of these queries runs in parallel on its own thread. The problem I'm dealing with is that these requests occasionally take excessively long (orders of magnitude longer than normal) to complete, and are sometimes terminated unexpectedly. Here is a greatly simplified version of the relevant code:
from concurrent.futures import ThreadPoolExecutor, as_completed

import requests  # Version 2.22.0, OpenSSL 1.0.2k-fips 26 Jan 2017

def query(url):
    response = requests.get(url)
    return response.json()

url_list = [url1, url2, ...]
executor = ThreadPoolExecutor(max_workers=len(url_list))
request_objects = list()

# Schedule queries to run in separate threads
for url in url_list:
    request_objects.append(executor.submit(query, url=url))

# Join all threads
list(as_completed(request_objects))
Usually, this works as expected and everything is fine. But the requests.get call is sporadically raising various errors, all of which have to do with the connection being suddenly and unexpectedly terminated. I am unable to reproduce these errors in any sort of deterministic way. They include:
- requests.exceptions.ConnectionError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response',))
- requests.exceptions.SSLError: EOF occurred in violation of protocol (_ssl.c:777)
- requests.exceptions.ChunkedEncodingError: ('Connection broken: IncompleteRead(0 bytes read, 2 more expected)', IncompleteRead(0 bytes read, 2 more expected))
- requests.exceptions.SSLError: TLS/SSL connection has been closed (EOF) (_ssl.c:777)
Some detective work in the error logs made the problem more interesting. These termination errors tend to occur in short bursts, usually all within a single Gunicorn worker process, when a number of automated requests to our API come in (from Telegraf) at roughly the same time. But the request durations (the length of time the call to requests.get blocks) are especially interesting. For one group of errors, the durations, in seconds, of the queries that failed were:
- 205.52187418937683
- 205.52265071868896
- 205.6381540298462
And the durations of successful queries were:
- 2.622708320617676
- 2.64787220954895
- 2.939096212387085
- 2.9614670276641846
- 2.970625400543213
- 3.083200693130493
- 3.0918056964874268
- 204.15788435935974
- 204.15829873085022
- 204.17250561714172
- 204.1836473941803
- 204.19148349761963
- 204.19750094413757
- 204.70406007766724
- 205.1262927055359
- 205.12787008285522
- 205.51073217391968
- 205.5255045890808
- 205.52853202819824
- 205.53887248039246
- 205.57158827781677
- 205.58390355110168
- 205.78088784217834
- 205.80147290229797
- 205.8019723892212
- 205.80248093605042
- 207.37307357788086
- 207.37479543685913
- 207.3749725818634
- 207.38070940971375
- 207.39173579216003
Notice the clustering of the failed durations around 205 seconds (meaning all three connections were terminated at about the same time), and the sharp jump from GET requests blocking for roughly 3 seconds to 200+ seconds. From the access logs of the backend API, I've confirmed that these queries are not taking anywhere near 200 seconds to process--at least, not nearly this many of them. The durations of a few seconds roughly reflect how long the backend actually takes to process a query and return data, but the longer durations (failed and successful alike) do not. Something is delaying our reception of the response and, in some cases, severing the connection even after the backend API has returned valid data.
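(For clarity, the durations above are the wall-clock time that the requests.get call blocks inside query, captured with a thin wrapper roughly like the sketch below. The logger setup and message format here are illustrative simplifications, not the exact production code.)

import logging
import time

import requests

logger = logging.getLogger(__name__)

def query(url):
    start = time.perf_counter()
    try:
        response = requests.get(url)
    finally:
        # Log how long the call blocked, whether it returned or raised
        logger.info("GET %s blocked for %.2f s", url, time.perf_counter() - start)
    return response.json()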
Having ruled out the possibility that the backend API is overloaded and genuinely taking that long to respond (which would produce a different, known error), I suspect that something like the OS, a proxy, SSL, or something else in the network chain is causing these hang-ups. But I don't know enough to offer any specific guesses, much less confirm them.
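One detail that may be relevant: requests.get has no default timeout, and as the code above shows we do not currently set one, so each call can block indefinitely. A bounded version of query would look roughly like the sketch below (the 5-second connect / 30-second read values are arbitrary). I mention it only to show where a bound would go; even with a timeout in place, I would still want to understand what is actually terminating these connections.

import requests

def query(url):
    # timeout=(connect, read): arbitrary values that only bound how long
    # requests.get can block; they do nothing to address the root cause
    response = requests.get(url, timeout=(5, 30))
    return response.json()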