
I am trying to debug a Flask/Gunicorn application running with Python 3.6.3 on RHEL 7.7, part of which sends queries (GET requests) to a backend API that runs on a number of different servers/URLs. Since we are IO-limited, each of these queries runs in parallel on its own thread. The problem I'm dealing with is that these requests occasionally take excessively long (orders of magnitude longer than usual) to complete, and are sometimes even terminated unexpectedly. Here is a greatly simplified version of the relevant code:

from concurrent.futures import ThreadPoolExecutor, as_completed

import requests # Version 2.22.0, OpenSSL 1.0.2k-fips  26 Jan 2017

def query(url):
    response = requests.get(url)
    return response.json()

url_list = [url1, url2, ...]

executor = ThreadPoolExecutor(max_workers=len(url_list))
request_objects = list()

# Schedule queries to run in separate threads
for url in url_list:
    request_objects.append(executor.submit(query, url=url))

# Wait for all queries to complete
list(as_completed(request_objects))

Usually, this works as expected and everything is fine. But the requests.get call is sporadically raising various errors, all of which have to do with the connection being suddenly and unexpectedly terminated. I am unable to reproduce these errors in any sort of deterministic way. They include:

  • requests.exceptions.ConnectionError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response',))
  • requests.exceptions.SSLError: EOF occurred in violation of protocol (_ssl.c:777)
  • requests.exceptions.ChunkedEncodingError: ('Connection broken: IncompleteRead(0 bytes read, 2 more expected)', IncompleteRead(0 bytes read, 2 more expected))
  • requests.exceptions.SSLError: TLS/SSL connection has been closed (EOF) (_ssl.c:777)
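These exceptions are captured by the futures and only surface once each future's result is retrieved; in the simplified snippet above, in place of the list(as_completed(...)) line, that consumption looks roughly like this (error handling simplified for illustration):

for future in as_completed(request_objects):
    try:
        data = future.result()  # re-raises anything query() raised on its worker thread
    except requests.exceptions.RequestException as exc:
        # All of the errors listed above derive from RequestException
        print(type(exc).__name__, exc)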

After some detective work in the error logs, the problem got more interesting. These termination errors tend to occur in short bursts, usually all within one Gunicorn worker process, when a number of automated requests to our API come in (from Telegraf) at roughly the same time. But the request durations (the length of time the call to requests.get blocks) are especially interesting. For one group of errors, the durations (in seconds) of the queries that failed were:

  • 205.52187418937683
  • 205.52265071868896
  • 205.6381540298462

And the durations of successful queries were:

  • 2.622708320617676
  • 2.64787220954895
  • 2.939096212387085
  • 2.9614670276641846
  • 2.970625400543213
  • 3.083200693130493
  • 3.0918056964874268
  • 204.15788435935974
  • 204.15829873085022
  • 204.17250561714172
  • 204.1836473941803
  • 204.19148349761963
  • 204.19750094413757
  • 204.70406007766724
  • 205.1262927055359
  • 205.12787008285522
  • 205.51073217391968
  • 205.5255045890808
  • 205.52853202819824
  • 205.53887248039246
  • 205.57158827781677
  • 205.58390355110168
  • 205.78088784217834
  • 205.80147290229797
  • 205.8019723892212
  • 205.80248093605042
  • 207.37307357788086
  • 207.37479543685913
  • 207.3749725818634
  • 207.38070940971375
  • 207.39173579216003

Notice the clustering of the failed durations at around 205 seconds (meaning all three connections were terminated at about the same time), and the sharp jump from GET requests blocking for 3 seconds to 200+ seconds. From the access logs of the backend API, I've confirmed that these queries are not taking anywhere near 200 seconds to process--at least, not nearly this many of them. The query times of a few seconds roughly reflect how long the backend is actually taking to process the query and return data, but the longer times (failed and succeeded) do not. Something is delaying our reception of the response and, in some cases, breaking the connection off even after the backend API has returned valid data.
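For context, these durations are measured around the requests.get call itself. A simplified sketch of that instrumentation (the timing and logging details here are illustrative, not the exact production code):

import logging
import time

import requests

log = logging.getLogger(__name__)

def query(url):
    start = time.monotonic()
    try:
        response = requests.get(url)
    finally:
        # Log how long requests.get blocked, whether it succeeded or failed
        log.info("GET %s blocked for %s seconds", url, time.monotonic() - start)
    return response.json()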

I have ruled out the possibility that the backend API is overloaded and genuinely taking that long to respond (which would produce a different, known error). My suspicion is that something like the OS, a proxy, SSL handling, or something else in the network chain is causing these hangups, but I don't know enough to offer any specific guesses, much less confirm them.
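One more detail: as in the simplified snippet above, no explicit timeout is passed to requests.get, so a stalled connection can block its worker thread indefinitely. A connect/read timeout pair would at least bound that (the values below are arbitrary examples), though it would only contain the symptom rather than explain what is severing these connections:

def query(url):
    # timeout=(connect timeout, read timeout) in seconds; requests raises
    # ConnectTimeout/ReadTimeout instead of blocking indefinitely if either limit is exceeded
    response = requests.get(url, timeout=(5, 30))
    return response.json()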

  • This should work (albeit possibly slower, but not necessarily) with `max_workers` set to a lower number; have you tried that? – DisappointedByUnaccountableMod Sep 24 '19 at 20:32
  • If you’re using `requests.Session()` it may not be thread-safe. Also see https://stackoverflow.com/questions/18188044/is-the-session-object-from-pythons-requests-library-thread-safe#20457621 and https://github.com/psf/requests/issues/1871, although that is very old. – DisappointedByUnaccountableMod Sep 24 '19 at 20:41
  • I'm fairly sure I am not; I don't see any reference to it in the code base, and each thread is making a single request, so it would be superfluous. (It looks like get() does create a Session behind the scenes, but in that case there should be at least one per thread.) Why would using fewer workers speed it up? – David Pitchford Sep 25 '19 at 12:31
  • `max_workers` being lower is more a case of possibly not slowing things down - but there will be fewer active SSL connections, and perhaps that will make the problem you are seeing less likely. You don’t know why you are having problems, and neither do I; I'm just trying to explore options. – DisappointedByUnaccountableMod Sep 25 '19 at 19:12

0 Answers