
I have external events (inotify, etc.) that are collected in one thread and put into a queue for another thread that has to request certain URLs generated by these events. Unfortunately, `requests.get`, even with the timeout set, sometimes pauses indefinitely. I think the server/Cloudflare is "blocking" the request and just keeps the connection open...

How can I set a timeout for `requests.get` that stops its execution after a certain time and then retries or steps over it?

import queue
import threading

import requests

json_queue = queue.Queue()

def thread1():
    while True:
        for event in some_external_event:  # placeholder for the inotify event source
            url = event
            json_queue.put(url)

def thread2():
    while True:
        try:
            item = json_queue.get()
            r = requests.get(url=item, timeout=3)
            # DOESN'T REACH THIS POINT
            json_queue.task_done()
        except:
            raise  # EXCEPTION NEVER GETS RAISED

t1 = threading.Thread(target=thread1)
t1.start()

t2 = threading.Thread(target=thread2)
t2.start()
Philipp
  • Put it inside a `try .. except` clause which reacts to the [`requests.Timeout`](https://requests.readthedocs.io/en/latest/api/#requests.Timeout) exception and either move the `task_done()` call into `else` or call `continue` in `except` (see the sketch after this comment thread). – Olvin Roght Jun 25 '22 at 20:03
  • @OlvinRoght In the real code it is in `try .. except`. An exception is never raised. The timeout from requests never gets used. I'll update the question... – Philipp Jun 25 '22 at 20:06
  • 1
    If you're executing threads properly and `requests.get()` call happens it's not possible that it hangs without exception. If that's complex project, add some logging to find out exact line of code where code execution stops. – Olvin Roght Jun 25 '22 at 20:10
  • @OlvinRoght I did a lot of logging, and it really does pause at the `requests.get()` line. It just sits there. The other thread `t1` continues without any problem. The script generally works, just to clarify, but this part pauses after a few hours. The complete script/app is not that complicated (100 lines). I could show/send it to you in total... – Philipp Jun 26 '22 at 17:24
  • 1
    A I told in comments under answer below, I am actively using requests in various projects, even high-load ones and never faced any problems. I've briefly sniffed functional codes in couple of them and noticed that I'm always initializing separate [`Session()`](https://requests.readthedocs.io/en/latest/user/advanced/#session-objects) for each thread. I am not sure does it change anything, but as you said that your project is tiny, you could add `sess = Session()` and use `sess.get(...)` after. – Olvin Roght Jun 26 '22 at 17:35
  • @OlvinRoght ok, I added it to the function. Since the problem appears after several hours, I'll have to wait to see if it helps. Just from the documentation I don't understand why this would help, since only this one thread is doing requests, in consecutive order, waiting for each response, and there should theoretically be no concurrency problems... But I'm not experienced with this. – Philipp Jun 26 '22 at 17:51
  • @OlvinRoght Here is the complete code: https://gist.github.com/pixply/2f07a7b69ccf1a22a6624ccef935c44f I'd love to donate for a solution – Philipp Jun 26 '22 at 17:55
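
A minimal sketch of what the comments above suggest, assuming the rest of the script stays as posted: catch `requests.Timeout` in the worker, `continue` past the failed URL, and give the worker thread its own `Session`. The log messages and the `sess` name are illustrative, not from the original code.

import queue
import threading

import requests

json_queue = queue.Queue()

def thread2():
    sess = requests.Session()  # one Session per worker thread, as suggested above
    while True:
        item = json_queue.get()
        try:
            r = sess.get(url=item, timeout=3)
        except requests.Timeout:
            print(f'timed out fetching {item}, stepping over it')
            continue  # skip this URL and wait for the next one
        except requests.RequestException as exc:
            print(f'request failed for {item}: {exc}')
            continue
        else:
            json_queue.task_done()  # only matters if json_queue.join() is ever used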

1 Answer


I understand the problem to be that executing

            print('starting')
            r = requests.get(url=item, timeout=3)
            print('success')

will show `starting`, but never `success`.

The docs explain urllib3's use of connection pools. You will want to allocate per-thread pools to avoid racing and stepping on toes.

import urllib3

pool = urllib3.PoolManager()
...
r = pool.request('GET', url=item, timeout=3)
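
To take the per-thread advice literally, one possible approach (a sketch, not part of the original answer) is to keep a lazily created `PoolManager` in `threading.local()` storage, so each worker thread gets its own pool; the helper names here are made up for illustration.

import threading

import urllib3

_local = threading.local()  # per-thread storage for the pool

def get_pool():
    # lazily create one PoolManager per thread
    if not hasattr(_local, 'pool'):
        _local.pool = urllib3.PoolManager()
    return _local.pool

def fetch(item):
    # each thread talks to its own pool, so connections are never shared across threads
    return get_pool().request('GET', url=item, timeout=3)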

See also the concerns raised in this SO article.

J_H
  • I'm using `requests` in dozens of projects with both multithreading and multiprocessing concurrency and have never had any problem with it. And I've certainly never seen the module hang on a request beyond the timeout value. – Olvin Roght Jun 26 '22 at 08:29
  • There is only one thread using `requests` (see above). Are you sure it can create a concurrency problem then? And it occurs after hours and thousands of successful calls... – Philipp Jun 26 '22 at 17:28
  • I wasn't taking the posted code literally, e.g. there's no .join() giving them a chance to run to completion. Ok, let's look at thread1. Are we _sure_ it is yielding? Do we see the core pegged, or mostly idle? I _hope_ that `some_external_event` lets us sleep until event happens, but the posted code doesn't make that completely obvious. Is there an opportunity to make debugging easier by inserting some voluntary yield into the middle of that `while True:` tight loop? Or at least maybe add the occasional helpful print()? Anything you can do to deliberately _provoke_ a race that causes the hang? – J_H Jun 26 '22 at 17:38
  • I think it's hard to understand just from my short snippet here. This script never "completes"; it runs as a Linux service. The moment a new file (inotify) is found (thread1) in a directory, it has to call a URL (thread2) to purge the cache on an nginx server. I'm logging both threads and thread1 is definitely working fine. It uses hardly any CPU (1-3%), since there is a new file about every 3 seconds and then a pause of 30 minutes after 2500 files. I could show you the "real" script, it's just 100 lines... – Philipp Jun 26 '22 at 17:41
  • @J_H Here is the complete script: https://gist.github.com/pixply/2f07a7b69ccf1a22a6624ccef935c44f I will happily donate for a solution. – Philipp Jun 26 '22 at 17:54
  • Yup. Sorry, no flash insights, still a mystery. There is an opportunity to deliberately generate 1-second heartbeat FS events (touch file) to verify the watch_dir thread keeps getting scheduled, just to rule out that it dove into epoll(2) and didn't emerge for quite a while. There is a bigger opportunity to try architecting this threaded app as a multi-process app. If you keep it threaded, you might have the clear_caches thread hit a squid cache or nginx proxy, to verify that the probed URL is quickly returning a `200` document. Just grasping at straws, really.... – J_H Jun 27 '22 at 02:57
  • @J_H Thanks for the feedback. The `watch_dir` thread keeps on adding to the queue. I can see it in the logs from line 43 (https://gist.github.com/pixply/2f07a7b69ccf1a22a6624ccef935c44f#file-cache-py-L43). Is there any way that I can set a timeout for the execution of the line `requests.get...`, so that it just skips it completely if it halts for a certain time? – Philipp Jun 27 '22 at 10:07
  • We are in agreement that the current timeout of 3 should already be doing that (assuming that all components are obeying threading rules). We are stuck on figuring out why that's failing, why the thread gets stuck within the `.get()`. So my fallback position was to suggest doing the `.get()` in a separate (single thread) process. That would break apart the scheduling details, let us `sigkill()` the child process, and so on. I suppose an _alternate_ fallback position could be to let a thousand `.get()` threads bloom, spawning a new one for each URL then cleaning it up. Maybe only a few leak? – J_H Jun 27 '22 at 13:40
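
A hedged sketch of that fallback, not taken from the thread: run each `.get()` in a short-lived child process so the parent can enforce a hard wall-clock limit and kill the child if it ever hangs inside the request. The helper names and the 10-second limit are made up for illustration.

import multiprocessing

import requests

def _fetch(url):
    # runs in the child process, so a hang here cannot block the main service
    try:
        requests.get(url, timeout=3)
    except requests.RequestException:
        pass  # the parent only cares whether the child finishes in time

def fetch_with_hard_timeout(url, limit=10):
    # hard wall-clock limit enforced by the parent, independent of requests' own timeout
    p = multiprocessing.Process(target=_fetch, args=(url,))
    p.start()
    p.join(limit)
    if p.is_alive():
        p.terminate()  # or p.kill() for SIGKILL on Python 3.7+
        p.join()
        print(f'hard timeout after {limit}s for {url}, skipping')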