
I am using urllib2's build_opener() to create an OpenerDirector. I am using the OpenerDirector to fetch a slow page and so it has a large timeout.

So far, so good.

However, in another thread, I have been told to abort the download - let's say the user has selected to exit the program in the GUI.

Is there a way to signal an urllib2 download should quit?

Oddthinking
  • Some more information about the reasons for this would be useful - as you've stated, killing a thread is really undesirable; other approaches depend on whether you can daemonise processes, and then have a shared Queue etc... – Jon Clements Aug 05 '12 at 15:40
  • Eli Bendersky has great posts about stopping work on Python threads. Read [here](http://eli.thegreenplace.net/2011/08/22/how-not-to-set-a-timeout-on-a-computation-in-python/) and [here](http://eli.thegreenplace.net/2011/12/27/python-threads-communication-and-stopping/) – JBernardo Aug 05 '12 at 15:41
  • @JBernardo: Well that's gone and dashed that plan. TerminateThread (windows) can lead to deadlock? Ugh. I said it was ugly, but I didn't realise it would be dangerous. – Oddthinking Aug 05 '12 at 16:30
  • @Jon: I am not sure what further information you may require. One thread is fetching a web-page which is slow to return. The other thread is tasked with shutting it down, because the user cancelled. Daemon *threads* will only work when the whole app is shutting down, but that may be useful. – Oddthinking Aug 05 '12 at 16:35
  • Okay, then daemonise the process handling the threads - then establish a pipe/RPC etc... to tell it to "die" - then it can handle clean up – Jon Clements Aug 05 '12 at 16:55
  • Excellent question. Can't offer any suggestions. We had to work around a number of stdlib limitations while writing the IA wayback machine's [live web proxy](https://github.com/internetarchive/liveweb). Interruption was not one of our requirements but fine tuned timeouts were. – Noufal Ibrahim Aug 05 '12 at 16:58
  • Presumably the client listening to the pipe connection would have a very short time-out, so it could check if it had been shut-down. This is effective if the command from the user is "Shut down the application." If it is "Abort the current operation, and stop wasting the server's time, so it can focus on my next task.", this daemonising won't help. Nonetheless, put it as an answer and I will give it a vote. – Oddthinking Aug 05 '12 at 16:59
  • @Noufal: I am still reeling a little that it wasn't much simpler to answer. Seems a common enough issue; I didn't realise it was a Python minefield. – Oddthinking Aug 05 '12 at 17:00
  • @Oddthinking: Python comes with batteries included but some of the batteries are [voltaic piles](http://en.wikipedia.org/wiki/Voltaic_pile) that are nice to exhibit and not much more. – Noufal Ibrahim Aug 05 '12 at 17:04
  • @Oddthinking It's not a Python issue. Every language you try will warn about killing threads (because of memory leaks) or will not allow it at all. You can use the `multiprocessing` package if performance is not an issue – JBernardo Aug 05 '12 at 17:09
  • @JBernardo: Ada is coming up to its 30th birthday. Aborting tasks (a.k.a. threads) is part of [the core language](http://www.adaic.org/resources/add_content/standards/05rm/html/RM-9-8.html). Multiprocessing is a legitimate answer though. – Oddthinking Aug 05 '12 at 17:20
  • @Oddthinking I don't see the word `thread` there (I don't know how it's implemented). Also, if they used threads, someone probably took care of the problems it could generate. Killing a thread may generate so many problems in an interpreted language (such as Python) that it is a very bad idea. – JBernardo Aug 05 '12 at 17:24
  • More thoughts: 1) If the request is running in the main process thread, you could generate SIGINT to interrupt it. 2) If the server is slow because it is performing an expensive computation, then perhaps it is best to implement the abort request on the server rather than the client. – Luke Aug 05 '12 at 18:24
  • @JBernardo: For the record, Ada tasks are similar to threads; they are individual units of concurrency that run in parallel, and that share memory space, but each have their own stack. The choice between co-operative versus pre-emptive wasn't set by the language. I never used Ada tasks on *nix systems, so I don't know how they are actually implemented; if all blocking operations accepted interrupt signals, that would have been sufficient (which brings the "blame" back to Python; its standard libraries should ideally accept interrupt messages on all blocking ops.) – Oddthinking Aug 06 '12 at 14:29

4 Answers


There is no clean answer. There are several ugly ones.

Initially, I was putting rejected ideas in the question. As it has become clear that there are no right answers, I decided to post the various sub-optimal alternatives as a list answer. Some of these are inspired by comments, thank you.

Library Support

An ideal solution would be if OpenerDirector offered a cancel operation.

It does not. Library writers take note: if you provide long slow operations, you need to provide a way to cancel them if people are to use them in real-world applications.

Reduce timeout

As a general solution for others, this may work. With a smaller timeout, the code would be more responsive to changes in circumstances. However, it will also cause downloads to fail if they haven't completely finished within the timeout, so this is a trade-off. In my situation, it is untenable.
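
For those for whom it is workable, a rough sketch of the retry variant (opener and stop_requested, a threading.Event set by the GUI, are assumed to exist; the URL is a placeholder, and a real version would distinguish timeouts from other failures):

import socket
import urllib2

# Keep each attempt short so the abort flag can be checked between tries.
data = None
while not stop_requested.is_set():
    try:
        data = opener.open('http://example.com/slow-page', timeout=5).read()
        break
    except (socket.timeout, urllib2.URLError):
        pass  # most likely a timeout; check the abort flag and retry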

Read the download in chunks

Again, as a general solution, this may work. If the download consists of very large files, you can read them in small chunks, and abort after a chunk is read.
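
A minimal sketch of the idea (stop_requested is again an assumed threading.Event set by the GUI; the URL and chunk size are placeholders):

response = opener.open('http://example.com/big-file', timeout=600)
chunks = []
while not stop_requested.is_set():
    chunk = response.read(8192)  # returns '' once the download is complete
    if not chunk:
        break
    chunks.append(chunk)
data = ''.join(chunks)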

Unfortunately, if (as in my case) the delay is in receiving the first byte, rather than the size of the file, this will not help.

Kill the entire thread

While there are some aggressive techniques to kill threads, depending on the operating system, they are not recommended. In particular, they can cause deadlocks to occur. See Eli Bendersky's two articles (via @JBernardo).

Just be unresponsive

If the abort operation has been triggered by the user, it may be simplest to just be unresponsive, and not act on the request until the open operation has completed.

Whether this unresponsiveness is acceptable to your users (hint: no!), is up to your project.

It also continues to place a demand on the server, even if the result is known to be unneeded.

Let it peter out in another thread

If you create a separate thread to run the operation, and then communicate with that thread in an interruptable manner, you could discard the blocked thread and start working on the next operation instead. Eventually, the thread will unblock and then it can gracefully shut down.

The thread should be a daemon, so it doesn't block the total shut-down of the application.

This will give the user responsiveness, but it means the server will need to continue serving the request, even though the result is not needed.
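
A rough sketch of the hand-off (the URL is a placeholder, and aborted() is an assumed check provided by the GUI):

import threading
import urllib2
import Queue

def fetch(url, results):
    try:
        results.put(urllib2.urlopen(url, timeout=600).read())
    except Exception as e:
        results.put(e)

results = Queue.Queue()
worker = threading.Thread(target=fetch,
                          args=('http://example.com/slow-page', results))
worker.daemon = True  # so it cannot block total shut-down of the application
worker.start()

# Poll with a short timeout so the caller stays responsive; on abort,
# stop polling and let the blocked thread peter out on its own.
while not aborted():
    try:
        result = results.get(timeout=0.1)
        break
    except Queue.Empty:
        pass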

Rewrite the socket methods to be polling-based

As described in @Luke's answer, it may be possible to provide (fragile?, unportable?) extensions to the standard Python libraries.

His solution changes the socket operations from blocking to polling. Another might allow shutdown through the socket.shutdown() method (if that, indeed, will interrupt a blocked socket - not tested.)

A solution based on Twisted may be cleaner. See below.

Replace the sockets with asynchronous, non-thread-based libraries

The Twisted framework provides a replacement set of libraries for network operations that are event-driven. I understand this means that all of the different communications can be handled by a single thread with no blocking.
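
For example, later Twisted versions let an in-flight request be cancelled; a rough, untested sketch using Twisted's Agent API (the URL and the 5-second abort are placeholders standing in for a user cancellation):

from twisted.internet import reactor
from twisted.web.client import Agent

agent = Agent(reactor)
d = agent.request('GET', 'http://example.com/slow-page')

def on_response(response):
    print 'HTTP status:', response.code
    reactor.stop()

def on_failure(failure):
    print 'failed or cancelled:', failure  # CancelledError after d.cancel()
    reactor.stop()

d.addCallbacks(on_response, on_failure)
reactor.callLater(5, d.cancel)  # stand-in for a user abort
reactor.run()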

Sabotage

It may be possible to navigate the OpenerDirector to find the base-level socket that is blocking, and sabotage it directly (will socket.shutdown() be sufficient?) to make it return.
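
One (untested) way to get a handle on that socket is to record it from a custom connection class, so the aborting thread has something to sabotage (all names here are made up for the sketch):

import socket
import httplib
import urllib2

live_sockets = []  # shared with the aborting thread

class RecordingConnection(httplib.HTTPConnection):
    def connect(self):
        httplib.HTTPConnection.connect(self)
        live_sockets.append(self.sock)  # keep a handle on the base-level socket

class RecordingHandler(urllib2.HTTPHandler):
    def http_open(self, req):
        return self.do_open(RecordingConnection, req)

opener = urllib2.build_opener(RecordingHandler())

# Meanwhile, from the aborting thread (untested whether this unblocks the reader):
for sock in live_sockets:
    sock.shutdown(socket.SHUT_RDWR)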

Yuck.

Put it in a separate (killable) process

The thread that reads the socket can be moved into a separate process, and interprocess communication can be used to transmit the result. This IPC can be aborted early by the client, and then the whole process can be killed.
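
A minimal sketch using the standard multiprocessing module (the URL is a placeholder; on Windows this needs the usual __main__ guard):

import multiprocessing
import urllib2

def fetch(url, results):
    # Blocking indefinitely is harmless here: the whole process can be killed.
    results.put(urllib2.urlopen(url, timeout=600).read())

results = multiprocessing.Queue()
worker = multiprocessing.Process(
    target=fetch, args=('http://example.com/slow-page', results))
worker.start()

# ... later, when the user cancels:
worker.terminate()  # unlike threads, killing a process is supported
worker.join()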

Ask the Web Server to cancel

If you have control over the web server being read, you could send it a separate message asking it to close the socket. That should cause the blocked client to react.

Oddthinking

I don't see any built-in mechanism to accomplish this. I would just move the OpenerDirector out to its own process so it would be safe to kill it.

Note: there is no way to 'kill' a thread in Python (thanks JBernardo). It may, however, be possible to generate an exception in the thread, but it's likely this won't work if the thread is blocking on a socket.

Luke
  • You [cannot simply kill a thread in Python](http://eli.thegreenplace.net/2011/08/22/how-not-to-set-a-timeout-on-a-computation-in-python/) – JBernardo Aug 05 '12 at 15:39

Here's a start for another approach. It works by extending part of the httplib stack to include a non-blocking check for the server response. You would have to make a few changes to implement this within your thread. Also note that it uses some undocumented bits of urllib2 and httplib, so the final solution for you will probably depend on the version of Python you are using (I have 2.7.3). Poke around in your urllib2.py and httplib.py files; they're quite readable.

import urllib2, httplib, select, time

class Response(httplib.HTTPResponse):
    def _read_status(self):
        ## Do non-blocking checks for server response until something arrives.
        while True:
            sel = select.select([self.fp.fileno()], [], [], 0)
            if len(sel[0]) > 0:
                break
            ## <--- Right here, check to see whether thread has requested to stop
            ##      Also check to see whether timeout has elapsed
            time.sleep(0.1)
        return httplib.HTTPResponse._read_status(self)

class Connection(httplib.HTTPConnection):
    response_class = Response

class Handler(urllib2.HTTPHandler):
    def http_open(self, req):
        return self.do_open(Connection, req)

h = Handler()
o = urllib2.build_opener(h)
url = 'http://example.com/slow-page'  # placeholder URL
f = o.open(url)
print f.read()

Also note that there are many places in the stack that could potentially block; this example only covers one of them--the server has received the request but takes a long time to respond.
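
For instance, the marked check could be wired to a shared flag; a hedged sketch reusing the imports and classes above (stop_requested is a hypothetical threading.Event owned by the GUI):

import threading

stop_requested = threading.Event()  # hypothetical: set from the GUI to abort

class AbortableResponse(Response):
    def _read_status(self):
        ## Same polling loop as above, with the stop check filled in.
        while True:
            sel = select.select([self.fp.fileno()], [], [], 0)
            if len(sel[0]) > 0:
                break
            if stop_requested.is_set():
                raise IOError('download aborted by user')
            time.sleep(0.1)
        return httplib.HTTPResponse._read_status(self)

Connection.response_class = AbortableResponse  # wire it into the handler above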

Luke

Because of the blocking nature of urllib, I find it most appropriate to place all your urllib-related jobs in threads. It is then possible to abort tasks altogether, including requests. Killing threads is indeed unsafe, but raising exceptions in them should be safe.

So this is how to raise an exception in a thread (doc):

import ctypes

# Note: your_exception must be an exception *class*, not an instance.
ctypes.pythonapi.PyThreadState_SetAsyncExc(ctypes.c_long(your_thread.ident),
                                           ctypes.py_object(your_exception))

If the socket is blocking (e.g. connecting) at that moment, the exception will be raised as soon as the thread becomes active again, i.e. once it returns to executing Python bytecode.
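
For illustration, a rough sketch of how this might be wired up (AbortDownload and the URL are made up for the example):

import ctypes
import threading
import urllib2

class AbortDownload(Exception):
    pass

def fetch():
    try:
        urllib2.urlopen('http://example.com/slow-page', timeout=600).read()
    except AbortDownload:
        pass  # tidy up and let the thread finish

worker = threading.Thread(target=fetch)
worker.start()

# From the GUI thread: pass the exception class, not an instance.
# It only fires once the thread is executing Python bytecode again.
ctypes.pythonapi.PyThreadState_SetAsyncExc(
    ctypes.c_long(worker.ident), ctypes.py_object(AbortDownload))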

user