
I'm using Python http.client.HTTPResponse.read() to read data from a stream. That is, the server keeps the connection open forever and sends data periodically as it becomes available. There is no expected length of response. In particular, I'm getting Tweets through the Twitter Streaming API.

To accomplish this, I repeatedly call http.client.HTTPResponse.read(1) to get the response, one byte at a time. The problem is that the program will hang on that line if there is no data to read, which there isn't for large periods of time (when no Tweets are coming in).

I'm looking for a method that will get a single byte of the HTTP response, if available, but that will fail instantly if there is no data to read.

I've read that you can set a timeout when the connection is created, but setting a timeout on the connection defeats the whole purpose of leaving it open for a long time waiting for data to come in. I don't want to set a timeout; I want to read data if there is data to be read, or fail if there is not, without waiting at all.

I'd like to do this with what I have now (using http.client), but if it's absolutely necessary that I use a different library to do this, then so be it. I'm trying to write this entirely myself, so suggesting that I use someone else's already-written Twitter API for Python is not what I'm looking for.

This code gets the response. It runs in a separate thread from the main one:

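while True:
    try:
        # Blocks until one byte arrives or the connection fails.
        readByte = dc.request.read(1)
    except Exception:
        readByte = b""

    if len(readByte) != 0:
        # Hold the lock while appending so other threads never see a partial update.
        with dc.responseLock:
            dc.response = dc.response + chr(readByte[0])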

Note that the request is stored in dc.request and the response in dc.response; both are created elsewhere. dc.responseLock is a Lock that prevents dc.response from being accessed by multiple threads at once.

With this running on a separate thread, the main thread can then get dc.response, which contains the entire response received so far. New data is added to dc.response as it comes in without blocking the main thread.

This works perfectly when it's running, but I run into a problem when I want it to stop. I changed my while statement to while not dc.twitterAbort, so that when I want to abort this thread I just set dc.twitterAbort to True, and the thread will stop.

But it doesn't. This thread remains for a very long time afterward, stuck on the dc.request.read(1) part. There must be some sort of timeout, because it does eventually get back to the while statement and stop the thread, but it takes around 10 seconds for that to happen.

How can I get my thread to stop immediately when I want it to, if it's stuck on the call to read()?

Again, this method works for getting Tweets; the only problem is getting it to stop. If I'm going about this entirely the wrong way, feel free to point me in the right direction. I'm new to Python, so I may be overlooking some easier way of going about this.

gfrung4

1 Answer


Your idea is not new: there are OS mechanisms(*) for ensuring that an application makes I/O-related system calls only when they are guaranteed not to block. These mechanisms are usually used by async I/O frameworks such as tornado or gevent. Use one of those, and you will find it very easy to run code "while" your application is waiting for an I/O event, such as incoming data on a socket.
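To illustrate the kind of OS mechanism meant here, the following is a minimal sketch using select() with a zero timeout to ask whether a socket has data before reading from it. The helper name read_if_ready and the local socket pair standing in for the streaming connection are assumptions made for this example, not part of the question's code. Note also that http.client reads through a buffered file object, so checking the raw socket this way may not account for bytes that are already buffered; treat this as a demonstration of the mechanism, not a drop-in fix.

```python
import select
import socket


def read_if_ready(sock, max_bytes=1):
    """Return up to max_bytes if data is waiting, else None, without blocking.

    Hypothetical helper for illustration only.
    """
    # A timeout of 0 makes select() poll and return immediately.
    readable, _, _ = select.select([sock], [], [], 0)
    if not readable:
        return None
    return sock.recv(max_bytes)


# A connected local socket pair stands in for the streaming connection.
server, client = socket.socketpair()

first = read_if_ready(client)   # nothing has been sent yet
server.sendall(b"x")
second = read_if_ready(client)  # one byte is now waiting

server.close()
client.close()
```

The call returns immediately in both cases: the caller can tell "no data yet" apart from "here is a byte" without setting a timeout on the connection itself.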

If you use gevent's monkey-patching method, you can proceed using http.client, as requested. You just need to get used to the cooperative scheduling paradigm introduced by gevent/greenlets, in which your execution flow "jumps" between sub-routines.

Of course you can also perform blocking I/O in another thread (like you did), so that it does not affect the responsiveness of your main thread. Regarding your "How can I get my thread to stop immediately" problem:

  • Forcing a thread that is blocked in a system call to stop is usually not a clean or even valid process (also see "Is there any way to kill a Thread in Python?"). Either you take down the entire process once your application has finished its work, which also ends all of its threads, or you leave the thread alone and give it as much time to terminate as it needs (the 10 seconds you mentioned are not a problem, are they?).

  • If you do not want to have such long-blocking system calls anywhere in your application (be it in the main thread or not), then use the above-mentioned techniques to prevent blocking system calls.
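One way to combine the two points above, sketched under assumptions: the reader thread waits on the socket with select() for a short interval and rechecks a stop flag between waits, so it never sits in a long blocking read. The names reader_loop, stop_event, and poll_interval are hypothetical, and a local socket pair again stands in for the real connection. This works on a raw socket; it is not how http.client itself behaves.

```python
import select
import socket
import threading
import time


def reader_loop(sock, stop_event, chunks, lock, poll_interval=0.1):
    """Collect incoming bytes until stop_event is set or the peer closes."""
    while not stop_event.is_set():
        # Wait at most poll_interval so the stop flag is rechecked promptly.
        readable, _, _ = select.select([sock], [], [], poll_interval)
        if not readable:
            continue
        data = sock.recv(4096)
        if not data:  # empty bytes means the peer closed the connection
            break
        with lock:
            chunks.append(data)


server, client = socket.socketpair()
stop_event = threading.Event()
chunks, lock = [], threading.Lock()
worker = threading.Thread(
    target=reader_loop, args=(client, stop_event, chunks, lock)
)
worker.start()

server.sendall(b"tweet")

# Give the worker a moment to pick up the data (bounded wait).
deadline = time.monotonic() + 2
while time.monotonic() < deadline:
    with lock:
        if chunks:
            break
    time.sleep(0.01)

# Ask the worker to stop; it notices within roughly poll_interval seconds.
stop_event.set()
worker.join(timeout=2)

server.close()
client.close()
```

The trade-off is a small polling cost in exchange for a bounded shutdown delay; the connection itself stays open for as long as the stop flag is not set.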

(*) see e.g. O_NONBLOCK option in http://man7.org/linux/man-pages/man2/open.2.html

Dr. Jan-Philip Gehrcke