First, be aware that you may have already gotten all or most of each page into your OS buffers, NIC, router, or ISP before you cancel, so there may be no benefit at all to doing this. And there will be a cost—you can't reuse connections if you close them early; you have to recv
smaller pieces at a time if you want to be able to cancel early; etc.
If you have a rough idea of how many bytes you probably need to read (better to often go a little bit over than to sometimes go a little bit under), and the server handles HTTP range requests, you may want to try that instead of requesting the entire file and then closing the socket early.
But, if you want to know how to close the socket early:
urllib2.urlopen
, requests
, and most other high-level libraries are designed around the idea that you're going to want to read the whole file. They buffer the data as it comes in, to give you a high-level file-like interface. On top of that, their API is blocking. Neither of that is what you want. You want to get the bytes as they come in, as fast as possible, and when you close the socket, you want that to be as soon after the recv
as possible.
So, you may want to consider using one of the Python wrappers around libcurl
, which gives you a pretty good balance between power/flexibility and ease-of-use. For example, with pycurl
:
import pycurl
buf = ''
def callback(newbuf):
global buf
buf += newbuf
if '<div style="float: right; margin-left: 8px;">' in buf:
return 0
return len(newbuf)
c = pycurl.Curl()
c.setopt(c.URL, 'http://curl.haxx.se/dev/')
c.setopt(c.WRITEFUNCTION, callback)
try:
c.perform()
except Exception as e:
print(e)
c.close()
print len(buf)
As it turns out, this ends up reading 12259/12259 bytes on that test. But if I change it to a string that comes in the first 2650 bytes, I only read 2650/12259 bytes. And if I fire up Wireshark and instrument recv
, I can see that, although the next packet did arrive at my NIC, I never actually read it; I closed the socket immediately after receiving 2650 bytes. So, that might save some time… although probably not too much. More importantly, though, if I throw it at a 13MB image file and try to stop after 1MB, I only receive a few KB extra, and most of the image hasn't even made it to my router yet (although it may have all left the server, if you care at all about being nice to the server), so that definitely will save some time.
Of course a typical forum page is a lot closer to 12KB than to 13MB. (This page, for example, was well under 48KB even after all my rambling.) But maybe you're dealing with atypical forums.
If the pages are really big, you may want to change the code to only check buf[-len(needle):] + newbuf
instead of the whole buffer each time. Even with a 13MB image, searching the whole thing over and over again didn't add much to the total runtime, but it did raise my CPU usage from 1% to 9%…
One last thing: If you're reading from, say, 500 pages, doing them concurrently—say, 8 at a time—is probably going to save you a lot more time than just canceling each one early. Both together might well be better than either on its own, so that's not an argument against doing this—it's just a suggestion to do that as well. (See the receiver-multi.py
sample if you want to let curl
handle the concurrency for you… or just use multiprocessing
or concurrent.futures
to use a pool of child processes.)