
I'm looking to download the source code of a web page only up to a particular keyword. The pages are all from a forum, and I'm only interested in the first post's user details, so I only need to download the source until the first occurrence of "<!-- message, attachments, sig -->".

How to get webpage title without downloading all the page source

That question, although in a different language, is quite similar to what I'm trying to do, but I'm not experienced enough with Python to figure out how to translate that answer into Python.

  • until you find what in the source code? – Ionut Hulub Apr 05 '13 at 23:56
  • don't know what happened the editor parsed the html comment out of my question, I've fixed it now – puma pops Apr 06 '13 at 00:00
  • No matter what you do, you will receive the entire web page in one response. Load it into xml.dom.minidom (or something similar) to extract the portions you need. – Dave Lasley Apr 06 '13 at 00:03
  • Why are you trying to do this? If you're trying to download gigantic pages and save network bandwidth or server load by closing the socket as soon as you see that magic string… well, that's going to need something lower-level than the simple Python URL-opening functions, and may not save any bandwidth or server load anyway. If that's not your goal, please explain what the goal is. – abarnert Apr 06 '13 at 00:12
  • I'm not worried about the bandwidth, its more reducing the time its going to take to download each page source is the goal – puma pops Apr 06 '13 at 00:18
  • How big are these pages, and how long do they take to download? Before doing any work, just stick a typical request into a big string and write a trivial test program that creates a socket, connects, sends that string, and does a sequence of non-blocking `recv`s; then see how much you get back in each one. If you're usually getting back the entire page in the first read or two, then there's really no point in trying to stop reading early. – abarnert Apr 06 '13 at 01:02
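For reference, here is a rough sketch of the kind of test described in the last comment. The host and path are placeholders, and it uses select to wait for readable data rather than busy-looping on non-blocking `recv`s:

import select
import socket

host, path = 'forum.example.com', '/showthread.php?t=12345'   # placeholders
request = ('GET {path} HTTP/1.1\r\nHost: {host}\r\n'
           'Connection: close\r\n\r\n').format(path=path, host=host).encode('ascii')

s = socket.create_connection((host, 80))
s.sendall(request)

total = 0
while True:
    select.select([s], [], [])     # block until at least one chunk is readable
    chunk = s.recv(65536)          # read whatever has arrived so far
    if not chunk:                  # server closed the connection
        break
    total += len(chunk)
    print('recv returned {} bytes (running total {})'.format(len(chunk), total))
s.close()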

1 Answer

First, be aware that you may have already gotten all or most of each page into your OS buffers, NIC, router, or ISP before you cancel, so there may be no benefit at all to doing this. And there will be a cost—you can't reuse connections if you close them early; you have to recv smaller pieces at a time if you want to be able to cancel early; etc.

If you have a rough idea of how many bytes you probably need to read (better to often go a little bit over than to sometimes go a little bit under), and the server handles HTTP range requests, you may want to try that instead of requesting the entire file and then closing the socket early.
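For instance, here is a minimal sketch of a range request with the requests library. The URL and the 8KB figure are placeholders, and a server that ignores Range will just send the whole page (status 200 instead of 206):

import requests

url = 'http://forum.example.com/showthread.php?t=12345'        # placeholder URL
resp = requests.get(url, headers={'Range': 'bytes=0-8191'})    # ask for the first 8KB only

print(resp.status_code)   # 206 Partial Content if the server honored the range
marker = '<!-- message, attachments, sig -->'
if marker in resp.text:
    head = resp.text[:resp.text.index(marker)]   # everything before the first post body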

But, if you want to know how to close the socket early:

urllib2.urlopen, requests, and most other high-level libraries are designed around the idea that you're going to want to read the whole file. They buffer the data as it comes in to give you a high-level file-like interface. On top of that, their API is blocking. Neither of those is what you want. You want to get the bytes as they come in, as fast as possible, and when you close the socket, you want that to happen as soon after the recv as possible.

So, you may want to consider using one of the Python wrappers around libcurl, which gives you a pretty good balance between power/flexibility and ease-of-use. For example, with pycurl:

import pycurl

buf = b''

def callback(newbuf):
    # Accumulate the response as it arrives; abort once the marker shows up.
    global buf
    buf += newbuf
    if b'<div style="float: right; margin-left: 8px;">' in buf:
        return 0               # returning anything but len(newbuf) aborts the transfer
    return len(newbuf)

c = pycurl.Curl()
c.setopt(c.URL, 'http://curl.haxx.se/dev/')
c.setopt(c.WRITEFUNCTION, callback)    # called once per chunk of received data
try:
    c.perform()
except pycurl.error as e:              # aborting from the callback raises this
    print(e)
c.close()

print(len(buf))

As it turns out, this ends up reading 12259/12259 bytes on that test. But if I change it to a string that comes in the first 2650 bytes, I only read 2650/12259 bytes. And if I fire up Wireshark and instrument recv, I can see that, although the next packet did arrive at my NIC, I never actually read it; I closed the socket immediately after receiving 2650 bytes. So, that might save some time… although probably not too much. More importantly, though, if I throw it at a 13MB image file and try to stop after 1MB, I only receive a few KB extra, and most of the image hasn't even made it to my router yet (although it may have all left the server, if you care at all about being nice to the server), so that definitely will save some time.

Of course a typical forum page is a lot closer to 12KB than to 13MB. (This page, for example, was well under 48KB even after all my rambling.) But maybe you're dealing with atypical forums.

If the pages are really big, you may want to change the code to only check `buf[-len(needle):] + newbuf` instead of the whole buffer each time. Even with a 13MB image, searching the whole thing over and over again didn't add much to the total runtime, but it did raise my CPU usage from 1% to 9%…
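In callback form, that check might look something like this, using the marker from the question as the needle; a match that straddles a chunk boundary can involve at most the last len(needle) bytes of the existing buffer:

needle = b'<!-- message, attachments, sig -->'
buf = b''

def callback(newbuf):
    global buf
    # Only search the tail of the old buffer plus the new chunk; any match
    # spanning the chunk boundary fits entirely inside this window.
    window = buf[-len(needle):] + newbuf
    buf += newbuf
    if needle in window:
        return 0               # abort the transfer
    return len(newbuf)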

One last thing: If you're reading from, say, 500 pages, doing them concurrently—say, 8 at a time—is probably going to save you a lot more time than just canceling each one early. Both together might well be better than either on its own, so that's not an argument against doing this—it's just a suggestion to do that as well. (See the receiver-multi.py sample if you want to let curl handle the concurrency for you… or just use multiprocessing or concurrent.futures to use a pool of child processes.)
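For example, here is a rough sketch of the pool approach with concurrent.futures. fetch_first_post is a stand-in for whatever per-URL function you end up with (e.g., the pycurl code above wrapped up to return buf), and it uses a thread pool, which works just as well as child processes for network-bound downloads:

import concurrent.futures

def fetch_first_post(url):
    ...   # download until the marker is found and return the partial source

urls = ['http://forum.example.com/showthread.php?t={}'.format(i)   # placeholder URLs
        for i in range(1, 501)]

with concurrent.futures.ThreadPoolExecutor(max_workers=8) as pool:
    futures = {pool.submit(fetch_first_post, url): url for url in urls}
    for future in concurrent.futures.as_completed(futures):
        url = futures[future]
        try:
            page_head = future.result()
            # do something with page_head here
        except Exception as e:
            print('{} failed: {}'.format(url, e))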

abarnert