
Is it possible to fetch just the first part of a webpage, say the first 1 KB, using Python?

Thank you very much!

Jonathan Leffler
Leslie G
  • Can you give a little more detail? – Anirudh Ramanathan Jun 25 '12 at 04:03
  • I don't know of a specific library function that does this for you, but it sounds like what you want is simply to make an HTTP request, read the response as a stream, and stop reading after 1K bytes (see the sketch after these comments). – David Jun 25 '12 at 04:08
    see http://stackoverflow.com/questions/4362721/limiting-response-size-with-httplib2 – iMom0 Jun 25 '12 at 04:09
  • see also "chunked encoding" and this post using `urllib2` http://stackoverflow.com/questions/2028517/python-urllib2-progress-hook – snies Jun 25 '12 at 04:29
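
A minimal sketch of the stream-and-stop approach David describes above, using Python 3's built-in http.client (example.com stands in for the real host):

import http.client

# Issue the request by hand and read only the first kilobyte of the body.
conn = http.client.HTTPConnection('example.com')
conn.request('GET', '/')
response = conn.getresponse()
first_kb = response.read(1024)  # at most 1024 bytes; the remainder is never read
conn.close()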

3 Answers


The Requests library lets you iterate over the response as it comes in, so you could do something like this:

import requests

# stream=True defers downloading the body until it is actually read
response = requests.get('http://example.com/', stream=True)
beginning = next(response.iter_content(1024))  # at most the first 1 KB

If you just want the headers, you can always use the HTTP HEAD method:

req = requests.head('http://example.com')
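
The returned object carries the status line and headers but no body, so you might inspect it like this (a sketch; whether Content-Length is present depends on the server):

print(req.status_code)                    # e.g. 200
print(req.headers.get('Content-Length'))  # body size, if the server reports it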
Trevor
    This still would fetch the whole page though. Anyway, `requests` is a great library. – Torsten Engelbrecht Jun 25 '12 at 04:39
  • @Torsten looking through the source code, it appears to read off the socket only the chunk size requested. It's a long rabbit hole, though, and I'm not quite sure. It goes requests -> urllib3 -> httplib -> raw socket, and it looks like it's streaming all the way. – Trevor Jun 25 '12 at 05:07
  • @Trevor Thanks for the clarification; I guess my assumption was wrong. I had never used requests like this, so I assumed that doing just `requests.get('...')`, without chaining anything onto it, would download the whole response before `.iter_content` was applied. – Torsten Engelbrecht Jun 25 '12 at 07:24
  • @Torsten It appears that it waits till you try and access the `content` property (or other similar properties) and then uses `iter_content` internally to build up the full response and cache it. This is where I'm looking in the source: https://github.com/kennethreitz/requests/blob/develop/requests/models.py#L756 – Trevor Jun 25 '12 at 07:48

Here's an example using Python 3's urllib.request, which is built in.

import urllib.request

# read(1024) returns at most the first 1024 bytes of the body
first_kb = urllib.request.urlopen("http://example.com").read(1024)
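
Since the response object is also a context manager in Python 3, a variant that closes the connection promptly once the first kilobyte has been read:

import urllib.request

with urllib.request.urlopen("http://example.com") as response:
    first_kb = response.read(1024)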
Sean Johnson

Sure, with Python 2's urllib2:

>>> import urllib2
>>> len(urllib2.urlopen('http://google.com').read(1024))
1024
Roman Bodnarchuk