
Is it possible to fetch just the first part of a webpage, say the first 1 KB, using Python?

Thank you very much!

Jonathan Leffler
Leslie G
  • Can you give a little more detail? – Anirudh Ramanathan Jun 25 '12 at 04:03
  • I don't know of a specific library function that does this for you, but it sounds like what you want is simply to make an HTTP request, read the response as a stream, and stop reading after 1K bytes (see the sketch after these comments). – David Jun 25 '12 at 04:08
    see http://stackoverflow.com/questions/4362721/limiting-response-size-with-httplib2 – iMom0 Jun 25 '12 at 04:09
  • see also "chunked encoding" and this post using `urllib2` http://stackoverflow.com/questions/2028517/python-urllib2-progress-hook – snies Jun 25 '12 at 04:29
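
A minimal sketch of the stream-and-stop approach David describes above, using Python 3's built-in http.client (example.com stands in for the real host):

import http.client

# Issue the request by hand and read only the first kilobyte of the body.
conn = http.client.HTTPConnection('example.com')
conn.request('GET', '/')
response = conn.getresponse()
first_kb = response.read(1024)  # at most 1024 bytes; the remainder is never read
conn.close()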

3 Answers


The Requests library lets you iterate over the response as it comes in, so you could do something like this:

import requests

# stream=True defers downloading the body until it is actually read
response = requests.get('http://example.com/', stream=True)
beginning = next(response.iter_content(1024))  # at most the first 1 KB

If you just want the headers, you can always use the HTTP HEAD method:

req = requests.head('http://example.com')
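
The returned object carries the status line and headers but no body, so you might inspect it like this (a sketch; whether Content-Length is present depends on the server):

print(req.status_code)                    # e.g. 200
print(req.headers.get('Content-Length'))  # body size, if the server reports it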
Trevor
    This still would fetch the whole page though. Anyway, `requests` is a great library. – Torsten Engelbrecht Jun 25 '12 at 04:39
  • @Torsten looking through the source code, it appears to read off the socket only the chunk size requested. It's a long rabbit hole, though, and I'm not quite sure. It goes requests -> urllib3 -> httplib -> raw socket, and it looks like it's streaming all the way. – Trevor Jun 25 '12 at 05:07
  • @Trevor Thanks for the clarification; I guess my assumption was wrong. I had never used requests like this, so I assumed that doing just `requests.get('...')`, without chaining anything onto it, would download the whole response before `.iter_content` was applied. – Torsten Engelbrecht Jun 25 '12 at 07:24
  • @Torsten It appears that it waits till you try and access the `content` property (or other similar properties) and then uses `iter_content` internally to build up the full response and cache it. This is where I'm looking in the source: https://github.com/kennethreitz/requests/blob/develop/requests/models.py#L756 – Trevor Jun 25 '12 at 07:48

Here's an example using Python 3's urllib.request, which is built in.

import urllib.request

# read(1024) returns at most the first 1024 bytes of the body
first_kb = urllib.request.urlopen("http://example.com").read(1024)
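
Since the response object is also a context manager in Python 3, a variant that closes the connection promptly once the first kilobyte has been read:

import urllib.request

with urllib.request.urlopen("http://example.com") as response:
    first_kb = response.read(1024)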
Sean Johnson

Sure, with Python 2's urllib2:

>>> import urllib2
>>> len(urllib2.urlopen('http://google.com').read(1024))
1024
Roman Bodnarchuk