1

I want to download site content using curl in Python (pycurl), but I don't want the whole text of each site, just some part of it. The goal is to cut the time spent downloading the full page. Thank you.

manofsins
  • That's not how web requests work. You ask for a page, you get the page. – Amber Jun 21 '11 at 06:49
  • @Amber no, that's not how they work. – Kimvais Jun 21 '11 at 06:58
  • @Kimvais in a general sense, yes, it is. There is some support for downloading certain byte offsets of files, but that's very rarely useful for selecting specific *text* - it's designed for breaking up downloads of files into chunks and/or resuming interrupted downloads. – Amber Jun 21 '11 at 07:01

2 Answers

2

You should set the relevant header (Range) in your HTTP request; see this question on how to do it with pycurl.

NOTE: This only works if:

  1. You know the offset (in bytes) at which the data you want starts in the response
  2. The web server supports range requests
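A minimal sketch of such a range request with pycurl, assuming you already know the byte offset and length you need. The helper names byte_range and fetch_range are made up for illustration; pycurl.RANGE maps to libcurl's CURLOPT_RANGE option, which sends the Range header for you.

```python
from io import BytesIO


def byte_range(offset, length):
    """Build the 'first-last' range string (both byte positions inclusive)."""
    if offset < 0 or length <= 0:
        raise ValueError("offset must be >= 0 and length must be > 0")
    return "%d-%d" % (offset, offset + length - 1)


def fetch_range(url, offset, length):
    """Request `length` bytes starting at `offset`; return (status, body)."""
    import pycurl  # imported here so byte_range() is usable without pycurl

    buf = BytesIO()
    c = pycurl.Curl()
    try:
        c.setopt(pycurl.URL, url)
        # libcurl turns this into the header "Range: bytes=<first>-<last>"
        c.setopt(pycurl.RANGE, byte_range(offset, length))
        c.setopt(pycurl.WRITEDATA, buf)
        c.perform()
        status = c.getinfo(pycurl.RESPONSE_CODE)
    finally:
        c.close()
    # 206 means the server honoured the range; 200 means it ignored the
    # header and sent the whole document.
    return status, buf.getvalue()
```

Checking the status code matters: a server that does not support ranges will usually answer 200 with the full page, so nothing is saved in that case.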
Kimvais
0

The delay in loading a page is generally not in the actual download of the HTML; that is often quite quick, since HTML is nothing more than Unicode text. Unless there is a HUGE amount of actual text and markup on a page, you're not going to save much. Further, in order to get any of the actual content of the page, you'll need to download the entire <head> anyway...

Personally, I would approach this asynchronously. Twisted is one of the more common suggestions for this type of approach.
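To illustrate the idea of overlapping downloads rather than shortening each one, here is a rough sketch. It uses the standard library's asyncio with worker threads instead of Twisted, purely to keep the example dependency-free; fetch and fetch_all are hypothetical names, and the URLs in the usage line are placeholders.

```python
import asyncio
from urllib.request import urlopen


def fetch(url):
    """Blocking download of one page; returns the raw bytes."""
    with urlopen(url, timeout=10) as resp:
        return resp.read()


async def fetch_all(urls, fetcher=fetch):
    """Run the blocking fetcher for every URL concurrently in worker threads."""
    loop = asyncio.get_running_loop()
    tasks = [loop.run_in_executor(None, fetcher, u) for u in urls]
    # gather() returns results in the same order as `urls`
    return await asyncio.gather(*tasks)


# Usage (placeholder URLs):
# pages = asyncio.run(fetch_all(["http://example.com/a", "http://example.com/b"]))
```

With this shape the total time tends toward that of the slowest page rather than the sum of all pages, which is usually a bigger win than trimming bytes from each response.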

cwallenpoole