I want to download site content using curl in Python (pycurl), but I don't want the whole text of each site, just some part of it. I want to reduce the time spent downloading the whole text. Thank you.
1
-
That's not how web requests work. You ask for a page, you get the page. – Amber Jun 21 '11 at 06:49
-
@Amber no, that's not how they work. – Kimvais Jun 21 '11 at 06:58
-
@Kimvais in a general sense, yes, it is. There is some support for downloading certain byte offsets of files, but that's very rarely useful for selecting specific *text* - it's designed for breaking up downloads of files into chunks and/or resuming interrupted downloads. – Amber Jun 21 '11 at 07:01
2 Answers
2
You should set the Range header in your HTTP request; see this question on how to do it with pycurl.
NOTE: This only works if:
- You know the byte offset of the data you want within the response
- The web server supports range requests
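As a rough illustration of the range request described above, here is a minimal sketch using pycurl's `RANGE` option. The helper names `byte_range` and `fetch_range` are made up for this example, and it assumes you already know the offset and length you want.

```python
from io import BytesIO


def byte_range(start, length):
    """Format a curl range string covering `length` bytes from `start`."""
    if start < 0 or length < 1:
        raise ValueError("start must be >= 0 and length >= 1")
    # HTTP byte ranges are inclusive on both ends, hence the -1.
    return f"{start}-{start + length - 1}"


def fetch_range(url, start, length):
    """Request only part of a resource; returns (status_code, body_bytes)."""
    import pycurl  # imported here so byte_range works without pycurl installed

    buf = BytesIO()
    c = pycurl.Curl()
    try:
        c.setopt(pycurl.URL, url)
        # Makes curl send a "Range: bytes=<start>-<end>" request header.
        c.setopt(pycurl.RANGE, byte_range(start, length))
        c.setopt(pycurl.WRITEDATA, buf)
        c.perform()
        status = c.getinfo(pycurl.RESPONSE_CODE)
    finally:
        c.close()
    # 206 means the server honored the range; 200 means it ignored the
    # header and sent the whole body anyway.
    return status, buf.getvalue()
```

Calling `fetch_range(url, 0, 1024)` would ask for the first kilobyte only. Check the returned status: if it is 200 rather than 206, the server does not support ranges and nothing was saved.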
0
The delay in loading a page is generally not in the actual download of the HTML; that's often quite quick, as HTML is nothing more than text. Unless there is a HUGE amount of text and markup on a page, you're not going to save much. Further, in order to get any of the actual content of the page, you'll need to download the entire <head> anyway...
Personally, I would approach this asynchronously. Twisted is one of the more common suggestions for this type of approach.
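To give a feel for the idea without pulling in Twisted, here is a minimal sketch that swaps in the standard library instead: `asyncio` runs ordinary blocking downloads in worker threads so that several pages are fetched at once. It is not Twisted's API; the names `fetch` and `fetch_all` are made up for this example, and `asyncio.to_thread` needs Python 3.9 or later.

```python
import asyncio
from urllib.request import urlopen


def fetch(url, timeout=10):
    """Blocking download of one page; returns the raw body bytes."""
    with urlopen(url, timeout=timeout) as resp:
        return resp.read()


async def fetch_all(urls, fetch_one=fetch):
    """Run the blocking fetches concurrently in worker threads.

    Results come back in the same order as `urls`. A failed download
    shows up as an exception object in the list instead of aborting
    the other downloads.
    """
    tasks = [asyncio.to_thread(fetch_one, u) for u in urls]
    return await asyncio.gather(*tasks, return_exceptions=True)
```

With a list of addresses in `urls`, `asyncio.run(fetch_all(urls))` returns their bodies. The total wait is then roughly that of the slowest page rather than the sum of all of them, which is usually a bigger saving than trimming each individual download.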

cwallenpoole