Limiting the text download content in pycurl

Question

I want to download site content using curl in Python (pycurl), but I don't want the whole text of each site, only some part of it. I want to reduce the time spent downloading the whole text. Thank you.

That's not how web requests work. You ask for a page, you get the page. — Amber, Jun 21 '11 at 06:49
@Kimvais in a general sense, yes, it is. There is some support for downloading certain byte offsets of files, but that's very rarely useful for selecting specific *text* - it's designed for breaking up downloads of files into chunks and/or resuming interrupted downloads. — Amber, Jun 21 '11 at 07:01

score 2 · Answer 1 · edited May 23 '17 at 12:20

You should set the relevant header (Range) in your HTTP request; see this question on how to do it with pycurl.

NOTE: This only works if:

You know the offset (in bytes) of the data you want within the response
The web server supports range requests
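
As a rough sketch of what such a request might look like with pycurl: the helper names `byte_range` and `fetch_range` are made up for illustration, and the URL in the usage comment is a placeholder. It assumes pycurl is installed and relies on the `pycurl.RANGE` option, which asks libcurl to send a `Range: bytes=...` header.

```python
from io import BytesIO


def byte_range(offset, length):
    """Format an inclusive byte range such as '100-149' (hypothetical helper)."""
    if offset < 0 or length <= 0:
        raise ValueError("offset must be >= 0 and length must be > 0")
    return f"{offset}-{offset + length - 1}"


def fetch_range(url, offset, length):
    """Request `length` bytes starting at `offset`; return (status, body)."""
    import pycurl  # imported here so byte_range() works without pycurl installed

    buffer = BytesIO()
    curl = pycurl.Curl()
    try:
        curl.setopt(pycurl.URL, url)
        # libcurl turns this into a "Range: bytes=<first>-<last>" request header
        curl.setopt(pycurl.RANGE, byte_range(offset, length))
        curl.setopt(pycurl.WRITEDATA, buffer)
        curl.perform()
        status = curl.getinfo(pycurl.RESPONSE_CODE)
    finally:
        curl.close()
    return status, buffer.getvalue()


# Usage (placeholder URL, not run here):
# status, body = fetch_range("http://example.com/", 0, 1024)
```

A 206 status means the server honoured the range. A 200 status means it ignored the header and sent the whole page, so nothing was saved.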

answered Jun 21 '11 at 07:02 by Kimvais; edited by Community

score 0 · Answer 2 · answered Jun 21 '11 at 07:14

Generally, the delay in loading a page is not in the actual download of the HTML; that is often quite quick, since HTML is nothing more than text. Unless there is a HUGE amount of actual text and markup on a page, you're not going to save much. Further, in order to get any of the actual content of the page, you'll need to download the entire <head> anyway...

Personally, I would approach this asynchronously. Twisted is one of the more common suggestions for this type of approach.
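
To illustrate the general idea of overlapping several downloads, here is a minimal sketch. Note that it uses a thread pool from the Python standard library rather than Twisted, and the names `fetch` and `fetch_all` are made up for this example. The `fetch_one` parameter is an assumption that lets you swap in a pycurl-based downloader.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen


def fetch(url, timeout=10):
    """Download one URL and return its body as bytes."""
    with urlopen(url, timeout=timeout) as response:
        return response.read()


def fetch_all(urls, fetch_one=fetch, max_workers=8):
    """Fetch every URL concurrently; return {url: body or exception}."""
    urls = list(urls)  # iterated twice below, so materialise it first

    def safe(url):
        try:
            return fetch_one(url)
        except Exception as exc:  # one failed page should not lose the rest
            return exc

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the same order as the input URLs
        return dict(zip(urls, pool.map(safe, urls)))


# Usage (placeholder URLs, not run here):
# pages = fetch_all(["http://example.com/", "http://example.org/"])
```

With this approach the total wait is closer to the slowest single page than to the sum of all pages, which tends to matter more than trimming bytes from each response.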
