3

We only want a particular element from the HTML document at nytimes.com/technology. This page contains many articles, but we only want each article's title, which is in an HTML heading element. If we use wget, cURL, any other tool, or a package like requests in Python, the whole HTML document is returned. Can we limit the returned data to a specific element, such as those headings?

AndriuZ
Sravan
  • Although it is not what you are looking for, you may want to look at the following question: http://stackoverflow.com/questions/1538952/retrieve-partial-web-page. What you are looking for may not be possible, because you need to be able to parse the DOM to access its elements, and without the whole document, parsing will be very hard. – reader_1000 Sep 26 '11 at 14:11

3 Answers

4

The HTTP protocol knows nothing about HTML or the DOM. Using HTTP you can fetch partial documents from web servers that support it by sending a Range header, but you'll need to know the byte offsets of the data you want.

The short answer is that the web service itself must support what you're requesting. It is not something that can be provided at the HTTP layer.
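
As an illustration, here is a minimal sketch of a byte-range request in Python with the requests package, assuming the server honors Range requests (the offsets here are arbitrary):

    import requests

    # Request only the first 500 bytes of the document. A server that honors
    # byte ranges replies with 206 Partial Content and a Content-Range header;
    # one that ignores them replies with 200 and the full body.
    resp = requests.get(
        "https://www.nytimes.com/technology",
        headers={"Range": "bytes=0-499"},
    )

    print(resp.status_code)                    # 206 if the range was honored
    print(resp.headers.get("Content-Range"))   # e.g. "bytes 0-499/123456"
    print(len(resp.content))                   # 500 if partial content was returned

Note that this still gives you raw bytes, not DOM elements; without knowing the byte offsets of the markup you want, it does not solve the original problem.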

Rob Napier
  • Thank you so much! Could you please also mention how we can get this done if we do know the offset? – Sravan Sep 26 '11 at 15:37
  • The Apache docs include a lot of examples of how to set the headers: http://labs.apache.org/webarch/http/draft-fielding-http/p5-range.html. This blog post includes a good example in PHP and curl: http://www.ankur.com/blog/106/php/resume-http-downloads-php-curl-fsockopen/ – Rob Napier Sep 26 '11 at 16:42
1

If you specifically want to process parts of the HTML document at the NY Times URL you give, you are probably going about it the wrong way. If you just want a list of the articles, by headline for instance, then what you want is the web feed. In this case, the Times publishes an RSS feed for that very category of articles. Note that if you hit this page with a browser, the browser will recognize it as a feed and handle it at a higher level, i.e. ask if you want to subscribe to the feed. But you can hit it with curl and see an unparsed stream of XML. Each item in the feed represents an article and contains metadata such as a URL to the full article, the title, etc.

Also note that there are probably web-feed-specific packages for whatever language platform you are using that will give you high-level access to the feed data. This will allow you to write code like:

foreach ( article in feed )
    title = article.getTitle();

rather than parsing the XML yourself.
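
For example, a minimal sketch in Python using the third-party feedparser package; the feed URL below is an assumption about where the Technology feed lives, so check the site for the current location:

    import feedparser  # third-party: pip install feedparser

    # Assumed feed URL for the Technology section; feed locations change over time.
    feed = feedparser.parse("https://rss.nytimes.com/services/xml/rss/nyt/Technology.xml")

    # Each entry represents one article with its metadata already parsed out.
    for article in feed.entries:
        print(article.title)  # the headline
        print(article.link)   # URL to the full article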

chad
  • Yes, the question is ambiguous; thanks for pointing that out. I meant that we know exactly where the element sits in the DOM. I edited the question to reflect that. And the environment I intended is a command-line client or any package in any programming language. – Sravan Sep 26 '11 at 15:34
0

Yes, cURL has the ability to download only the HTTP headers and not the rest of the content. Use the -I switch to issue an HTTP HEAD request.

From the man page:

-I, --head

(HTTP/FTP/FILE) Fetch the HTTP-header only! HTTP-servers feature the command HEAD which this uses to get nothing but the header of a document. When used on a FTP or FILE file, curl displays the file size and last modification time only.
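
For comparison, the same HEAD request expressed as a sketch in Python with the requests package (the URL is the one from the question):

    import requests

    # A HEAD request returns only the response headers, never a body.
    # Follow redirects so we see the headers of the final document.
    resp = requests.head("https://www.nytimes.com/technology", allow_redirects=True)

    print(resp.headers.get("Content-Type"))    # e.g. "text/html; charset=utf-8"
    print(resp.headers.get("Last-Modified"))   # may be absent for dynamic pages
    print(len(resp.text))                      # 0: HEAD responses carry no body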
cdeszaq
  • 1
    I believe the OP is using "headers" here to refer to

    headers inside of the HTML. If you look at the example he's given (nytimes.com/technology), that's how they present their headlines.

    – Rob Napier Sep 26 '11 at 14:12
  • @RobNapier - Ahh, I see. My bad. I got put on the wrong track when I saw `wget` and `cURL`. That makes the stuff about the DOM make a bit more sense now. – cdeszaq Sep 26 '11 at 14:17