I'm thinking about implementing an Android application that will download data from a third-party website. The website contains multiple pages, each holding only about 200 bytes of useful data alongside around 20 KB of data I don't want. Is there any way to download only that part of a document, or to somehow filter the data to minimize the amount downloaded? Thanks in advance.
- See this SO post: http://stackoverflow.com/questions/3414438/java-resume-download-in-urlconnection – dacwe Feb 23 '12 at 10:24
- @dacwe I'm not talking about resuming a download; I want to download only part of an HTML document. – Egor Feb 23 '12 at 10:28
- The accepted answer to that other question tells you exactly how to download a specified range of a document, which is what you are asking for. – Graham Borland Feb 23 '12 at 10:31
2 Answers
You need the Range HTTP request header, which lets you specify the start and end byte offsets (inclusive) of the region of the resource you want.
Range: bytes=0-99
will retrieve the first 100 bytes, as the header specifies the region from the first byte (at offset zero) up to and including the 100th byte (at offset 99). Likewise
Range: bytes=0-0
will retrieve the first byte.
Get it working first with the BBC web site, which I know honours the Range header.
use strict;
use warnings;
use LWP;

# Request only the first fourteen bytes (offsets 0 through 13) of the page
my $ua = LWP::UserAgent->new;
my $req = HTTP::Request->new('GET', 'http://www.bbc.co.uk/');
$req->header('Range', 'bytes=0-13');

my $resp = $ua->request($req);
print $resp->decoded_content;
This returns the first fourteen bytes of the page, <!DOCTYPE html.
Then plug in your own site. If it still gives you the whole page then you're out of luck and you can't restrict what's returned, I'm afraid.
It wouldn't be fair to leave you with just a Perl version, so here's the Java equivalent:
import org.apache.http.HttpEntity;
import org.apache.http.HttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.DefaultHttpClient;
import org.apache.http.util.EntityUtils;

// Request only the first fourteen bytes (offsets 0 through 13) of the page
DefaultHttpClient client = new DefaultHttpClient();
HttpGet req = new HttpGet("http://www.bbc.co.uk/");
req.setHeader("Range", "bytes=0-13");
HttpResponse resp = client.execute(req);
HttpEntity ent = resp.getEntity();
String content = EntityUtils.toString(ent);
System.out.println(resp.getStatusLine());
System.out.println(ent.getContentLength());
System.out.println(content);
which outputs
HTTP/1.1 206 Partial Content
14
<!DOCTYPE html
showing that just 14 bytes have been read. Plug your URL into this and see if it behaves.
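If you'd rather check this in code than by eye: a server that honours Range replies 206 Partial Content, while one that ignores it replies 200 OK with the full document. Here is a minimal sketch of that check, reusing the resp object from the example above:

// Detect whether the server actually honoured the Range header
int status = resp.getStatusLine().getStatusCode();
if (status == org.apache.http.HttpStatus.SC_PARTIAL_CONTENT) {
    // 206: the server honoured Range, so 'content' holds only the slice
    System.out.println("Partial content returned");
} else {
    // Most likely 200 OK: the server ignored Range and sent the whole page
    System.out.println("Server ignored the Range header");
}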

- Yes, dacwe, but then there is no way to retrieve anything but the entire resource. – Borodin Feb 23 '12 at 10:46
- @Borodin I tried the following code: HttpGet getRequest = new HttpGet(url); getRequest.addHeader("Range", "bytes=0-199"); and it doesn't seem to work; it still downloads the whole page. – Egor Feb 23 '12 at 11:04
- You may have a site that ignores the Range header. I've updated my answer to give you something to test. – Borodin Feb 23 '12 at 14:04
If the sites are always very similar, you can use the skip(n) method of InputStream to skip n bytes.
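A minimal sketch of that approach; the URL and the OFFSET and LENGTH values here are hypothetical placeholders you would replace with the real position of the useful data on your pages:

import java.io.InputStream;
import java.net.URL;

public class SkipExample {
    public static void main(String[] args) throws Exception {
        final long OFFSET = 20000; // hypothetical start of the useful data
        final int LENGTH = 200;    // hypothetical length of the useful data
        URL url = new URL("http://www.example.com/page.html");
        InputStream in = url.openStream();
        try {
            // skip() may skip fewer bytes than asked, so loop until done
            long toSkip = OFFSET;
            while (toSkip > 0) {
                long skipped = in.skip(toSkip);
                if (skipped <= 0) break; // end of stream or no progress
                toSkip -= skipped;
            }
            byte[] buf = new byte[LENGTH];
            int read = in.read(buf); // may also return fewer bytes
            if (read > 0) System.out.println(new String(buf, 0, read, "UTF-8"));
        } finally {
            in.close();
        }
    }
}

Bear in mind that, unlike a Range request, this only skips the data after it arrives: the unwanted bytes are still transferred over the network, so it saves parsing work but not bandwidth.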
