I'm thinking about implementing an Android application that will download data from a third-party website. The website contains multiple pages, each holding only about 200 bytes of useful data alongside around 20 KB of data I don't want. Is there any way to download only that part of a document, or to somehow filter the data to minimize the amount downloaded? Thanks in advance.
- See this SO post: http://stackoverflow.com/questions/3414438/java-resume-download-in-urlconnection – dacwe Feb 23 '12 at 10:24
- @dacwe I'm not talking about resuming a download; I want to download only part of an HTML document. – Egor Feb 23 '12 at 10:28
- The accepted answer to that other question tells you exactly how to download a specified range of a document, which is what you are asking for. – Graham Borland Feb 23 '12 at 10:31
2 Answers
You need the Range HTTP request header, which lets you specify the start and end byte offsets (inclusive) of the region of the resource you want.
Range: bytes=0-99
will retrieve the first 100 bytes, as the header specifies the region from the first byte (at offset zero) up to and including the 100th byte (at offset 99). Likewise
Range: bytes=0-0
will retrieve the first byte.
Get it working first with the BBC web site, which I know honours the Range header.
use strict;
use warnings;
use LWP;

# Request only the first fourteen bytes (offsets 0 through 13) of the page
my $ua = LWP::UserAgent->new;
my $req = HTTP::Request->new('GET', 'http://www.bbc.co.uk/');
$req->header('Range', 'bytes=0-13');

my $resp = $ua->request($req);
print $resp->decoded_content;
This returns the first fourteen bytes of the page, <!DOCTYPE html.
Then plug in your own site. If it still gives you the whole page then you're out of luck and you can't restrict what's returned, I'm afraid.
It wouldn't be fair to leave you with just a Perl version, so here's the Java equivalent:
import org.apache.http.HttpEntity;
import org.apache.http.HttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.DefaultHttpClient;
import org.apache.http.util.EntityUtils;

// Request only the first fourteen bytes (offsets 0 through 13) of the page
DefaultHttpClient client = new DefaultHttpClient();
HttpGet req = new HttpGet("http://www.bbc.co.uk/");
req.setHeader("Range", "bytes=0-13");
HttpResponse resp = client.execute(req);
HttpEntity ent = resp.getEntity();
String content = EntityUtils.toString(ent);
System.out.println(resp.getStatusLine());
System.out.println(ent.getContentLength());
System.out.println(content);
which outputs
HTTP/1.1 206 Partial Content
14
<!DOCTYPE html
showing that just 14 bytes have been read. Plug your URL into this and see if it behaves.
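If you'd rather check this in code than by eye: a server that honours Range replies 206 Partial Content, while one that ignores it replies 200 OK with the full document. Here is a minimal sketch of that check, reusing the resp object from the example above:

// Detect whether the server actually honoured the Range header
int status = resp.getStatusLine().getStatusCode();
if (status == org.apache.http.HttpStatus.SC_PARTIAL_CONTENT) {
    // 206: the server honoured Range, so 'content' holds only the slice
    System.out.println("Partial content returned");
} else {
    // Most likely 200 OK: the server ignored Range and sent the whole page
    System.out.println("Server ignored the Range header");
}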

- Yes, dacwe, but then there is no way to retrieve anything but the entire resource. – Borodin Feb 23 '12 at 10:46
- @Borodin I tried the following code: HttpGet getRequest = new HttpGet(url); getRequest.addHeader("Range", "bytes=0-199"); and it doesn't seem to work; it still downloads the whole page. – Egor Feb 23 '12 at 11:04
- You may have a site that ignores the Range header. I've updated my answer to give you something to test. – Borodin Feb 23 '12 at 14:04
If the sites are always very similar, you can use the skip(n) method of InputStream to skip n bytes.
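A minimal sketch of that approach; the URL and the OFFSET and LENGTH values here are hypothetical placeholders you would replace with the real position of the useful data on your pages:

import java.io.InputStream;
import java.net.URL;

public class SkipExample {
    public static void main(String[] args) throws Exception {
        final long OFFSET = 20000; // hypothetical start of the useful data
        final int LENGTH = 200;    // hypothetical length of the useful data
        URL url = new URL("http://www.example.com/page.html");
        InputStream in = url.openStream();
        try {
            // skip() may skip fewer bytes than asked, so loop until done
            long toSkip = OFFSET;
            while (toSkip > 0) {
                long skipped = in.skip(toSkip);
                if (skipped <= 0) break; // end of stream or no progress
                toSkip -= skipped;
            }
            byte[] buf = new byte[LENGTH];
            int read = in.read(buf); // may also return fewer bytes
            if (read > 0) System.out.println(new String(buf, 0, read, "UTF-8"));
        } finally {
            in.close();
        }
    }
}

Bear in mind that, unlike a Range request, this only skips the data after it arrives: the unwanted bytes are still transferred over the network, so it saves parsing work but not bandwidth.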
