This code downloads data from a URL, treating it as binary content:
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.URL;

public class Download {

    private static void download(URL input, File output)
            throws IOException {
        InputStream in = input.openStream();
        try {
            OutputStream out = new FileOutputStream(output);
            try {
                copy(in, out);
            } finally {
                out.close();
            }
        } finally {
            in.close();
        }
    }

    // Copies the stream in fixed-size chunks until EOF (read() returns -1).
    private static void copy(InputStream in, OutputStream out)
            throws IOException {
        byte[] buffer = new byte[1024];
        while (true) {
            int readCount = in.read(buffer);
            if (readCount == -1) {
                break;
            }
            out.write(buffer, 0, readCount);
        }
    }

    public static void main(String[] args) {
        try {
            URL url = new URL("http://stackoverflow.com");
            File file = new File("data");
            download(url, file);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
The downside of this approach is that it ignores any metadata, such as the Content-Type header, which you would get by using HttpURLConnection instead (or a more sophisticated API, such as Apache HttpClient).
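For instance, here is one way you could get at that header: open the connection via HttpURLConnection and read the Content-Type, extracting the charset parameter if one is present (the helper method and the fallback charset below are illustrative choices, not part of the code above):

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

public class ContentTypeProbe {

    // Extracts the "charset" parameter from a Content-Type header value
    // such as "text/html; charset=UTF-8"; returns the fallback when absent.
    public static String charsetFrom(String contentType, String fallback) {
        if (contentType == null) {
            return fallback;
        }
        for (String part : contentType.split(";")) {
            String p = part.trim();
            if (p.toLowerCase().startsWith("charset=")) {
                return p.substring("charset=".length()).trim();
            }
        }
        return fallback;
    }

    // Sketch: unlike URL.openStream(), HttpURLConnection exposes the
    // response headers. (Performs a real network request; the URL in
    // main() is just an example.)
    public static String probeCharset(URL url) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        try {
            return charsetFrom(conn.getContentType(), "ISO-8859-1");
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) {
        // Demonstrate the header parsing without going on the network:
        System.out.println(charsetFrom("text/html; charset=UTF-8", "ISO-8859-1"));
        System.out.println(charsetFrom("text/html", "ISO-8859-1"));
    }
}
```

Knowing the charset matters as soon as you stop treating the payload as opaque bytes and start decoding it as text.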
In order to parse the HTML data, you'll either need a specialized HTML parser that can handle poorly formed markup, or you'll need to tidy it first and then parse it with an XML parser.
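To illustrate the second route: once the markup has been tidied into well-formed XML, the standard javax.xml DOM parser can read it. A minimal sketch (the well-formed snippet here stands in for tidied output; real HTML straight off the wire will usually not parse this way):

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class TidyThenParse {

    // Parses well-formed (already tidied) markup with the JDK's DOM
    // parser and pulls out the text of the first <title> element.
    public static String titleOf(String wellFormedHtml) throws Exception {
        DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(
                new InputSource(new StringReader(wellFormedHtml)));
        return doc.getElementsByTagName("title").item(0).getTextContent();
    }

    public static void main(String[] args) throws Exception {
        String html = "<html><head><title>Hello</title></head><body/></html>";
        System.out.println(titleOf(html));
    }
}
```

If you'd rather skip the tidying step, a lenient HTML parser such as jsoup handles malformed markup directly.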