Why can a downloaded file get corrupted?

Question

I have been trying to download a PDF file from the following URL: http://pdfobject.com/markup/examples/full-browser-window.html

Josh M suggested the following solution, which works on his computer, but I cannot get it to work. The code does save a file to the destination, but the downloaded file is only 984 bytes (it should be about 18 KB), so it appears to be corrupted. I cannot think of any reason why this could happen.

import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.net.URLConnection;
import java.nio.file.Files;
import java.nio.file.StandardOpenOption;

public final class FileDownloader {

    private FileDownloader(){}

    public static void main(String args[]) throws IOException{
        download("http://pdfobject.com/markup/examples/full-browser-window.html", new File("C:\\Users\\Owner\\Desktop\\temporary\\myFile.pdf"));
        download2("http://pdfobject.com/markup/examples/full-browser-window.html", new File("C:\\Users\\Owner\\Desktop\\temporary\\myFile2.pdf"));
    }

    public static void download(final String url, final File destination) throws IOException {
        final URLConnection connection = new URL(url).openConnection();
        connection.setConnectTimeout(60000);
        connection.setReadTimeout(60000);
        connection.addRequestProperty("User-Agent", "Mozilla/5.0");
        final ByteArrayOutputStream baos = new ByteArrayOutputStream();
        final byte[] buffer = new byte[2048];
        int read;
        final InputStream input = connection.getInputStream();
        while((read = input.read(buffer)) > -1)
            baos.write(buffer, 0, read);
        baos.flush();
        // Default options create the file if it is missing and truncate any old content.
        Files.write(destination.toPath(), baos.toByteArray());
        input.close();
    }

    public static void download2(final String url, final File destination) throws IOException {
        final URLConnection connection = new URL(url).openConnection();
        connection.setConnectTimeout(60000);
        connection.setReadTimeout(60000);
        connection.addRequestProperty("User-Agent", "Mozilla/5.0");
        final FileOutputStream output = new FileOutputStream(destination, false);
        final byte[] buffer = new byte[2048];
        int read;
        final InputStream input = connection.getInputStream();
        while((read = input.read(buffer)) > -1)
            output.write(buffer, 0, read);
        output.flush();
        output.close();
        input.close();
    }
}

I can only say that if that didn't solve your problem, then you shouldn't have marked that post as an answer. – Luiggi Mendoza, Oct 11 '13 at 03:35
The `download()` version 1 with the `ByteArrayOutputStream` is pointless, just a complete waste of time and space. `download2()` should work perfectly, if no exceptions are thrown, although you don't need the `output.flush()` call. — user207421, Oct 11 '13 at 03:36
Well, in general that post answers that question. I mean, it could be helpful for other people. It does not work for me in particular, but I do not think there is a problem in the code, since it worked on his computer. I think there could be some other issue. – CHEBURASHKA, Oct 11 '13 at 03:37
You might be running into a caching problem at your ISP. Do you get the complete file if you paste the URL into your browser? – user207421, Oct 11 '13 at 03:38
I tried both `download` and `download2`, but the saved file is only 984 bytes, which shows that the files were not saved correctly. – CHEBURASHKA, Oct 11 '13 at 03:40
Why did you change the URL from http://pdfobject.com/pdf/sample.pdf? Also @EJP I know that the latter method is probably preferred, was just trying to show him that there are multiple ways of doing something. — Josh M, Oct 11 '13 at 03:41
I tried all combinations: `download2("http://pdfobject.com/markup/examples/full-browser-window.html", new File("C:\\Users\\Owner\\Desktop\\temporary\\myFile2.pdf"));`, then `...myFile2.html` and `myFile2`. None of them worked. – CHEBURASHKA, Oct 11 '13 at 03:43
@JoshM The latter method is not just 'probably preferred'. The former method just wastes time and space, and assumes the file fits into available memory. — user207421, Oct 11 '13 at 03:43
@CHEBURASHKA As posted in my previous answer, change the URL to http://pdfobject.com/pdf/sample.pdf. And @EJP, true :P My bad. – Josh M, Oct 11 '13 at 03:45
@LuiggiMendoza My answer is correct, he modified the URL causing it to break. :\ — Josh M, Oct 11 '13 at 03:51
@LuiggiMendoza @Josh M I apologize for the inconvenience. This is my fault: I confused the URLs. – CHEBURASHKA, Oct 11 '13 at 03:54
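
Following up on user207421's comments above (the `ByteArrayOutputStream` copy is unnecessary, and so is the explicit `flush()`), here is a minimal sketch of a streaming variant. It is not the code from the original answer: the class name `StreamingDownloader` is hypothetical, and the timeouts and `User-Agent` header are simply carried over from the question. It assumes Java 7 or later for `Files.copy` and try-with-resources.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.net.URLConnection;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper, not part of the original question or answer.
final class StreamingDownloader {

    private StreamingDownloader() {}

    // Streams the response body straight to disk and returns the number of bytes written.
    static long download(final URL url, final Path destination) throws IOException {
        final URLConnection connection = url.openConnection();
        connection.setConnectTimeout(60000);
        connection.setReadTimeout(60000);
        connection.addRequestProperty("User-Agent", "Mozilla/5.0");
        // try-with-resources closes the stream even if the copy throws.
        try (InputStream input = connection.getInputStream()) {
            // REPLACE_EXISTING overwrites a file left over from an earlier attempt.
            return Files.copy(input, destination, StandardCopyOption.REPLACE_EXISTING);
        }
    }
}
```

The returned byte count gives a quick sanity check: a result of 984 for a file expected to be about 18 KB suggests that something other than the PDF was fetched.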

score 3 · Accepted Answer · answered Oct 11 '13 at 03:42

You are downloading a .html URL, which serves an HTML page that references a PDF as an embedded object. Unlike a browser, Java does not process the page and fetch that object, so you are saving the HTML, not the PDF. Have a look inside the file. For reference, here it is:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
<title>Embedding a PDF using static HTML markup: Full-browser window (100% width/height)</title>

<!-- This example created for PDFObject.com by Philip Hutchison (www.pipwerks.com) -->

<style type="text/css">
<!--

html {
   height: 100%;
}

body {
   margin: 0;
   padding: 0;
   height: 100%;
}

p {
   padding: 1em;
}

object {
   display: block;
}

-->
</style>

</head>

<body>

<object data="/pdf/sample.pdf#toolbar=1&amp;navpanes=0&amp;scrollbar=1&amp;page=1&amp;view=FitH" 
        type="application/pdf" 
        width="100%" 
        height="100%">

<p>It appears you don't have a PDF plugin for this browser. No biggie... you can <a href="/pdf/sample.pdf">click here to download the PDF file.</a></p>

</object>

</body>
</html>
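
One way to catch this kind of mix-up early is to check the first bytes of whatever was downloaded. The sketch below assumes the common convention that a PDF file begins with the ASCII marker `%PDF-`, while the page above begins with `<!DOCTYPE`. The class name `PdfSniffer` is hypothetical, and some PDF readers tolerate a few junk bytes before the marker, which this simple check does not.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper: tells a saved PDF apart from a saved HTML page.
final class PdfSniffer {

    // The marker PDF files conventionally start with.
    private static final byte[] PDF_MAGIC = "%PDF-".getBytes(StandardCharsets.US_ASCII);

    private PdfSniffer() {}

    // Returns true only if the content starts with the "%PDF-" marker.
    static boolean looksLikePdf(final byte[] content) {
        if (content == null || content.length < PDF_MAGIC.length) {
            return false;
        }
        for (int i = 0; i < PDF_MAGIC.length; i++) {
            if (content[i] != PDF_MAGIC[i]) {
                return false;
            }
        }
        return true;
    }
}
```

Calling `PdfSniffer.looksLikePdf(Files.readAllBytes(destination.toPath()))` on the 984-byte file from the question would be expected to return `false`, since that file holds the HTML shown above.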

user207421

Should I change the file's extension? `...myFile2.xxx` ? – CHEBURASHKA Oct 11 '13 at 03:44
Questions that start 'so are you saying' almost invariably contain an invalid inference. I am saying what I said. To expand on what I did say, as opposed to your wild misinterpretation, you have no reason to expect downloading an HTML file containing a reference to a PDF file to yield the PDF file as output, only the HTML file. – user207421 Oct 11 '13 at 03:45
I am sorry. Thank you for the comments. Actually, I did look inside; I wanted to save it as a PDF. "Java does not process that..." is fine, so is there no solution for it? – CHEBURASHKA Oct 11 '13 at 03:49
The solution is to use a URL that delivers the actual PDF, i.e. in this case "http://pdfobject.com/pdf/sample.pdf#toolbar=1&navpanes=0&scrollbar=1&page=1&view=FitH". – user207421 Oct 11 '13 at 03:50
Use the URL `http://pdfobject.com/pdf/sample.pdf` to download the file. – Santosh Oct 11 '13 at 03:51
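
If the direct URL is not known in advance, it could in principle be read out of the HTML page. The following is only a rough sketch under stated assumptions: it uses a regular expression rather than a real HTML parser, it assumes the reference sits in a double-quoted `data` attribute of an `<object>` tag as in the page quoted in the answer, and the class name `EmbeddedPdfLocator` is hypothetical. The `#toolbar=...` fragment is dropped because a fragment is an instruction to the viewer and is not sent to the server.

```java
import java.net.URI;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: finds the PDF referenced by an <object> tag in an HTML page.
final class EmbeddedPdfLocator {

    // Captures the double-quoted data attribute of the first <object> tag.
    private static final Pattern OBJECT_DATA = Pattern.compile(
            "<object\\b[^>]*?\\bdata\\s*=\\s*\"([^\"]*)\"", Pattern.CASE_INSENSITIVE);

    private EmbeddedPdfLocator() {}

    // Returns the absolute URI of the embedded object, or empty if none is found.
    static Optional<URI> findPdfUri(final String html, final URI pageUri) {
        final Matcher matcher = OBJECT_DATA.matcher(html);
        if (!matcher.find()) {
            return Optional.empty();
        }
        // Undo the HTML escaping of ampersands inside the attribute value.
        String reference = matcher.group(1).replace("&amp;", "&");
        // Drop the fragment; it only carries viewer settings.
        final int fragmentStart = reference.indexOf('#');
        if (fragmentStart >= 0) {
            reference = reference.substring(0, fragmentStart);
        }
        // Resolve a relative reference such as /pdf/sample.pdf against the page URL.
        return Optional.of(pageUri.resolve(reference));
    }
}
```

For the page in the question, resolving `/pdf/sample.pdf` against `http://pdfobject.com/markup/examples/full-browser-window.html` yields `http://pdfobject.com/pdf/sample.pdf`, the URL recommended in the comments above.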
