How to download a pdf file programmatically from a webpage with .html extension?

Question

I have reviewed all the similar questions on this forum (not only this one) and tried all of the methods they suggest, but I still cannot programmatically download a test file: http://pdfobject.com/markup/examples/full-browser-window.html

This is the direct link to the test file that I am trying to download. It is an open-access test PDF, so anybody can use it to test a download method.

How can I download this particular file so that it has a pdf extension?

You can do what those answers say. Show us what you tried and why it failed. — Sotirios Delimanolis, Oct 11 '13 at 02:29
What specific problem are you having? It sounds like you're just saying "It didn't work" - are you getting errors? crashes? something else? — Krease, Oct 11 '13 at 02:30
Thank you for replies. I tried all methods that I found, but there is always an error, for example `contentLenght = -1`. I will update my question with the code of one of my tries, however it will take a lot of space. Here is the update (see above) — CHEBURASHKA, Oct 11 '13 at 02:33
You download it as if you were downloading any other type of file, the extension is hardly relevant. — Josh M, Oct 11 '13 at 02:33
I tried the very common method that works for anything but pdf: `org.apache.commons.io.FileUtils.copyURLToFile(driver.getCurrentUrl(), "C:\\Users\...........myfile.pdf");` ... got an exception — CHEBURASHKA, Oct 11 '13 at 02:35
First of all, why are you using a `WebDriver`? Then, do you get an exception? — Sotirios Delimanolis, Oct 11 '13 at 02:35
Because I am using Selenium ... I tagged my question with `selenium` — CHEBURASHKA, Oct 11 '13 at 02:36
Yes I did. I tried ... I do not know why although the file gets saved, it is damaged ... and cannot be opened — CHEBURASHKA, Oct 11 '13 at 02:51
Your code doesn't bear any resemblance to any of the numerous existing correct answers to this question. — user207421, Oct 11 '13 at 03:09
you did also see [this question](http://stackoverflow.com/questions/19059769/how-to-save-a-pdf-from-a-browser) correct? It seems to be an exact duplicate. — ddavison, Oct 14 '13 at 22:04

Josh M · Accepted Answer · 2013-10-11T03:54:59.197

4

For downloading a file, perhaps you could try something like this:

import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.URL;
import java.net.URLConnection;

public final class FileDownloader {

    private FileDownloader() {}

    public static void main(String[] args) throws IOException {
        // Point at the PDF itself, not at the HTML page that embeds it.
        download("http://pdfobject.com/pdf/sample.pdf", new File("sample.pdf"));
    }

    public static void download(final String url, final File destination) throws IOException {
        final URLConnection connection = new URL(url).openConnection();
        connection.setConnectTimeout(60000); // milliseconds
        connection.setReadTimeout(60000);    // milliseconds
        // Some servers reject requests that carry Java's default User-Agent.
        connection.addRequestProperty("User-Agent", "Mozilla/5.0");
        // try-with-resources closes both streams even if the copy fails part-way.
        // append = false overwrites an existing file instead of appending to it.
        try (InputStream input = connection.getInputStream();
             OutputStream output = new FileOutputStream(destination, false)) {
            final byte[] buffer = new byte[2048];
            int read;
            while ((read = input.read(buffer)) > -1) {
                output.write(buffer, 0, read);
            }
        }
    }
}
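If the saved file will not open, one quick check is whether it is a PDF at all. A well-formed PDF starts with the bytes `%PDF-`, so a file that starts with `<!DOCTYPE` or `<html` is most likely the HTML page that embeds the PDF rather than the PDF itself. The class below is only a sketch of that check; `PdfSignatureCheck` and `looksLikePdf` are names made up for this example, and `sample.pdf` is assumed to be the file written by `download()`.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;

final class PdfSignatureCheck {

    // A well-formed PDF begins with the ASCII bytes "%PDF-".
    private static final byte[] PDF_MAGIC = "%PDF-".getBytes(StandardCharsets.US_ASCII);

    private PdfSignatureCheck() {}

    /** Returns true if the given leading bytes start with the PDF signature. */
    static boolean looksLikePdf(byte[] head) {
        if (head == null || head.length < PDF_MAGIC.length) {
            return false;
        }
        return Arrays.equals(Arrays.copyOf(head, PDF_MAGIC.length), PDF_MAGIC);
    }

    /** Reads the first few bytes of a file and checks them for the PDF signature. */
    static boolean looksLikePdf(Path file) throws IOException {
        try (InputStream in = Files.newInputStream(file)) {
            byte[] head = new byte[PDF_MAGIC.length];
            int total = 0;
            // read() may return fewer bytes than requested, so loop until full or EOF.
            while (total < head.length) {
                int n = in.read(head, total, head.length - total);
                if (n < 0) {
                    break;
                }
                total += n;
            }
            return looksLikePdf(Arrays.copyOf(head, total));
        }
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical default path: the file written by download() above.
        Path file = Paths.get(args.length > 0 ? args[0] : "sample.pdf");
        System.out.println(file + (looksLikePdf(file)
                ? " starts with %PDF-"
                : " does not start with %PDF- (possibly an HTML page)"));
    }
}
```

Running it against the downloaded file should tell you whether the problem is the download URL or the PDF reader.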

edited Oct 11 '13 at 03:54

answered Oct 11 '13 at 02:41


+1 Thank you very much ... I did not know about this method, however it gives the same type of error that the other methods do: `although the file gets saved, it is damaged and could not be opened` – CHEBURASHKA Oct 11 '13 at 02:50
That's weird, because it works for me: http://puu.sh/4N0S6.png Perhaps your PDF reader is corrupt. – Josh M Oct 11 '13 at 02:52
You are most probably right ... would you kindly include the `import ...` lines ... maybe I have confused something. Thanks – CHEBURASHKA Oct 11 '13 at 02:57
This is weird ... I thought that I got the wrong imports ... but on the contrary my imports are OK ... and Adobe reader is also fine ... I do not know why it wouldn't open it – CHEBURASHKA Oct 11 '13 at 03:03
@CHEBURASHKA Tested both methods again, both worked for me. Just a note, if you are using `download()`, it might append the bytes, try using `download2()` – Josh M Oct 11 '13 at 03:05
Thank you very much ... I believe you that the method works ... I will try to do something with my Adobe Reader ... it is the only possible bug. THANK YOU!!! – CHEBURASHKA Oct 11 '13 at 03:07
@CHEBURASHKA See edit, `download()` now works if the file already exists because now it overwrites the bytes, as opposed to appending them. – Josh M Oct 11 '13 at 03:10
Just to repeat my comment elsewhere that the version using a `ByteArrayOutputStream` is a pointless waste of time and space, and assumes that the entire file fits into memory. – user207421 Oct 11 '13 at 03:53
@EJP Thanks, removed the bad method :P – Josh M Oct 11 '13 at 03:55
Josh, I apologize for the trouble ... this is my lack of knowledge. This answer is great. If it is no trouble, would you mind explaining how you figured out that the link is `http://pdfobject.com/pdf/sample.pdf`? The original link that I provided did not even contain the word *sample*: `http://pdfobject.com/markup/examples/full-browser-window.html` – CHEBURASHKA Oct 11 '13 at 04:00
When you inspect the element of http://pdfobject.com/markup/examples/full-browser-window.html you will notice that you get something like http://puu.sh/4N3Bm.png. I believe you could figure out the rest on your own (as it is fairly straightforward after looking at the screenshot) – Josh M Oct 11 '13 at 04:03
Thanks a lot!!! I apologize again for any inconvenience. +100 bounty in 2 days :) – CHEBURASHKA Oct 11 '13 at 04:37
what a block! http://stackoverflow.com/questions/19059769/how-to-save-a-pdf-from-a-browser/19060116#19060116 ! `:D` – ddavison Oct 14 '13 at 22:02
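The "inspect the element" step from the comments can also be done in code. The sketch below assumes the page embeds the PDF through an `<object data="...">`, `<embed src="...">`, or `<iframe src="...">` tag whose value ends in `.pdf`; the real markup of the test page was only shown in a screenshot, so the HTML string in `main` is a hypothetical stand-in. `EmbeddedPdfLocator` and `findEmbeddedPdf` are made-up names, and a regular expression is a crude substitute for a real HTML parser.

```java
import java.net.URI;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class EmbeddedPdfLocator {

    // Captures the data attribute of <object>, or the src attribute of
    // <embed>/<iframe>, when the value ends in .pdf (optionally followed
    // by a query string or fragment).
    private static final Pattern EMBED = Pattern.compile(
            "<(?:object\\b[^>]*?\\bdata|(?:embed|iframe)\\b[^>]*?\\bsrc)"
                    + "\\s*=\\s*[\"']([^\"'>]+?\\.pdf(?:[?#][^\"'>]*)?)[\"']",
            Pattern.CASE_INSENSITIVE);

    private EmbeddedPdfLocator() {}

    /**
     * Finds the first embedded PDF reference in the HTML and resolves it
     * against the URL of the page, so relative paths become absolute URLs.
     */
    static Optional<URI> findEmbeddedPdf(String pageUrl, String html) {
        Matcher matcher = EMBED.matcher(html);
        if (!matcher.find()) {
            return Optional.empty();
        }
        return Optional.of(URI.create(pageUrl).resolve(matcher.group(1)));
    }

    public static void main(String[] args) {
        // Hypothetical markup, assumed to resemble what the page contains.
        String html = "<html><body>"
                + "<object data=\"/pdf/sample.pdf\" type=\"application/pdf\"></object>"
                + "</body></html>";
        String pageUrl = "http://pdfobject.com/markup/examples/full-browser-window.html";
        System.out.println(findEmbeddedPdf(pageUrl, html)
                .map(URI::toString)
                .orElse("no embedded PDF found"));
    }
}
```

The resolved URL is what you would then pass to `download()`. With Selenium, the HTML could come from `driver.getPageSource()` and the page URL from `driver.getCurrentUrl()`.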

score 1 · Answer 2 · edited May 23 '17 at 12:24

Here is a shorter solution using a library called jsoup, which BalusC often uses in his answers.

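import org.jsoup.Connection.Response;
import org.jsoup.Jsoup;
import java.io.File;
import java.io.FileOutputStream;

// Get the response. ignoreContentType(true) lets jsoup accept non-HTML content,
// and maxBodySize(0) lifts jsoup's default cap on the response body size.
Response response = Jsoup.connect(location).ignoreContentType(true).maxBodySize(0).execute();

// Save the file. try-with-resources closes the stream even if the write fails.
try (FileOutputStream out = new FileOutputStream(new File(outputFolder + name))) {
    out.write(response.bodyAsBytes());
}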

As you may have guessed by now, `response.bodyAsBytes()` is where the PDF is. You can download any binary file with this piece of code, as long as the whole file fits in memory.
