Downloading a PDF file from a protected webpage

Question

I've been trying this for a couple of days now and I'm out of time, since the project is due tomorrow. Could someone help me out? I'm trying to download a PDF file from this link, which serves PDF content rather than an HTML page. I have tried using Jsoup, but Jsoup does not support pages that are served as PDF. This is the code I've been trying to use:

    System.out.println("opening connection");
    URL url = new URL("https://www.capitaliq.com/CIQDotNet/Filings/DocumentRedirector.axd?versionId=1257051021&type=pdf&forcedownload=false");

    // try-with-resources closes both streams even if the copy fails
    try (InputStream in = url.openStream();
         FileOutputStream fos = new FileOutputStream("/Users/HIDDEN/Desktop/fullreport.pdf")) {
        System.out.println("reading file...");
        byte[] buffer = new byte[1024]; // holds one chunk of data from the connection
        int length;
        while ((length = in.read(buffer)) > -1) {
            fos.write(buffer, 0, length);
        }
    }
    System.out.println("file was downloaded");

The problem with this code is that the request is automatically redirected to a login page where you have to type your username and password. So I need a way to log in to my account and then connect to the page without using Jsoup (as mentioned earlier, it is unable to read PDF content). If someone could look at the HTML of the login page and alter this code so that it logs in and then downloads the PDF, I would be eternally grateful. Thank you!

We are not here to bang out code for you. In short, you have to replicate whatever is happening in the browser. If it's a form-based login, you have to replicate that form submission, capture any relevant cookies/auth headers, and use those in the request to grab the PDF. — Marc B, Aug 11 '15 at 16:26
I know this. My question is how one is supposed to do this without Jaunt or Jsoup... — Serpemes, Aug 11 '15 at 16:33
I disagree, there are already APIs out there that will do this for you. Don't reinvent the wheel. — roundar, Aug 11 '15 at 16:36
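
For reference, the flow Marc B describes (submit the login form, keep the session cookies, then request the PDF) can be sketched with JDK classes only, so no Jaunt or Jsoup. This is a rough sketch under several assumptions: the login is a plain form POST, the session is cookie-based, and Java 10+ is available for the `URLEncoder` overload used here. The class name, the login URL, and the field names `username` and `password` are all hypothetical; the real ones have to be read from the login page's HTML. Hidden form fields or JavaScript-driven logins would need more work.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.CookieHandler;
import java.net.CookieManager;
import java.net.CookiePolicy;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;
import java.util.LinkedHashMap;
import java.util.Map;

class FormLoginDownload {

    /** Encodes form fields as application/x-www-form-urlencoded. */
    static String encodeForm(Map<String, String> fields) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> e : fields.entrySet()) {
            if (sb.length() > 0) {
                sb.append('&');
            }
            sb.append(URLEncoder.encode(e.getKey(), StandardCharsets.UTF_8))
              .append('=')
              .append(URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8));
        }
        return sb.toString();
    }

    /** POSTs the login form and returns the HTTP status code. */
    static int login(String loginUrl, Map<String, String> fields) throws IOException {
        byte[] body = encodeForm(fields).getBytes(StandardCharsets.UTF_8);
        HttpURLConnection conn = (HttpURLConnection) new URL(loginUrl).openConnection();
        conn.setRequestMethod("POST");
        conn.setDoOutput(true);
        conn.setRequestProperty("Content-Type", "application/x-www-form-urlencoded");
        try (OutputStream out = conn.getOutputStream()) {
            out.write(body);
        }
        // Reading the status completes the request; any Set-Cookie headers
        // are stored by the default CookieManager installed in main().
        return conn.getResponseCode();
    }

    /** GETs the file; stored cookies are attached to the request automatically. */
    static void download(String fileUrl, Path target) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(fileUrl).openConnection();
        try (InputStream in = conn.getInputStream()) {
            Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
        }
    }

    public static void main(String[] args) throws IOException {
        // One cookie store shared by every HttpURLConnection in this JVM.
        CookieHandler.setDefault(new CookieManager(null, CookiePolicy.ACCEPT_ALL));

        // Hypothetical field names and URLs; take the real ones from the login form.
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("username", "me@example.com");
        fields.put("password", "secret");

        login("https://example.com/login", fields);
        download("https://example.com/report.pdf", Paths.get("fullreport.pdf"));
    }
}
```

If the downloaded file turns out to be HTML rather than a PDF, the login most likely did not succeed and the server sent the login page again.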

score 0 · Answer 1 · edited May 23 '17 at 10:26

HtmlUnit is what I use for stuff like this, especially when speed is not critical.

Here's a random-ish piece of pseudocode from another one of my answers:

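    // userNameId, passId, submitBtnId and downloadBtn stand for the element
    // ids on the real pages
    WebClient wc = new WebClient(BrowserVersion.CHROME);

    HtmlPage p = wc.getPage(url);

    ((HtmlTextInput) p.getElementById(userNameId)).setText(userName);
    // assumes the password field is <input type="password">, which HtmlUnit
    // models as HtmlPasswordInput rather than HtmlTextInput
    ((HtmlPasswordInput) p.getElementById(passId)).setText(pass);

    // clicking the submit button returns the page that loads after login
    p = ((HtmlElement) p.getElementById(submitBtnId)).click();

    // Just as an example for something I've had to do, I use
    // UnexpectedPage when the "content-type" is "application/zip"
    UnexpectedPage up = ((HtmlElement) p.getElementById(downloadBtn)).click();

    InputStream in = up.getInputStream();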

...

Use another library for reading the PDF.
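
If the goal is only to save the file rather than parse it, the stream from `up.getInputStream()` can be written straight to disk. A minimal sketch, assuming Java 7+ for `java.nio.file.Files`; the class name `StreamSaver`, the temp file, and the stand-in bytes are made up for illustration, and in the real flow the stream would come from the `UnexpectedPage` above:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

class StreamSaver {

    /**
     * Copies every byte of the stream to target, replacing an existing file,
     * closes the stream, and returns the number of bytes written.
     */
    static long save(InputStream in, Path target) throws IOException {
        try (InputStream src = in) {
            return Files.copy(src, target, StandardCopyOption.REPLACE_EXISTING);
        }
    }

    public static void main(String[] args) throws IOException {
        // Stand-in bytes so the sketch runs without a network connection.
        byte[] fake = "%PDF-1.4 stand-in".getBytes(StandardCharsets.US_ASCII);
        Path target = Files.createTempFile("fullreport", ".pdf");
        long written = save(new ByteArrayInputStream(fake), target);
        System.out.println(written + " bytes written to " + target);
    }
}
```

`Files.copy` does the buffering itself, so the manual read/write loop from the question is not needed.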
