How to save a .pdf from a browser?

Question

I tried to save a .pdf file using different methods I found on Stack Overflow, including FileUtils from Commons IO. However, the saved file was always damaged. When I opened the damaged file in Notepad, I saw the following:

<HEAD>

    <TITLE>
        09010b129fasdf558a-
    </TITLE>

</HEAD>


<HTML>

<SCRIPT language="javascript" src="./js/windowClose.js"></SCRIPT>

<LINK href="./theme/default.css" rel="stylesheet" type="text/css">
<LINK href="./theme/additions.css" rel="stylesheet" type="text/css">

<BODY leftmargin="0" topmargin="0">

<TABLE cellpadding="0" cellspacing="0" width="100%">
    <TR>
        <TD class="mainSectionHeader">
            <A href="javascript:windowClose()" class="allLinks">
                CLOSE
            </A>
        </TD>

    </TR>

</TABLE>

                <script language='javaScript'>
                    alert('Session timed out. Please login again.\n');
                    window.close();
                </script>



</BODY>

</HTML>

Later, I tried to save the .pdf file from the browser using the answer provided by @BalusC. This solution is very helpful: I was able to get rid of the session issues. However, it still produces a damaged .pdf. When I open the file in Notepad, its content is now completely different, and the session timeout message is gone:

<HTML>

    <HEAD>

        <TITLE>
            Evidence System
        </TITLE>

    </HEAD>

<LINK href="./theme/default.css" rel="stylesheet" type="text/css">

<TABLE cellpadding="0" cellspacing="0" class="tableWidth760" align="center">
    <TR>
        <TD class="headerTextCtr">
            Evidence System
        </TD>
    </TR>
    <TR>
        <TD colspan="2">
            <HR size="1" noshade>
        </TD>
    </TR>
    <TR>
        <TD colspan="2">



<HTML>
<HEAD>
<link href="./theme/default.css" rel="stylesheet" type="text/css">
<script language="JavaScript">

function trim(str)
{
    var trmd_str

    if(str != "")
    {
        trmd_str = str.replace(/\s*/, "")
        if (trmd_str != ""){

            trmd_str = trmd_str.replace(/\s*$/, "")
        }

    }else{
        trmd_str = str
    }
    return trmd_str
}  

function validate(frm){
    //check for User name 
    var msg="";
    if(trim(frm.userName.value)==""){
        msg += "Please enter your user id.\n";
        frm.userName.focus();
    }

    if(trim(frm.password.value)==""){
        msg += "Please enter your password.\n";
        frm.userName.focus();
    }

    if (trim(msg)==""){
        frm.submit();
    }else{
        alert(msg);
    }
}

function numCheck(event,frm){
    if( event.keyCode == 13){
            validate(frm);  
    }
}

</script>
</HEAD>

<BODY onLoad="document.frmLogin.userName.focus();">

<FORM name='frmLogin' method='post' action='./ServletVerify'>
    <TABLE width="100%" cellspacing="20">
        <tr>
            <td class="mainTextRt">
                Username
                <input type="text" name="userName" maxlength="32" tabindex="1" value="" 
                onKeyPress="numCheck(event,this.form)" class="formTextField120">
            </TD>
            <td class="mainTextLt">
                Password
                <input type="password" name="password" maxlength="32" tabindex="2" value="" 
                onKeyPress="numCheck(event,this.form)" class="formTextField120">
            </TD>
        </TR>

        <tr>                    
            <td colspan="2" class="mainTextCtr" style="color:red">
                Unknown Error
            </td>
        </tr>

        <tr>
            <td colspan="2" class="mainTextCtr">
                <input type="button" tabindex="3" value="Submit" onclick="validate(this.form)" >
            </TD>
        </TR>
    </TABLE>

    <INPUT TYPE="hidden" NAME="actionFlag" VALUE="inbox">
</FORM>

</BODY>
</HTML>

        </TD>
    </TR>
    <TR>
        <TD height="2"></TD>
    </TR>
    <TR>
        <TD colspan="2">
            <HR size="1" noshade>
        </TD>
    </TR>
    <TR>
        <TD colspan="2">
            <LINK href="./theme/default.css" rel="stylesheet" type="text/css">

<TABLE width="80%" align="center" cellspacing="0" cellpadding="0">
    <TR>
        <TD class="footerSubtext">
            Evidence Management System
        </TD>
    </TR>

    <!-- For development builds, change the date accordingly when sending EAR files out to Wal-Mart -->
    <TR>
        <TD class="footerSubtext">
            Build:&nbsp;&nbsp;v3.1
        </TD>
    </TR>

</TABLE>
        </TD>
    </TR>
</TABLE>

</HTML>

What other options do I have?

PS: When I save the file manually using Ctrl+Shift+S, it is saved correctly.

Edited my answer, Buras. I think the problem might be the location: it seems to be saving an HTML file rather than the PDF. See the **EDIT** section, which would explain the "this is not a binary file" error as well as the reader thinking the file is "damaged" — ddavison, Sep 30 '13 at 15:00

score 3 · Answer 1 · edited May 23 '17 at 12:13

A PDF is a binary file, and it gets corrupted because of the way copyURLToFile() works. By the way, this looks like a duplicate of "JAVA - Download Binary File (e.g. PDF) file from Webserver".

Try this custom binary download method:

import java.io.BufferedInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.net.URLConnection;

public void downloadBinaryFile(String path) throws IOException {
    URL u = new URL(path);
    URLConnection uc = u.openConnection();
    String contentType = uc.getContentType();
    int contentLength = uc.getContentLength();
    // Reject text responses (e.g. an HTML error page) and responses of unknown length.
    if (contentType == null || contentType.startsWith("text/") || contentLength == -1) {
      throw new IOException("This is not a binary file.");
    }

    // Read exactly contentLength bytes; a single read() may return fewer than requested.
    byte[] data = new byte[contentLength];
    int offset = 0;
    try (InputStream in = new BufferedInputStream(uc.getInputStream())) {
      while (offset < contentLength) {
        int bytesRead = in.read(data, offset, data.length - offset);
        if (bytesRead == -1)
          break;
        offset += bytesRead;
      }
    }

    if (offset != contentLength) {
      throw new IOException("Only read " + offset + " bytes; Expected " + contentLength + " bytes");
    }

    // Use the last path segment of the URL as the local file name.
    String file = u.getFile();
    String filename = file.substring(file.lastIndexOf('/') + 1);
    try (FileOutputStream out = new FileOutputStream(filename)) {
      out.write(data);
    }
}

EDIT: It actually sounds as if you are not on the page that you think you are. Instead of using driver.getCurrentUrl(), have your script take the URL from the link to the PDF. Assuming there is a link like <a href='http://mysite.com/my.pdf' />, don't click it and then read the current URL; just take the href from that link and download it.

String pdfPath = driver.findElement(By.id("someId")).getAttribute("href");
downloadBinaryFile(pdfPath);

It appears that `contentLength == -1 ...=> throw new IOException("This is not a binary file.");` — Buras, Sep 27 '13 at 21:54
Thank You for the update. I have double checked using `system.out.print` : **the driver is on the right URL**. Also, the direct referral to the `href` throws `java.net.MalformedURLException: unknown protocol: javascript` because the `href="javascript: viewDocument ('0901asdasd09309d', '093094defkjhsdf', '23423432', 'General', '-pleasae select', 'search', '')"` — Buras, Sep 30 '13 at 15:47
if the href is javascript, then you need to go into the `viewDocument` method and pull the URL and do some string manipulation to fetch your url. If you give me a direct URL, I can solve this right now — ddavison, Sep 30 '13 at 17:11
The website i am using is provided by a third party vendor to our company. I can provide the URL, however, it is not going to work anywhere unless one is connected to our company's network with the company's laptop. I think that the website itself is problematic. — Buras, Sep 30 '13 at 17:36
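
The "string manipulation" suggested in the comments above could start by pulling the quoted arguments out of the `javascript:` href. The following is only a sketch: the class and method names are made up for illustration, and how the extracted arguments map onto the real document URL is defined inside the site's `viewDocument` function, which is not shown in this thread.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Selenium or of the site in question.
class JavascriptHrefSketch {

    // Matches one single-quoted argument and captures the text between the quotes.
    private static final Pattern QUOTED = Pattern.compile("'([^']*)'");

    /** Returns the single-quoted arguments of a javascript: href, in order. */
    static List<String> extractArguments(String href) {
        List<String> arguments = new ArrayList<>();
        Matcher matcher = QUOTED.matcher(href);
        while (matcher.find()) {
            arguments.add(matcher.group(1));
        }
        return arguments;
    }

    public static void main(String[] args) {
        // The href quoted in the comment above.
        String href = "javascript: viewDocument ('0901asdasd09309d', '093094defkjhsdf', "
                + "'23423432', 'General', '-pleasae select', 'search', '')";
        for (String argument : extractArguments(href)) {
            System.out.println("[" + argument + "]");
        }
    }
}
```

With the arguments in hand, the remaining step is to read `viewDocument` in the page's JavaScript and rebuild the URL it opens.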

score 3 · Accepted Answer · answered Sep 30 '13 at 15:43

From the erroneous response, which appears to be just an HTML error page:

alert('Session timed out. Please login again.\n');

It thus appears that the PDF download must take place within a valid HTTP session. The HTTP session is backed by a cookie. On the server side, the session in turn usually holds information about the currently active and/or logged-in user.

The Selenium web driver manages cookies all by itself fully transparently. You can obtain them programmatically as follows:

Set<Cookie> cookies = driver.manage().getCookies();

When you use java.net.URL yourself, outside Selenium's control, you must make sure that the URL connection uses the same cookies (and thus maintains the same HTTP session). You can set cookies on the URL connection as follows:

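URLConnection connection = new URL(driver.getCurrentUrl()).openConnection();

// Send all cookies in a single Cookie header ("a=1; b=2"), as a browser would.
StringBuilder cookieHeader = new StringBuilder();
for (Cookie cookie : driver.manage().getCookies()) {
    if (cookieHeader.length() > 0) {
        cookieHeader.append("; ");
    }
    cookieHeader.append(cookie.getName()).append('=').append(cookie.getValue());
}
connection.setRequestProperty("Cookie", cookieHeader.toString());

InputStream input = connection.getInputStream(); // Write this to file.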
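
The "write this to file" step can be done with the standard library alone. Below is a minimal sketch, assuming Java 7 or later; the class name, the method name `saveTo` and the temporary file are made up for illustration, and the in-memory stream merely stands in for `connection.getInputStream()`.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper for illustration only.
class StreamToFileSketch {

    /** Copies every byte of the stream to target, closes the stream, and returns the byte count. */
    static long saveTo(InputStream input, Path target) throws IOException {
        try (InputStream in = input) {
            // Files.copy streams the bytes unchanged, so binary content is not altered.
            return Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
        }
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the real response body.
        byte[] fakeBody = "%PDF-1.4 fake body".getBytes(StandardCharsets.US_ASCII);
        Path target = Files.createTempFile("download", ".pdf");
        long written = saveTo(new ByteArrayInputStream(fakeBody), target);
        System.out.println(written + " bytes written to " + target);
    }
}
```

Because the copy does not depend on a Content-Length header, it also works when the server does not announce the response size.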


BalusC


Thanks a lot. I actually did have issues with the cookies. Now when I save the .pdf it is still damaged, but at least there are no session issues. I have posted the updated version of the damaged file as it looks when opened in Notepad – Buras Sep 30 '13 at 16:42
Does `driver.getCurrentUrl()` return the URL of the PDF file or of the webpage itself? I already found it suspicious, but specifying the wrong URL is such an obvious mistake that I just ignored it. – BalusC Sep 30 '13 at 16:43
Yes, it does. I have doublechecked it with the system.out.print – Buras Sep 30 '13 at 16:44
Well, apart from the wrong URL, I'm not seeing any probable cause in the information provided so far. By the way, it is not so nice to edit and change the question in such way that the answer becomes completely confusing and useless. I was referring the session timeout message in your question which is now nowhere visible. Any future reader who didn't read the initial version of your question wouldn't have any idea what I was talking about. This is not appreciated. If you don't fix the question, I will delete this answer. – BalusC Sep 30 '13 at 16:45
I am sorry, I will fix it back – Buras Sep 30 '13 at 16:48
On further inspection, it appears that an HTML page representing a login form is being returned. Apparently tracking/maintaining the logged-in user has somehow failed, in spite of the session being maintained. This is not something which I (or anyone else) can answer without knowing the URL and the login credentials of the website in question. Is that website maintained by you or someone else? It might be worth the effort to ask the developer responsible for that website for the procedure to programmatically download a PDF file from there. Perhaps you need additional request parameters or so. – BalusC Sep 30 '13 at 17:04
The website i am using is provided by a third party vendor to our company. I can provide the URL, however, it is not going to work anywhere unless one is connected to our company's network with the company's laptop. I think that the website itself is problematic. – Buras Sep 30 '13 at 17:36
As a first step, I'd open a normal web browser, press F12 (Chrome/Firefox>=23/IE>=9) and keep track of the HTTP traffic (HTTP request params, headers, cookies, post data, etc.) during the entire process of the path to downloading the PDF file and then recreate exactly that HTTP traffic using `URLConnection`. Surely it must be possible programmatically (you see, a web browser is by itself also just a piece of software, perhaps it's after all just your lack of basic knowledge of HTTP; the whole session cookie problem is evidence therefor). – BalusC Sep 30 '13 at 17:38
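
While recreating the traffic, a quick way to tell a real document from yet another HTML page is to look at the first bytes of the response: a well-formed PDF normally begins with the ASCII signature `%PDF-`. The following is a small sketch of that check; the class and method names are made up for illustration.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical diagnostic helper for illustration only.
class PdfSignatureSketch {

    // ASCII signature found at the start of a well-formed PDF file.
    private static final byte[] PDF_MAGIC = "%PDF-".getBytes(StandardCharsets.US_ASCII);

    /** Returns true if the given bytes start with the signature "%PDF-". */
    static boolean looksLikePdf(byte[] head) {
        if (head.length < PDF_MAGIC.length) {
            return false;
        }
        for (int i = 0; i < PDF_MAGIC.length; i++) {
            if (head[i] != PDF_MAGIC[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] pdfStart = "%PDF-1.4\n".getBytes(StandardCharsets.US_ASCII);
        byte[] htmlStart = "<HTML>\n<HEAD>".getBytes(StandardCharsets.US_ASCII);
        System.out.println(looksLikePdf(pdfStart));   // prints true
        System.out.println(looksLikePdf(htmlStart));  // prints false
    }
}
```

If the check fails, printing the first few hundred bytes as text usually shows which page the server sent instead, as the login form in the question did.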

score 2 · Answer 3 · edited May 23 '17 at 11:50

The server may be compressing the PDF. You can use this code, adapted from this answer, to detect and decompress the response from the server:

// Requires java.io.*, java.net.URL, java.util.zip.GZIPInputStream
// and org.apache.commons.io.IOUtils (Apache Commons IO).
InputStream is = new URL(driver.getCurrentUrl()).openStream();
try {
   InputStream decoded = decompressStream(is);
   FileOutputStream output = new FileOutputStream(
       new File("C:\\Users\\myDocs\\myfolder\\myFile.pdf"));
   try {
       IOUtils.copy(decoded, output);
   }
   finally {
       output.close();
   }
} finally {
   is.close();
}

public static InputStream decompressStream(InputStream input) throws IOException {
     PushbackInputStream pb = new PushbackInputStream( input, 2 ); // we need a pushback stream to look ahead
     byte [] signature = new byte[2];
     int len = pb.read( signature ); // read the signature
     if( len > 0 )
       pb.unread( signature, 0, len ); // push back only the bytes that were actually read
     if( len == 2 && signature[ 0 ] == (byte) 0x1f && signature[ 1 ] == (byte) 0x8b ) // check whether it matches the standard gzip magic number
       return new GZIPInputStream( pb );
     else
       return pb;
}

Thank You a lot. It does save the file to the destination, however, as i try to open the file, `Adobe` says that the file is damaged. — Buras, Sep 30 '13 at 13:53

score 1 · Answer 4 · edited May 23 '17 at 12:29

When I try to save the file manually using CTRL+Shift+S , the file gets saved OK.

While I advocate using Java to download the file, there is a not-so-recommended workaround that presses Ctrl+Shift+S programmatically: the Robot class.

It sucks to use a workaround, but it works reliably as far as I can tell in the browsers and OSes I tried. This code should not get into any serious application. But it's OK for tests if you won't be able to solve your issue the right way.

Robot robot = new Robot();

Press Ctrl+Shift+S

robot.keyPress(KeyEvent.VK_CONTROL);
robot.keyPress(KeyEvent.VK_SHIFT);
robot.keyPress(KeyEvent.VK_S);
robot.keyRelease(KeyEvent.VK_S);
robot.keyRelease(KeyEvent.VK_SHIFT);
robot.keyRelease(KeyEvent.VK_CONTROL);

In the browsers and OSes I know, the focus should now be in the File name input of the Save file dialog. You can type in your absolute path:

robot.keyPress(KeyEvent.VK_C);        // C
robot.keyRelease(KeyEvent.VK_C);
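// On many platforms keyPress(VK_COLON) throws IllegalArgumentException because ':'
// has no key of its own; on a US layout it is typed as Shift + ';'.
robot.keyPress(KeyEvent.VK_SHIFT);
robot.keyPress(KeyEvent.VK_SEMICOLON);    // : (colon)
robot.keyRelease(KeyEvent.VK_SEMICOLON);
robot.keyRelease(KeyEvent.VK_SHIFT);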
robot.keyPress(KeyEvent.VK_SLASH);    // / (slash)
robot.keyRelease(KeyEvent.VK_SLASH);
// etc. for the whole file path

robot.keyPress(KeyEvent.VK_ENTER);    // confirm by pressing Enter in the end
robot.keyRelease(KeyEvent.VK_ENTER);

To get the keycodes, you can use KeyEvent#getExtendedKeyCodeForChar() (Java 7+ only), or see the questions "How can I make Robot type a `:`?" and "Convert String to KeyEvents".
