8

I need to download a PDF file from a web server to my PC and save it locally.

I used HttpClient to connect to the web server and read the response body:

    HttpEntity entity = response.getEntity();
    InputStream in = entity.getContent();

    String stream = CharStreams.toString(new InputStreamReader(in));
    int size = stream.length();
    System.out.println("stringa html page LENGTH:" + stream.length());
    System.out.println(stream);
    SaveToFile(stream);

Then I save the content to a file:

    // check CRLF (I don't know if I need to do this)
    String[] fix = stream.split("\r\n");

    File file = new File("C:\\Users\\augusto\\Desktop\\progetti web\\test\\test2.pdf");
    PrintWriter out = new PrintWriter(new FileWriter(file));
    for (int i = 0; i < fix.length; i++) {
        out.print(fix[i]);
        out.print("\n");
    }
    out.close();

I also tried writing the String content to a file directly:

    OutputStream out = new FileOutputStream("pathPdfFile");
    out.write(stream.getBytes());
    out.close();

But the result is always the same: I can open the PDF file, but I see only blank pages. Is the mistake related to the charset encoding of the PDF content between stream and endstream? Does the content between stream and endstream need to be handled in some other way?


I hope this helps avoid any misunderstanding about what I want to do:

This is my login (works perfectly):

    public static void postForm() {
        String cookie = "";
        try {
            System.out.println("POSTFORM ###################################");
            String postURL = "http://login.libero.it/logincheck.php";
            HttpPost post = new HttpPost(postURL);
            post.setHeader("User-Agent", "Chrome/14.0.835.202");
            post.setHeader("Referer", "http://login.libero.it/?layout=m&service_id=m_mail&ret_url=http://m.mailbeta.libero.it/m/wmm/auth/check");
            if (cookieVector.size() > 0) {
                for (int i = 0; i < cookieVector.size(); i++) {
                    cookie = cookie + cookieVector.elementAt(i).toString().replace("Set-Cookie:", "") + ";";
                }
                post.setHeader("Cookie", cookie);
            }
            //System.out.println("sequenza cookie post:"+cookie);
            List<NameValuePair> params = new ArrayList<NameValuePair>();
            params.add(new BasicNameValuePair("SERVICE_ID", "m_mail"));
            params.add(new BasicNameValuePair("LAYOUT", "m"));
            params.add(new BasicNameValuePair("DEVICE", ""));
            params.add(new BasicNameValuePair("RET_URL", "http://m.mailbeta.libero.it/m/wmm/auth/check"));
            params.add(new BasicNameValuePair("LOGINID", "secret"));
            params.add(new BasicNameValuePair("PASSWORD", "secret"));
            UrlEncodedFormEntity ent = new UrlEncodedFormEntity(params, HTTP.UTF_8);
            System.out.println("stringa urlPost:" + ent.toString());
            post.setEntity(ent);
            HttpResponse responsePOST = client.execute(post);
            System.out.println("Response postForm: " + responsePOST.getStatusLine());
            Header[] allHeaders = responsePOST.getAllHeaders();

            String location = "";
            for (Header header : allHeaders) {
                if ("location".equalsIgnoreCase(header.getName())) location = header.getValue();
                responsePOST.addHeader(header.getName(), header.getValue());
            }
            cookieVector.clear();
            Header[] headerx = responsePOST.getHeaders("Set-Cookie");
            System.out.println("array header:" + headerx.length);
            for (int i = 0; i < headerx.length; i++) {
                System.out.println("restituito cookie POST:" + headerx[i].getValue());
                cookieVector.add(headerx[i]);
                //System.out.println("cookie trovato POST:"+cookieVector.elementAt(i));
            }
            //System.out.println("inseriti"+cookieVector.size()+""+"elements");
            //HttpEntity resEntity = responsePOST.getEntity();

            // populate redirect information in response
            // check the login outcome
            if (location.contains("https://login.libero.it/logincheck.php")) {
                loginError = 1;
            }
            System.out.println("Redirecting to: " + location);
            //EntityUtils.consume(resEntity);
            responsePOST.getEntity().consumeContent();
            System.out.println("torno a GET:" + "url:" + location + "cookieVector size:" + cookieVector.size());
            get(location, "http://login.libero.it/logincheck.php");
        } catch (IOException ex) {
            Logger.getLogger(LiberoLoginNew.class.getName()).log(Level.SEVERE, null, ex);
        }
    }

Once logged in, I can access the file's link (PDF, image, DOC, etc.). Take a PDF file as an example:

    public static void httpConnection(String url, String referer, String cookieAuth) {
        try {
            String location = "";
            String cookie = "";
            HttpResponse response;
            HttpGet get;
            HttpEntity respEntity;
            Referer = referer;
            System.out.println("HTTPCONNECTION ################################");
            System.out.println("connessione a:" + url + "............");

            get = new HttpGet(url);
            if (referer.length() > 0) {
                //httpget.setHeader("Referer",referer );
            }
            if (attachmentURL.size() == 0) {
                get.setHeader("User-Agent", "Chrome/14.0.835.202");
            } else {
                get.setHeader("Accept-charset", "UTF-8");
                get.setHeader("Content-type", "application/pdf");
            }
            if (cookieVector.size() > 0) {
                System.out.println("iserisco cookie da vector");
                for (int i = 0; i < cookieVector.size(); i++) {
                    cookie = cookie + cookieVector.elementAt(i).toString().replace("Set-Cookie:", "") + ";";
                }
                get.setHeader("Cookie", cookie);
            } else if (cookieAuth.length() > 0) {
                System.out.println("inserisco cookieAuth....");
                System.out.println("valore cookieSession:" + cookieAuth);
                get.setHeader("Cookie", cookieAuth.replace("Set-Cookie:", "") + ";");
            }

            response = client.execute(get);
            cookieVector.clear(); // reset cookie

            System.out.println("home get: " + response.getStatusLine());

            Header[] headery = response.getAllHeaders();
            for (int j = 0; j < headery.length; j++) {
                System.out.println(headery[j].getName() + " " + " VALUE:" + " " + headery[j].getValue());
            }
            Header[] headerx = response.getHeaders("Set-Cookie");
            System.out.println("array header:" + headerx.length);
            System.out.print("httpconnection SERVER HEADERS ###############");
            for (int i = 0; i < headerx.length; i++) {
                if ("location".equalsIgnoreCase(headerx[i].getName())) {
                    location = headerx[i].getValue();
                    //ResponseGET.addHeader(headerx[i].getName(), header.getValue());
                }
                //System.out.println(headerx[i].getValue());
                cookieVector.add(headerx[i]);
            }

            // STREAM CONTENT BODY
            HttpEntity entity2 = response.getEntity();
            InputStream in = entity2.getContent(); // <== THIS IS HOW I GET THE RESPONSE STREAM

            if (attachmentURL.size() > 0) {
                saveAttachment(in); // SAVE FILE <==
            } else {
                // Parse and grab: message title, subject, attachments. If attachments are
                // found, then come back here and execute the method saveAttachment.
                from(in, htmlpage);
                in.close();
            }
        } catch (IOException ex) {
            Logger.getLogger(LiberoLoginNew.class.getName()).log(Level.SEVERE, null, ex);
        }
    }

The httpConnection method works and I can download the file!

Server Response:

    Date  VALUE: Fri, 18 Nov 2011 13:09:46 GMT
    Server  VALUE: Apache/2.2.21 (Unix) mod_jk/1.2.23
    Set-Cookie  VALUE: MST_PVP=tiQZO3nbl9_5f_OQXtJP32YiqQx_5f_kSh6F6Io7r3xS; Domain=m.libero.it; Path=/
    Content-Type  VALUE: application/octet-stream
    Expires  VALUE: Fri, 18 Nov 2011 15:09:46 GMT
    Transfer-Encoding  VALUE: chunked

Example of response body:

 %PDF-1.7

 1 0 obj  % entry point
 <<
/Type /Catalog
/Pages 2 0 R

>>
endobj

 2 0 obj
 <<
 /Type /Pages
 /MediaBox [ 0 0 200 200 ]
 /Count 1
 /Kids [ 3 0 R ]
 >>
  endobj

  3 0 obj
  <<
 /Type /Page
 /Parent 2 0 R
 /Resources <<
  /Font <<
  /F1 4 0 R 
>>
>>
/Contents 5 0 R
>>
endobj

4 0 obj
<<
/Type /Font
/Subtype /Type1
/BaseFont /Times-Roman
>>
endobj

5 0 obj  % page content
<<
 /Length 44
 >>
 stream
  BT
  70 50 TD
 /F1 12 Tf
 (Hello, world!) Tj
  ET
  endstream
  endobj

  xref
  0 6
 0000000000 65535 f 
 0000000010 00000 n 
 0000000079 00000 n 
 0000000173 00000 n 
 0000000301 00000 n 
0000000380 00000 n 
trailer
<<
/Size 6
/Root 1 0 R
 >>
 startxref
 492
 %%EOF

Now, let's start from here. Can you please tell me what I have to do to save the stream to a file?

########### SOLVED:

To save the file locally from the stream data while preserving its binary nature, I did this:

    public void saveFile(InputStream is) {
        try {
            DataOutputStream out = new DataOutputStream(new BufferedOutputStream(new FileOutputStream(new File("test.pdf"))));
            int c;
            while ((c = is.read()) != -1) {
                out.writeByte(c);
            }
            out.close();
            is.close();
        } catch (IOException e) {
            System.err.println("Error Writing/Reading Streams.");
        }
    }

If you want a more efficient method, you can use IOUtils from Apache Commons IO (org.apache.commons.io.IOUtils) like this:

    public void saveFile(InputStream is) throws IOException {
        OutputStream os = new FileOutputStream(new File("test.pdf"));
        // Reads the entire stream into memory before writing it out
        byte[] bytes = IOUtils.toByteArray(is);
        os.write(bytes);
        os.close();
    }
Augusto Picciani
  • possible duplicate of [How to download and save a file from internet using Java](http://stackoverflow.com/questions/921262/how-to-download-and-save-a-file-from-internet-using-java) – Greg Mattes Nov 17 '11 at 18:07
  • The 'more efficient' method is crap, because it stores the whole file in memory! Try it with a 2 GB file :) ... Less code != more efficient. However, I'm glad you solved your issue!! Good job! – gd1 Nov 19 '11 at 15:22
  • gd1, thanks for your comment. "It stores the whole file in memory!" Yes, that's true, but it depends on the application. In this case we are talking about email attachments, and attachments larger than 10 MB are rare. :) However, I appreciate your tip, thanks again! :) – Augusto Picciani Nov 19 '11 at 16:23

4 Answers

9

Never store binary data into a String.

Never use PrintWriter for binary data.

Never write binary files line by line.

I don't want to be harsh or impolite, but these three nevers have to take root in your mind! :)
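To make the first rule concrete, here is a minimal sketch of what happens when binary bytes pass through a String. The class name, the sample bytes, and the explicit UTF-8 charset are assumptions for illustration; the code in the question uses the platform default charset, which may corrupt a different set of bytes.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class, not part of the question's code.
class BinaryStringRoundTrip {

    // Decodes the bytes as UTF-8 text and encodes the text back to bytes,
    // which is roughly what reading a binary body through a Reader does.
    static byte[] throughString(byte[] original) {
        String text = new String(original, StandardCharsets.UTF_8);
        return text.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Assumed sample: bytes like those found in a compressed PDF stream.
        byte[] binary = {(byte) 0x78, (byte) 0x9C, (byte) 0xFF, (byte) 0xFE, 0x00, 0x41};
        byte[] roundTripped = throughString(binary);
        System.out.println("original length:      " + binary.length);
        System.out.println("round-tripped length: " + roundTripped.length);
        System.out.println("identical: " + Arrays.equals(binary, roundTripped));
    }
}
```

Bytes that are not valid in the charset get replaced during decoding, so the original values cannot be recovered. Pure ASCII content survives the round trip, which would be consistent with uncompressed PDFs opening fine while compressed ones do not.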

You can see this page for an example on how to download a binary file. I don't like this example because it caches the whole document in memory (what happens if its size is 5GB?) but you can start from this. :)
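As one possible alternative that does not cache the whole document in memory, the sketch below streams an InputStream straight to disk with java.nio.file.Files.copy (Java 7 or later). The class name, the temp file, and the sample bytes are assumptions for illustration; in the question's code, the stream would be the one returned by entity.getContent().

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper class, not part of the question's code.
class StreamToFile {

    // Streams the body to the target file and returns the number of bytes written.
    // The input stream is closed when the copy finishes or fails.
    static long save(InputStream body, Path target) throws IOException {
        try (InputStream in = body) {
            return Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
        }
    }

    public static void main(String[] args) throws IOException {
        // Assumed stand-in for a real HTTP response body.
        byte[] sample = {0x25, 0x50, 0x44, 0x46, (byte) 0xFF, 0x00};
        Path target = Files.createTempFile("download", ".pdf");
        long written = save(new ByteArrayInputStream(sample), target);
        System.out.println("bytes written: " + written);
        Files.delete(target);
    }
}
```

No Reader, String, or PrintWriter is involved, so the bytes reach the file unchanged.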

gd1
  • I tried your example (I had already tried it some days ago) but nada. In this case the PDF file does not open. – Augusto Picciani Nov 17 '11 at 22:29
  • You shouldn't copy and paste pieces of code written by other people hoping you've randomly found the correct combination. Once you have understood the problem (downloading a binary file, not a text one) you should use the examples ALONG WITH the Java documentation in order to find a solution that is both correct and tailored to your needs, but written by you. Write some code line by line, debug each single line, and create an SSCCE (http://sscce.org/) for us – gd1 Nov 17 '11 at 22:59
  • So don't try the example you find on the Internet and the one provided by hurtledown INSIDE your program, but in an appropriate, separate test case, and really show us WHERE and HOW it fails. Also, please learn about Java byte streams, because if you find it acceptable to write a binary file line by line, then you'll have more and more problems even after you've succeeded in making your program work in some awkward way. Fix things up! – gd1 Nov 17 '11 at 23:06
  • Obviously, I adapted the example code to my code, I'm not a super-dummy!!! :)) I had already tested another script using the URLConnection library (like hurtledown's example) to download other PDF files from other web servers that don't need a login, and everything was fine (the PDF opened successfully). I would try your and hurtledown's examples in a separate test case, but I can't, because to reproduce a real test I need to log into the specific web server first and then download the PDF file. But managing cookies with URLConnection is so hard! – Augusto Picciani Nov 18 '11 at 00:38
  • OK. So the problem cannot be solved here, because we have no idea what your code may be messing up in the login part, and "I tried your example (I had already tried it some days ago) but nada" is at least misleading, don't you think? You don't need S.O., you basically have to debug your code. If your login part does not work, then stop saying you've got a PDF download issue and concentrate on it. But you apparently have a PDF download issue, too, if you keep on thinking it's fine to use PrintWriter for binary files. :) – gd1 Nov 18 '11 at 06:18
  • gd1, can you take a look to a new answer i published? – Augusto Picciani Nov 18 '11 at 14:44
  • Sure we'll do, it looks complete and understandable. Give me some time. It's also possible others will look at it and solve the issue. ;-) – gd1 Nov 18 '11 at 15:39
7

Use FileUtils from Apache Commons IO. I tried it with a small PDF and a 60 MB JAR. Works great!

import java.io.File;
import java.io.IOException;
import java.net.URL;
import org.apache.commons.io.FileUtils;

String uri = "http://localhost:8080/PMInstaller/f1.pdf";
URL url = new URL(uri);
File destination = new File("f1.pdf");
FileUtils.copyURLToFile(url, destination);
Gary Eberhart
3

Can't you just download from the link directly?

public static void downloadFile(URL from, File to, boolean overwrite) throws Exception {
    if (to.exists()) {
        if (!overwrite)
            throw new Exception("File " + to.getAbsolutePath() + " exists already.");
        if (!to.delete())
            throw new Exception("Cannot delete the file " + to.getAbsolutePath() + ".");
    }

    int lengthTotal = 0;
    try {
        HttpURLConnection content = (HttpURLConnection) from.openConnection();
        lengthTotal = content.getContentLength();
    } catch (Exception e) {
        lengthTotal = -1;
    }

    int lengthSoFar = 0;
    InputStream is = from.openStream();
    FileOutputStream fos = new FileOutputStream(to);

    int lastUpdate = 0;
    int c;
    while ((c = is.read()) != -1) {
        fos.write(c);
    }

    is.close();
    fos.close();
}
hurtledown
  • Reading byte by byte is crazy. However +1 because the overall intentions are good. – gd1 Nov 17 '11 at 18:03
  • you are right... this was developed for small files and to have a precise progressbar. Anyway believe it or not I recently compared it with the speed of downloading a file with nio, in which you just connect the two streams and it takes the same time... – hurtledown Nov 17 '11 at 18:04
  • Yeah, the network can be "crappier" than any unoptimized code we may write. Our machines get better whereas our networks get worse. You should try it with a 5 GB document on a 100Mbit LAN, and it will eventually make some difference... – gd1 Nov 17 '11 at 18:10
  • hurtledown, i have to log in to the webserver with a series of cookies first, and then i can download the file. Any suggestion? – Augusto Picciani Nov 17 '11 at 18:46
  • You are programming an HTTP robot. It will take some effort, since there are no one-liner hints for it. For the cookies: http://download.oracle.com/javase/tutorial/networking/cookies/index.html To login, you probably need to POST the credentials on a login page. See: http://bytestrike.blogspot.com/2008/05/java-inviare-dati-via-http-post.html – gd1 Nov 17 '11 at 19:32
  • I have already written the login part of my code and it works like a charm. But now I'm stuck on the PDF stream problem and can't move ahead. I'm asking all of you why a PDF I download doesn't open. If the downloaded PDF has no encoded content such as "FlateDecode" (or others) between "stream" and "endstream", I can open and view the file. But if some kind of encoding is present, I cannot. – Augusto Picciani Nov 17 '11 at 22:46
  • You are downloading text whereas you should download raw bytes. See: http://www.google.com/?q=difference+between+text+files+and+binary+files – gd1 Nov 17 '11 at 23:01
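The comments above weigh byte-by-byte reads against a precise progress bar. A minimal sketch of a middle ground is a buffered copy that reports the running byte count after each chunk. The class name, the 8 KB buffer size, and the LongConsumer callback (Java 8 or later) are assumptions for illustration, not part of the answer's code.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.function.LongConsumer;

// Hypothetical helper class, not part of the answer's code.
class BufferedCopy {

    // Copies raw bytes in chunks and reports the total copied so far after each chunk.
    static long copy(InputStream in, OutputStream out, LongConsumer progress) throws IOException {
        byte[] buffer = new byte[8192]; // assumed buffer size
        long total = 0;
        int read;
        while ((read = in.read(buffer)) != -1) {
            out.write(buffer, 0, read); // write only the bytes actually read
            total += read;
            progress.accept(total);
        }
        out.flush();
        return total;
    }

    public static void main(String[] args) throws IOException {
        // Assumed stand-in for a real download: 20000 zero bytes.
        byte[] sample = new byte[20000];
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        long total = copy(new ByteArrayInputStream(sample), sink,
                soFar -> System.out.println("copied so far: " + soFar));
        System.out.println("total: " + total);
    }
}
```

Memory use stays fixed at the buffer size regardless of file size, and the callback can drive a progress bar when the total length is known.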
0

Let jsoup do the hard work of downloading the response as bytes.

Response response = Jsoup.connect(location)
        .ignoreContentType(true)
        .userAgent("Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:25.0) Gecko/20100101 Firefox/25.0")
        .referrer("http://www.google.com")
        .timeout(12000)
        .execute();

Write the bytes using FileUtils from Apache Commons IO.

FileUtils.writeByteArrayToFile(new File(path), response.bodyAsBytes());
Sorter