
I will need to perform a massive download of files from my web application.

It is obviously expected to be a long-running action (it'll be used once per year [per customer]), so time is not a problem (unless it hits some timeout, but I can handle that by creating some form of keepalive heartbeat). I know how to create a hidden iframe and use it with Content-Disposition: attachment to attempt to download the file instead of opening it inside the browser, and how to set up client-server communication for drawing a progress meter;

The actual size of the download (and the number of files) is unknown, but for simplicity we can virtually consider it as 1GB, composed of 100 files, each 10MB.

Since this should be a one-click operation, my first thought was to group all the files, while reading them from the database, in a dynamically generated ZIP, then ask the user to save the ZIP.

The question is: what are the best practices, and what are the known drawbacks and traps, in creating a huge archive from multiple small byte arrays in a WebApp?

That can be split into:

  • should each byte array be converted into a physical temp file, or can they be added to the ZIP in memory?
  • if yes, I know I'll have to handle possible name collisions (files can have the same name in different records of the database, but not inside the same file system nor the same ZIP): are there any other possible problems that come to mind (assuming the file system always has enough physical space)?
  • since I can't rely on having enough RAM to perform the whole operation in memory, I guess the ZIP should be created and fed to the file system before being sent to the user; is there any way to do it differently (e.g. with websockets), like asking the user where to save the file, and then starting a constant flow of data from the server to the client (Sci-Fi, I guess)?
  • any other related known problems or best practices that cross your mind would be greatly appreciated.
n00begon
Andrea Ligios
  • Once a year? Perhaps HTTP isn't the best way to do this. Depending on who you're sending this to, I'd consider something like rsync. – Zutty May 16 '13 at 10:49
  • I totally agree with you. Unfortunately, this **must** be a feature of that webapp (which contains sensitive data; we don't even have permission to read the production database... let's hope the customer(s) never have problems with anything) that we can't discuss :/ – Andrea Ligios May 16 '13 at 11:59
  • are you using a certain framework for your webapp? – Marco Forberg May 16 '13 at 12:27
  • `Struts2 + Spring`... do you know something framework-specific? I omitted it (even in the tags) because I thought it was not relevant – Andrea Ligios May 16 '13 at 12:35
  • well in a project i worked on last year we had something quite similar: the user uploaded a file that got processed and at the end the user was presented a report. but sometimes this took several hours so putting the result in a session was not an option. we had a static reference to the worker thread in our action class. so when the user accessed that action we checked the status of the worker thread and depending on its status provided a) option to start b) progress info c) download-link. we used turbine but i think this should work with struts as well – Marco Forberg May 16 '13 at 12:48

3 Answers


Kick-off example of a totally dynamic ZIP file, created by streaming each BLOB from the database directly to the client's file system.

Tested with huge archives, with the following performance:

  • Server disk space cost: 0 megabytes
  • Server RAM cost: ~ xx megabytes. The memory consumption is not easily measurable (or at least I don't know how to do it properly), because I got different, apparently random results from running the same routine multiple times (using Runtime.getRuntime().freeMemory() before, during and after the loop). However, the memory consumption is lower than using byte[], and that's enough.


FileStreamDto.java using InputStream instead of byte[]

import java.io.InputStream;
import java.io.Serializable;

import lombok.Getter;
import lombok.Setter;

public class FileStreamDto implements Serializable {
    @Getter @Setter private String filename;
    @Getter @Setter private InputStream inputStream; 
}


Java Servlet (or Struts2 Action)

/* Read the amount of data to be streamed from Database to File System,
   summing the size of all Oracle's BLOB, PostgreSQL's bytea, etc.: 
   SELECT sum(length(my_blob_field)) FROM my_table WHERE my_conditions
*/          
Long overallSize = getMyService().precalculateZipSize();

// Tell the browser it's a ZIP
response.setContentType("application/zip"); 
// Tell the browser the filename, and that it needs to be downloaded instead of opened
response.addHeader("Content-Disposition", "attachment; filename=\"myArchive.zip\"");        
// Tell the browser the overall size, so it can show a realistic progressbar
// (note: the real ZIP size differs slightly from the raw sum, because of entry
// headers and compression, so treat this as an estimate)
response.setHeader("Content-Length", String.valueOf(overallSize));      

ServletOutputStream sos = response.getOutputStream();       
ZipOutputStream zos = new ZipOutputStream(sos);

// Set-up a list of filenames to prevent duplicate entries
HashSet<String> entries = new HashSet<String>();

/* Read all the ID from the interested records in the database, 
   to query them later for the streams: 
   SELECT my_id FROM my_table WHERE my_conditions */           
List<Long> allId = getMyService().loadAllId();

for (Long currentId : allId){
    /* Load the record relative to the current ID:         
       SELECT my_filename, my_blob_field FROM my_table WHERE my_id = :currentId            
       Use resultset.getBinaryStream("my_blob_field") while mapping the BLOB column */
    FileStreamDto fileStream = getMyService().loadFileStream(currentId);

    // Create a zipEntry with a non-duplicate filename, and add it to the ZipOutputStream
    ZipEntry zipEntry = new ZipEntry(getUniqueFileName(entries,fileStream.getFilename()));
    zos.putNextEntry(zipEntry);

    // Use Apache Commons to transfer the InputStream from the DB to the response
    // OutputStream; at this moment, your file is ALREADY being downloaded and growing
    IOUtils.copy(fileStream.getInputStream(), zos);

    zos.flush();
    zos.closeEntry();

    fileStream.getInputStream().close();                    
}

zos.close();
sos.close();    


Helper method for handling duplicate entries

private String getUniqueFileName(HashSet<String> entries, String completeFileName){                         
    if (entries.contains(completeFileName)){                                                
        int extPos = completeFileName.lastIndexOf('.');
        String extension = extPos>0 ? completeFileName.substring(extPos) : "";          
        String partialFileName = extension.length()==0 ? completeFileName : completeFileName.substring(0,extPos);
        int x=1;
        while (entries.contains(completeFileName = partialFileName + "(" + x + ")" + extension))
            x++;
    } 
    entries.add(completeFileName);
    return completeFileName;
}
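
To make the collision handling concrete, here is a small standalone demo of the helper above (the class name and sample filenames are invented for illustration; the method body is copied verbatim so the demo is self-contained):

```java
import java.util.HashSet;

public class UniqueFileNameDemo {

    // Same logic as the helper above, reproduced here to make the demo runnable
    static String getUniqueFileName(HashSet<String> entries, String completeFileName) {
        if (entries.contains(completeFileName)) {
            int extPos = completeFileName.lastIndexOf('.');
            String extension = extPos > 0 ? completeFileName.substring(extPos) : "";
            String partialFileName = extension.length() == 0 ? completeFileName : completeFileName.substring(0, extPos);
            int x = 1;
            while (entries.contains(completeFileName = partialFileName + "(" + x + ")" + extension))
                x++;
        }
        entries.add(completeFileName);
        return completeFileName;
    }

    public static void main(String[] args) {
        HashSet<String> entries = new HashSet<String>();
        System.out.println(getUniqueFileName(entries, "report.pdf")); // report.pdf
        System.out.println(getUniqueFileName(entries, "report.pdf")); // report(1).pdf
        System.out.println(getUniqueFileName(entries, "report.pdf")); // report(2).pdf
        System.out.println(getUniqueFileName(entries, "README"));     // README
        System.out.println(getUniqueFileName(entries, "README"));     // README(1)
    }
}
```

Duplicates get a Windows-style "(1)", "(2)" suffix before the extension, and extensionless names are handled too.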



Thanks a lot @prunge for giving me the idea of the direct streaming.

Andrea Ligios
  • hi, I have almost the same use case as you, except that the byte stream comes from a remote host via an HTTP connection. I found that this costs a lot of memory in the system, not the JVM. Do you have the same issue? – StrikeW Sep 24 '14 at 04:01
  • Nope, but I've commented your question – Andrea Ligios Sep 24 '14 at 10:35
  • Is there any way to notify the client of an error in the middle of the streaming/zipping process? Say in the middle of the for loop in this example – Anddo Oct 17 '21 at 12:29
  • That's already managed by the browser... if there's an error, the browser will tell you it cannot complete the download – Andrea Ligios Oct 17 '21 at 14:43
  • Sadly that didn't happen. Part of the zip file (with correct files) shows as done on the browser while the exception is thrown and logged on the backend. – Anddo Oct 17 '21 at 15:56
  • Well, that's up to the browser implementation, I guess. I remember having also created back then a progress bar in the web app, polling a value in session fed by the thread that was generating the ZIP. I guess there are many alternatives, like queues for example. – Andrea Ligios Oct 17 '21 at 17:11
  • Yeah, it might need custom monitoring from the client side. However, just FYI, terminating the servlet output stream with a failure action causes the browser (chrome/ff/edge) to show download failure. This does NOT happen in tomcat/undertow except for tomcat version 9.0.53 > https://tomcat.apache.org/tomcat-9.0-doc/changelog.html related to the commit > https://github.com/apache/tomcat/commit/cf9abfc1decae26e61eefd89a54317ae8696be7b – Anddo Oct 18 '21 at 12:10
  • I managed to trace and do like spring in tomcat adapter to deal with native coyote response because there are exception handlers preventing the trigger of this action even though I don't like this. Also this commit included in latest Spring Boot 2.5.5 which we don't use so far. Sorry for long commit, thought to share what I found. Still will look for any leaks regarding this action though. – Anddo Oct 18 '21 at 12:13

For large content that won't fit in memory at once, stream the content from the database to the response.

This kind of thing is actually pretty simple. You don't need AJAX or websockets; it's possible to stream large file downloads through a simple link that the user clicks on. And modern browsers have decent download managers with their own progress bars, so why reinvent the wheel?

If you're writing a servlet from scratch for this, get the database BLOB, obtain its input stream, and copy its content through to the HTTP response output stream. If you have the Apache Commons IO library, you can use IOUtils.copy(); otherwise you can do this yourself.
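
If Commons IO is not on the classpath, the copy loop mentioned above is only a few lines; a minimal sketch (the class name and the 8 KB buffer size are arbitrary choices):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

public class StreamCopy {

    // Hand-rolled equivalent of IOUtils.copy(): moves bytes from the BLOB's
    // InputStream to the response OutputStream without buffering it all in memory
    public static long copy(InputStream in, OutputStream out) throws IOException {
        byte[] buffer = new byte[8192];
        long total = 0;
        int n;
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
            total += n;
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "some blob content".getBytes("UTF-8");
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        long copied = copy(new ByteArrayInputStream(data), out);
        System.out.println(copied); // 17
    }
}
```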

Creating a ZIP file on the fly can be done with a ZipOutputStream. Create one of these over the response output stream (from the servlet or whatever your framework gives you), then get each BLOB from the database, using putNextEntry() first and then streaming each BLOB as described before.

Potential Pitfalls/Issues:

  • Depending on the download size and network speed, the request might take a lot of time to complete. Firewalls, etc. can get in the way of this and terminate the request early.
  • Hopefully your users are on a decent corporate network when requesting these files. It would be far worse over remote/dodgy/mobile connections (if it drops out after downloading 1.9 GB of 2.0 GB, users have to start again).
  • It can put a bit of load on your server, especially compressing huge ZIP files. It might be worth turning compression down/off when creating the ZipOutputStream if this is a problem.
  • ZIP files over 4 GB (the limit of the original ZIP format) might have issues with some ZIP programs. I think the latest Java 7 uses ZIP64 extensions, so that version of Java will write the huge ZIP correctly, but will the clients have programs that support large zip files? I've definitely run into issues with these before, especially on old Solaris servers
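
Turning compression down, as suggested in the third bullet, is a one-liner on the ZipOutputStream; a sketch under the assumption that bandwidth is cheaper than server CPU (in the real servlet, the ByteArrayOutputStream below would be the response output stream):

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.Deflater;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

public class NoCompressionZip {

    // Build a ZIP with compression disabled: entries are deflated at level 0,
    // which saves CPU on the server at the cost of a larger download
    public static byte[] build(String entryName, byte[] content) throws IOException {
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        ZipOutputStream zos = new ZipOutputStream(baos);
        zos.setLevel(Deflater.NO_COMPRESSION); // or Deflater.BEST_SPEED for a light compromise
        zos.putNextEntry(new ZipEntry(entryName));
        zos.write(content);
        zos.closeEntry();
        zos.close();
        return baos.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] zip = build("file.txt", "hello".getBytes("UTF-8"));
        // Every ZIP starts with the local file header signature "PK"
        System.out.println("" + (char) zip[0] + (char) zip[1]); // PK
    }
}
```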
Tigerware
prunge
  • The streaming from database worked like a charm! The duplicate entries handler is done, and the only things left are the double click control / modal behavior and the error handler in my specific framework (since I used a Struts2 Action with result NONE and manual response writing instead of a Servlet). Still, thanks for the idea! – Andrea Ligios May 20 '13 at 16:30

Maybe you want to try multiple concurrent downloads. I found a related discussion here: Java multithreaded file downloading performance

Hope this helps.

Indu Devanath
  • That is undoubtedly a fascinating argument, but I don't think it fits my case as it is now. Thanks all the same @InduDevanath – Andrea Ligios May 17 '13 at 10:27