
I am using a ZipOutputStream to zip up a bunch of files that are a mix of already-zipped formats as well as lots of large, highly compressible formats like plain text.

Most of the already-zipped files are large, and it makes no sense to spend CPU and memory recompressing them, since they never get smaller and on rare occasions get slightly larger.

I am trying to use .setMethod(ZipEntry.STORED) when I detect a pre-compressed file, but it complains that I need to supply the size, compressedSize and crc for those files.

I can get it to work with the following approach, but this requires reading each file twice: once to calculate the CRC32 and again to actually copy the file into the ZipOutputStream.

// code that determines the value of method omitted for brevity
if (STORED == method)
{
    fze.setMethod(STORED);
    fze.setCompressedSize(fe.attributes.size());
    // first pass: read the whole file just to compute the CRC32 (Guava)
    final HashingInputStream his = new HashingInputStream(Hashing.crc32(), fis);
    ByteStreams.copy(his, ByteStreams.nullOutputStream());
    fze.setCrc(his.hash().padToLong());
}
else
{
    fze.setMethod(DEFLATED);
}
zos.putNextEntry(fze);
// second pass: read the file again to copy its contents into the zip
ByteStreams.copy(new FileInputStream(fe.path.toFile()), zos);
zos.closeEntry();

Is there a way to provide this information without having to read the input stream twice?

2 Answers


Short Answer:

I could not determine a way to read the files only once and calculate the CRC with the standard library given the time I had to solve this problem.

I did find an optimization that decreased the time by about 50% on average.

I pre-calculate the CRC of the files to be stored concurrently with an ExecutorCompletionService limited to Runtime.getRuntime().availableProcessors() and wait until they are done. The effectiveness of this varies with the number of files that need a CRC calculated: the more files, the bigger the benefit.
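
A minimal sketch of that pre-calculation step, assuming a filesToStore list of the files that will be STORED (the names here are placeholders, not the code I actually run):

import java.io.BufferedInputStream;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.AbstractMap;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.zip.CRC32;

...

static Map<Path, Long> precalculateCrcs(List<Path> filesToStore) throws Exception {
    ExecutorService pool = Executors.newFixedThreadPool(
            Runtime.getRuntime().availableProcessors());
    ExecutorCompletionService<Map.Entry<Path, Long>> ecs =
            new ExecutorCompletionService<>(pool);

    // submit one CRC task per file that will be STORED
    for (Path path : filesToStore) {
        ecs.submit(() -> {
            CRC32 crc = new CRC32();
            byte[] buf = new byte[64 * 1024];
            try (InputStream in = new BufferedInputStream(Files.newInputStream(path))) {
                int n;
                while ((n = in.read(buf)) != -1) {
                    crc.update(buf, 0, n);
                }
            }
            return new AbstractMap.SimpleEntry<>(path, crc.getValue());
        });
    }

    // collect the results as they complete; take() blocks until one task is done
    Map<Path, Long> crcByPath = new HashMap<>();
    try {
        for (int i = 0; i < filesToStore.size(); i++) {
            Map.Entry<Path, Long> done = ecs.take().get();
            crcByPath.put(done.getKey(), done.getValue());
        }
    } finally {
        pool.shutdown();
    }
    return crcByPath;
}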

Then, in .postVisitDirectory(), I wrap a ZipOutputStream around the PipedOutputStream half of a PipedInputStream/PipedOutputStream pair, serviced by a temporary Thread. This turns the ZipOutputStream output into an InputStream I can pass to the HttpRequest to upload the results to a remote server, while all the pre-calculated ZipEntry/Path objects are written out serially.
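
The piping arrangement looks roughly like this; upload(InputStream) is a hypothetical stand-in for handing the stream to the HttpRequest (it handles its own exceptions), and precalculated holds the ZipEntry/Path pairs whose CRCs were computed earlier:

import java.io.IOException;
import java.io.PipedInputStream;
import java.io.PipedOutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

...

void zipAndUpload(List<Map.Entry<ZipEntry, Path>> precalculated)
        throws IOException, InterruptedException {
    PipedInputStream pipeIn = new PipedInputStream(64 * 1024);
    PipedOutputStream pipeOut = new PipedOutputStream(pipeIn);

    // the upload consumes the pipe on a temporary thread
    Thread uploader = new Thread(() -> upload(pipeIn));
    uploader.start();

    // meanwhile this thread writes the pre-calculated entries serially;
    // closing the ZipOutputStream closes the pipe and signals end-of-stream
    try (ZipOutputStream zos = new ZipOutputStream(pipeOut)) {
        for (Map.Entry<ZipEntry, Path> e : precalculated) {
            zos.putNextEntry(e.getKey());   // size/CRC already set for STORED entries
            Files.copy(e.getValue(), zos);
            zos.closeEntry();
        }
    }
    uploader.join();
}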

This is good enough for now to process the 300+ GB of immediate needs, but when I get to the 10 TB job I will look at addressing it and try to find some more gains without adding too much complexity.

If I come up with something substantially better time-wise I will update this answer with the new implementation.

Long Answer:

I ended up writing a clean-room ZipOutputStream that supports multi-part zip files and intelligent compression levels vs. STORED, and that calculates the CRC as the data is read, writing the metadata out at the end of the stream.


Why ZipOutputStream.setLevel() swapping will not work:

The ZipOutputStream.setLevel(NO_COMPRESSION/DEFAULT_COMPRESSION) hack is not a viable approach. I did extensive tests on hundreds of gigabytes of data across thousands of folders and files, and the measurements were conclusive: deflating the already-compressed files at NO_COMPRESSION gains nothing over calculating their CRC and STORING them. It is actually slower by a large margin!
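
For reference, the hack in question looks roughly like this (my own sketch; zos, file and name are placeholders, and isAlreadyCompressed(...) stands in for whatever detection of pre-compressed formats is used):

// every entry stays DEFLATED; only the compression level is toggled
// per entry, so no CRC has to be supplied up front
if (isAlreadyCompressed(file)) {
    zos.setLevel(Deflater.NO_COMPRESSION);
} else {
    zos.setLevel(Deflater.DEFAULT_COMPRESSION);
}
ZipEntry entry = new ZipEntry(name);
zos.putNextEntry(entry);   // method defaults to DEFLATED
Files.copy(file, zos);
zos.closeEntry();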

In my tests the files were on a network-mounted drive, so reading the already-compressed files over the network twice (once to calculate the CRC and once to add them to the ZipOutputStream) was as fast as or faster than processing all the files once as DEFLATED and changing .setLevel() on the ZipOutputStream.

There is no local filesystem caching going on with the network access. This is a worst-case scenario; processing files on the local disk will be much, much faster because of local filesystem caching.

So this hack is a naive approach based on a false assumption: the data is still run through the compression algorithm even at the NO_COMPRESSION level, and that overhead is higher than reading the files twice.


> I could not determine a way to read the files only once and calculate the CRC with the standard library given the time I had to solve this problem.
>
> I did find an optimization that decreased the time by about 50% on average.
>
> I pre-calculate the CRC of the files to be stored concurrently...

I've measured about the same improvement over alternating ZipOutputStream.setLevel(Deflater.NO_COMPRESSION) and ZipOutputStream.setLevel(Deflater.DEFAULT_COMPRESSION), without concurrent CRC calculation, using:

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.Channels;
import java.nio.channels.FileChannel;
import java.nio.channels.WritableByteChannel;
import java.nio.channels.FileChannel.MapMode;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.CRC32;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

...

    void addTo(ZipOutputStream zipOut, Path file) throws IOException {
        try (FileChannel fch = FileChannel.open(file)) {
            // map the file once; both the CRC pass and the copy read from memory
            MappedByteBuffer buf = fch.map(MapMode.READ_ONLY, 0, fch.size());
            ZipEntry entry = new ZipEntry(relativize(file));
            entry.setLastModifiedTime(Files.getLastModifiedTime(file));
            if (entry.getName().endsWith(".zip")
                    || entry.getName().endsWith(".gz")) {
                // already-compressed formats are STORED, which requires size and CRC up front
                entry.setMethod(ZipEntry.STORED);
                entry.setSize(buf.remaining());
                entry.setCrc(checkSum(buf));
            }
            zipOut.putNextEntry(entry);
            @SuppressWarnings("resource") // closing the channel would close zipOut
            WritableByteChannel zipCh = Channels.newChannel(zipOut);
            zipCh.write(buf);
            zipOut.closeEntry();
        }
    }

    static long checkSum(ByteBuffer buf) {
        CRC32 crc = new CRC32();
        int mark = buf.position();
        crc.update(buf);      // consumes the buffer up to its limit
        buf.position(mark);   // rewind so the same buffer can still be written to the zip
        return crc.getValue();
    }

(The relativize(Path) : String method is left out of the example.)

The CRC32 class provides a very efficient update(ByteBuffer) method for use with memory-mapped (direct) file buffers.
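
For context, addTo(...) might be driven from a file walk along these lines (a sketch; baseDir and zipFile are placeholders, and java.util.Iterator / java.util.stream.Stream imports are needed in addition to those above):

    void zipTree(Path baseDir, Path zipFile) throws IOException {
        try (ZipOutputStream zipOut =
                     new ZipOutputStream(Files.newOutputStream(zipFile));
             Stream<Path> paths = Files.walk(baseDir)) {
            // walk the tree and add every regular file, storing or deflating as above
            Iterator<Path> it = paths.filter(Files::isRegularFile).iterator();
            while (it.hasNext()) {
                addTo(zipOut, it.next());
            }
        }
    }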