Short Answer:
Given the time I had to solve this problem, I could not find a way to read the files only once and calculate the CRC with the standard library. I did, however, find an optimization that decreased the time by about 50% on average.
I pre-calculate the CRCs of the files to be stored concurrently with an ExecutorCompletionService limited to Runtime.getRuntime().availableProcessors() threads and wait until they are all done. How much this helps varies with the number of files that need a CRC calculated: the more files, the greater the benefit.
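For illustration, a minimal sketch of that idea follows; the class name, buffer size, and Map-based bookkeeping are my own choices, not the answer's actual code.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.Map;
import java.util.concurrent.*;
import java.util.zip.CRC32;

final class CrcPrecalculator {

    /** Computes the CRC-32 of a single file by streaming it once. */
    static long crc32Of(final Path path) throws IOException {
        final CRC32 crc = new CRC32();
        final byte[] buffer = new byte[64 * 1024];
        try (InputStream in = Files.newInputStream(path)) {
            int read;
            while ((read = in.read(buffer)) != -1) {
                crc.update(buffer, 0, read);
            }
        }
        return crc.getValue();
    }

    /** Submits one CRC task per file and blocks until every result is available. */
    static Map<Path, Long> precalculate(final List<Path> files) throws Exception {
        final ExecutorService pool =
                Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
        final ExecutorCompletionService<Map.Entry<Path, Long>> ecs =
                new ExecutorCompletionService<>(pool);
        try {
            for (final Path file : files) {
                ecs.submit(() -> Map.entry(file, crc32Of(file)));
            }
            final Map<Path, Long> crcs = new ConcurrentHashMap<>();
            for (int i = 0; i < files.size(); i++) {
                final Map.Entry<Path, Long> e = ecs.take().get(); // wait for each result
                crcs.put(e.getKey(), e.getValue());
            }
            return crcs;
        } finally {
            pool.shutdown();
        }
    }
}
```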
Then, in .postVisitDirectory(), I wrap a ZipOutputStream around the PipedOutputStream half of a PipedInputStream/PipedOutputStream pair running on a temporary Thread. That converts the ZipOutputStream output into an InputStream I can pass to the HttpRequest, so the results of the ZipOutputStream are uploaded to a remote server while all the precalculated ZipEntry/Path objects are written serially.
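A rough sketch of that piping arrangement, assuming the Java 11 HttpClient; for brevity it writes plain DEFLATED entries rather than the precalculated STORED ZipEntry objects described above, and the class name, entry naming, and URI handling are illustrative only.

```java
import java.io.IOException;
import java.io.PipedInputStream;
import java.io.PipedOutputStream;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

final class ZipUploader {

    static void zipAndUpload(final List<Path> files, final URI target) throws Exception {
        final PipedInputStream zipIn = new PipedInputStream(64 * 1024);
        final PipedOutputStream zipOut = new PipedOutputStream(zipIn);

        // Producer thread: writes the zip into the pipe while the HTTP client reads from it.
        final Thread producer = new Thread(() -> {
            try (ZipOutputStream zos = new ZipOutputStream(zipOut)) {
                for (final Path file : files) {
                    zos.putNextEntry(new ZipEntry(file.getFileName().toString()));
                    Files.copy(file, zos);
                    zos.closeEntry();
                }
            } catch (IOException e) {
                throw new RuntimeException(e);
            }
        }, "zip-producer");
        producer.start();

        // Consumer: the piped InputStream is handed to HttpRequest as the request body.
        final HttpRequest request = HttpRequest.newBuilder(target)
                .header("Content-Type", "application/zip")
                .POST(HttpRequest.BodyPublishers.ofInputStream(() -> zipIn))
                .build();
        final HttpResponse<Void> response =
                HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.discarding());

        producer.join();
        System.out.println("Upload finished with HTTP " + response.statusCode());
    }
}
```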
This is good enough for now to process the 300+GB of immediate needs, but when I get to the 10TB job I will revisit it and try to find more advantages without adding too much complexity. If I come up with something substantially better time-wise, I will update this answer with the new implementation.
Long Answer:
I ended up writing a clean-room ZipOutputStream that supports multipart zip files and intelligently chooses between compression levels and STORED entries, and that calculates the CRC as it reads and then writes out the metadata at the end of the stream.
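The clean-room writer itself is not reproduced here, but its core single-pass idea, accumulating the CRC while the bytes are copied so it can be emitted as trailing metadata (the zip data descriptor) instead of in the local file header, can be sketched with the standard CheckedInputStream; the helper name is mine.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.zip.CRC32;
import java.util.zip.CheckedInputStream;

final class CrcWhileCopying {

    /**
     * Copies a stream to the output while accumulating its CRC-32 in a single pass,
     * so the CRC is known afterwards and can be written as trailing metadata
     * rather than requiring a separate read of the file up front.
     */
    static long copyAndCrc(final InputStream source, final OutputStream sink) throws IOException {
        final CheckedInputStream checked = new CheckedInputStream(source, new CRC32());
        final byte[] buffer = new byte[64 * 1024];
        int read;
        while ((read = checked.read(buffer)) != -1) {
            sink.write(buffer, 0, read);
        }
        return checked.getChecksum().getValue();
    }
}
```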
Why ZipOutputStream.setLevel() swapping will not work:
The ZipOutputStream.setLevel(NO_COMPRESSION/DEFAULT_COMPRESSION) hack is not a viable approach. I did extensive tests on hundreds of gigs of data, thousands of folders and files, and the measurements were conclusive. Deflating the files at NO_COMPRESSION gains nothing over calculating the CRC for the STORED files up front. It is actually slower by a large margin!
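For context, the hack being dismissed looks roughly like the following; the file-type test and class name are illustrative only.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.zip.Deflater;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

final class SetLevelSwappingHack {

    /** Already-compressed extensions that gain nothing from DEFLATE. */
    private static boolean alreadyCompressed(final Path file) {
        final String name = file.getFileName().toString().toLowerCase();
        return name.endsWith(".zip") || name.endsWith(".jpg") || name.endsWith(".gz");
    }

    static void zipAll(final List<Path> files, final OutputStream out) throws IOException {
        try (ZipOutputStream zos = new ZipOutputStream(out)) {
            for (final Path file : files) {
                // Every entry stays DEFLATED; only the effort level is toggled, so the
                // bytes still flow through the deflater even at NO_COMPRESSION.
                zos.setLevel(alreadyCompressed(file)
                        ? Deflater.NO_COMPRESSION
                        : Deflater.DEFAULT_COMPRESSION);
                zos.putNextEntry(new ZipEntry(file.getFileName().toString()));
                Files.copy(file, zos);
                zos.closeEntry();
            }
        }
    }
}
```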
In my tests the files were on a network-mounted drive, so reading the already compressed files twice over the network, once to calculate the CRC and again to add them to the ZipOutputStream, was as fast or faster than processing all the files once as DEFLATED and changing .setLevel() on the ZipOutputStream.
There is no local filesystem caching going on with the network access. This is a worst-case scenario; processing files on a local disk will be much, much faster because of local filesystem caching.
So this hack is a naive approach based on false assumptions. It still pushes the data through the compression algorithm even at the NO_COMPRESSION level, and that overhead is higher than reading the files twice.