Using the TextIO.Read transform with a large collection of gzip-compressed text files (1000+ files, each between 100MB and 1.5GB), we sometimes get the following error:
java.util.zip.ZipException: too many length or distance symbols
    at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:164)
    at java.util.zip.GZIPInputStream.read(GZIPInputStream.java:117)
    at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
    at java.io.BufferedInputStream.read1(BufferedInputStream.java:286)
    at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
    at java.io.FilterInputStream.read(FilterInputStream.java:133)
    at java.io.PushbackInputStream.read(PushbackInputStream.java:186)
    at com.google.cloud.dataflow.sdk.runners.worker.TextReader$ScanState.readBytes(TextReader.java:261)
    at com.google.cloud.dataflow.sdk.runners.worker.TextReader$TextFileIterator.readElement(TextReader.java:189)
    at com.google.cloud.dataflow.sdk.runners.worker.FileBasedReader$FileBasedIterator.computeNextElement(FileBasedReader.java:265)
    at com.google.cloud.dataflow.sdk.runners.worker.FileBasedReader$FileBasedIterator.hasNext(FileBasedReader.java:165)
    at com.google.cloud.dataflow.sdk.util.common.worker.ReadOperation.runReadLoop(ReadOperation.java:169)
    at com.google.cloud.dataflow.sdk.util.common.worker.ReadOperation.start(ReadOperation.java:118)
    at com.google.cloud.dataflow.sdk.util.common.worker.MapTaskExecutor.execute(MapTaskExecutor.java:66)
    at com.google.cloud.dataflow.sdk.runners.worker.DataflowWorker.executeWork(DataflowWorker.java:204)
    at com.google.cloud.dataflow.sdk.runners.worker.DataflowWorker.doWork(DataflowWorker.java:151)
    at com.google.cloud.dataflow.sdk.runners.worker.DataflowWorker.getAndPerformWork(DataflowWorker.java:118)
    at com.google.cloud.dataflow.sdk.runners.worker.DataflowWorkerHarness$WorkerThread.call(DataflowWorkerHarness.java:139)
    at com.google.cloud.dataflow.sdk.runners.worker.DataflowWorkerHarness$WorkerThread.call(DataflowWorkerHarness.java:124)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Searching online for the same ZipException only led to this reply:
Zip file errors often happen when the hot deployer attempts to deploy an application before it is fully copied to the deploy directory. This is fairly common if it takes several seconds to copy the file. The solution is to copy the file to a temporary directory on the same disk partition as the application server, and then move the file to the deploy directory.
Has anybody else run into a similar exception? Or is there any way we can fix this problem?
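The exception is thrown while inflating the gzip stream, so one possibility (an assumption, not something we have confirmed) is that a few of the input files are corrupt or truncated. A minimal sketch of how we could check this outside Dataflow, assuming the files can first be copied to a local disk: the class name GzipCheck and the method decompressesCleanly are made up for illustration, and only standard java.util.zip and java.nio calls are used.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.zip.GZIPInputStream;

// Hypothetical helper: reports which local .gz files fail to decompress.
public class GzipCheck {

    // Returns true if the whole stream inflates without an IOException.
    // ZipException (bad deflate data) and EOFException (truncated file)
    // are both subclasses of IOException, so both count as failures.
    static boolean decompressesCleanly(InputStream raw) {
        byte[] buffer = new byte[64 * 1024];
        try (GZIPInputStream in = new GZIPInputStream(raw)) {
            while (in.read(buffer) != -1) {
                // Discard the data; only the success of inflation matters.
            }
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) throws IOException {
        // Each argument is assumed to be the path of a local gzip file.
        for (String name : args) {
            Path path = Paths.get(name);
            try (InputStream raw = Files.newInputStream(path)) {
                String status = decompressesCleanly(raw) ? "OK      " : "CORRUPT ";
                System.out.println(status + path);
            }
        }
    }
}
```

Running it as `java GzipCheck part-0001.gz part-0002.gz` (file names are placeholders) prints one line per file. If every file comes back OK locally, the data itself is probably fine and the problem is more likely in how the files are being read by the job.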