
Using the TextIO.Read transform with a large collection of compressed text files (1000+ files, sizes between 100 MB and 1.5 GB), we sometimes get the following error:

java.util.zip.ZipException: too many length or distance symbols at
java.util.zip.InflaterInputStream.read(InflaterInputStream.java:164) at
java.util.zip.GZIPInputStream.read(GZIPInputStream.java:117) at
java.io.BufferedInputStream.fill(BufferedInputStream.java:246) at
java.io.BufferedInputStream.read1(BufferedInputStream.java:286) at
java.io.BufferedInputStream.read(BufferedInputStream.java:345) at
java.io.FilterInputStream.read(FilterInputStream.java:133) at
java.io.PushbackInputStream.read(PushbackInputStream.java:186) at 
com.google.cloud.dataflow.sdk.runners.worker.TextReader$ScanState.readBytes(TextReader.java:261) at 
com.google.cloud.dataflow.sdk.runners.worker.TextReader$TextFileIterator.readElement(TextReader.java:189) at 
com.google.cloud.dataflow.sdk.runners.worker.FileBasedReader$FileBasedIterator.computeNextElement(FileBasedReader.java:265) at 
com.google.cloud.dataflow.sdk.runners.worker.FileBasedReader$FileBasedIterator.hasNext(FileBasedReader.java:165) at 
com.google.cloud.dataflow.sdk.util.common.worker.ReadOperation.runReadLoop(ReadOperation.java:169) at 
com.google.cloud.dataflow.sdk.util.common.worker.ReadOperation.start(ReadOperation.java:118) at 
com.google.cloud.dataflow.sdk.util.common.worker.MapTaskExecutor.execute(MapTaskExecutor.java:66) at 
com.google.cloud.dataflow.sdk.runners.worker.DataflowWorker.executeWork(DataflowWorker.java:204) at 
com.google.cloud.dataflow.sdk.runners.worker.DataflowWorker.doWork(DataflowWorker.java:151) at 
com.google.cloud.dataflow.sdk.runners.worker.DataflowWorker.getAndPerformWork(DataflowWorker.java:118) at 
com.google.cloud.dataflow.sdk.runners.worker.DataflowWorkerHarness$WorkerThread.call(DataflowWorkerHarness.java:139) at 
com.google.cloud.dataflow.sdk.runners.worker.DataflowWorkerHarness$WorkerThread.call(DataflowWorkerHarness.java:124) at
java.util.concurrent.FutureTask.run(FutureTask.java:266) at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at
java.lang.Thread.run(Thread.java:745)
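For reference, the read side of the pipeline is nothing more than the following (a minimal sketch; the bucket and output paths are placeholders):

```java
import com.google.cloud.dataflow.sdk.Pipeline;
import com.google.cloud.dataflow.sdk.io.TextIO;
import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;

public class ReadCompressedText {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // TextIO infers GZIP compression from the .gz file extension by default.
    p.apply(TextIO.Read.from("gs://my-bucket/input/*.gz"))
     .apply(TextIO.Write.to("gs://my-bucket/output/lines"));

    p.run();
  }
}
```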

Searching online for the same ZipException only led to this reply:

Zip file errors often happen when the hot deployer attempts to deploy an application before it is fully copied to the deploy directory. This is fairly common if it takes several seconds to copy the file. The solution is to copy the file to a temporary directory on the same disk partition as the application server, and then move the file to the deploy directory.

Did anybody else run into a similar exception? Or is there any way we can fix this problem?

Fematich
  • You shouldn't be using `TextIO` at all unless the files are text, and the answer that you've quoted appears to cover another possibility. – user207421 Jul 27 '15 at 10:27
  • 1
    Thanks for your reply. However, according to [the Dataflow documentation](https://cloud.google.com/dataflow/model/text-io#compressed), I believe TextIO should be the recommended approach for reading compressed text files, or did I miss something? – Fematich Jul 27 '15 at 10:54
  • The thing you quote in your question is completely unrelated. Yeah it's a java.util.zip.ZipException, but other than that, this isn't at all related, as you can see from the rest of the stacktraces being completely different. – Nick Jul 27 '15 at 17:35
  • It seems as though this is an internal error to the system, not some fault of your use-case, which sounds simple. If possible, you should open a [public issue tracker](https://code.google.com/p/googleappengine/wiki/FilingIssues?tm=3) defect report which contains all info you've gathered and hopefully enough info to attempt to reproduce the issue. – Nick Jul 27 '15 at 17:40
  • It's possible that either the compressor you used is creating files which the Dataflow implementation can't handle (would be important information in a defect report, and you might want to even attach a file which causes it to fail), or the file is corrupted, or the streaming unzip is experiencing issues. – Nick Jul 27 '15 at 17:41
  • Also be sure you're dealing with .gz files, a single text file gzipped, rather than a .zip file containing a single .txt (see the magic-byte check sketched after these comments). – Nick Jul 27 '15 at 17:47
  • Thanks! It is indeed an internal error for these specific files... You can't do much wrong with "p.apply(TextIO.Read.from([gcs_path]))". However, I assume I'll probably need to submit the issue [at the Dataflow issue tracker on GitHub](https://github.com/GoogleCloudPlatform/DataflowJavaSDK/issues) instead of the App Engine public issue tracker? – Fematich Jul 28 '15 at 19:57
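A quick way to act on the .gz-vs-.zip suggestion above is to inspect the first two bytes of a suspect file: gzip streams start with 0x1f 0x8b, while zip archives start with "PK" (a minimal sketch, with the file path passed as the first argument):

```java
import java.io.DataInputStream;
import java.io.FileInputStream;
import java.io.IOException;

public class MagicBytesCheck {
  public static void main(String[] args) throws IOException {
    try (DataInputStream in = new DataInputStream(new FileInputStream(args[0]))) {
      // Read the first two bytes as a big-endian unsigned short.
      int magic = in.readUnsignedShort();
      if (magic == 0x1f8b) {
        System.out.println("gzip stream (what a .gz file should contain)");
      } else if (magic == 0x504b) {
        System.out.println("zip archive ('PK' header) - not a plain gzip file");
      } else {
        System.out.printf("unrecognised magic bytes: 0x%04x%n", magic);
      }
    }
  }
}
```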

3 Answers


Looking at the code that produces the error message, it seems to be a problem with the zlib library (which is used by the JDK) not supporting the format of the gzip files that you have.

It looks to be the following bug in zlib: Codes for reserved symbols are rejected even if unused.

Unfortunately, there's probably little we can do to help other than to suggest producing these compressed files using another utility.

If you can produce a small example gzip file that we could use to reproduce the issue, we might be able to see whether it is possible to work around it somehow, but I wouldn't rely on this succeeding.
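To check whether a given file trips up the JDK's inflater independently of Dataflow, streaming it through GZIPInputStream locally should reproduce the same ZipException (a minimal sketch, assuming the file path is passed as the first argument):

```java
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.zip.GZIPInputStream;

public class JdkGunzipCheck {
  public static void main(String[] args) throws IOException {
    // Stream the file through the same JDK inflater the Dataflow worker uses;
    // a problematic file should throw the same ZipException here.
    try (BufferedReader reader = new BufferedReader(new InputStreamReader(
            new GZIPInputStream(new FileInputStream(args[0]))))) {
      long lines = 0;
      while (reader.readLine() != null) {
        lines++;
      }
      System.out.println("Decompressed OK, lines read: " + lines);
    }
  }
}
```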

Ivan Tarasov
  • Thanks! Unfortunately, I did not construct the files myself, and they contain very sensitive information... But I'll post the tool that has been used to compress the files once I know more... + try to recreate the error with some dummy data. – Fematich Jul 28 '15 at 19:48
  • Please follow up if you manage to recreate the error with the dummy non-sensitive data, I'd love to try to see if we can work around the problem somehow without affecting performance of the decompression. – Ivan Tarasov Jul 28 '15 at 23:27

This question may be a bit old, but it was the first result in my Google search yesterday for this error:

HIVE_CURSOR_ERROR: too many length or distance symbols

After the tips here, I came to the realization that I had botched the gzip construction of the files I was trying to process. I had two processes writing gzip'd data out to the same output file, and the output files were corrupt because of it. Fixing the processes to write to unique files resolved the issue. I thought this answer might save someone else some time.
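To illustrate the fix, each writing process should own a distinct output file so that two writers never interleave bytes within one gzip stream (a minimal sketch; the file-naming scheme is just an example):

```java
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.util.UUID;
import java.util.zip.GZIPOutputStream;

public class UniqueGzipWriter {
  public static void main(String[] args) throws IOException {
    // A unique suffix per process guarantees no two writers share a file.
    String path = "output-part-" + UUID.randomUUID() + ".gz";
    try (Writer out = new OutputStreamWriter(
            new GZIPOutputStream(new FileOutputStream(path)), StandardCharsets.UTF_8)) {
      out.write("one complete, uncorrupted gzip stream per writer\n");
    }
    System.out.println("Wrote " + path);
  }
}
```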

AaronM

I was getting this error in Spring Boot. I had a main project that used a library project, and I was using Spring Actuator in the main project. Once I removed Spring Actuator, it started working.