
Has anything changed recently with the way Google Dataflow reads compressed files from Google Cloud Storage? I am working on a project that reads compressed CSV log files from GCS and uses these files as the source for a Dataflow pipeline. Until recently this worked perfectly, both with and without specifying the compression type of the file.

Currently the processElement method in my DoFn is only called once (for the CSV header row), although the file has many rows. If I use the same source file uncompressed then everything works as expected (the processElement method is called for every row). As suggested in https://stackoverflow.com/a/27775968/6142412, setting the Content-Encoding to gzip does work, but I did not have to do this previously.

I am experiencing this issue when using either DirectPipelineRunner or DataflowPipelineRunner. I am using version 1.5.0 of the Cloud Dataflow SDK.
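
For reference, this is roughly how the pipeline reads the files (the class name, bucket path, and transform name below are placeholders):

```java
import com.google.cloud.dataflow.sdk.Pipeline;
import com.google.cloud.dataflow.sdk.io.TextIO;
import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;
import com.google.cloud.dataflow.sdk.transforms.DoFn;
import com.google.cloud.dataflow.sdk.transforms.ParDo;

public class ReadGzippedLogs {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    String fileName = "gs://my-bucket/logs/xyz.gz"; // placeholder path

    p.apply(TextIO.Read.named(String.format("Read %s", fileName))
            .from(fileName)
            .withCompressionType(TextIO.CompressionType.GZIP))
     .apply(ParDo.of(new DoFn<String, String>() {
       @Override
       public void processElement(ProcessContext c) {
         // Expected once per line; currently only the header row arrives here.
         c.output(c.element());
       }
     }));

    p.run();
  }
}
```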

Davin
  • Sorry for the trouble. Can you clarify what you're using to read the input (is it TextIO? Do you set any parameters on it except the filename?) and give an example misbehaving job ID? – jkff Mar 31 '16 at 23:50
  • Also please clarify how you're detecting that your DoFn is only called for the header row? – jkff Apr 01 '16 at 00:38
  • @jkff To detect the DoFn processElement calls I have added breakpoints in the processElement method when debugging with DirectPipelineRunner. I am using TextIO to read the input: `TextIO.Read.named(String.format("Read %s", fileName)).from(fileName).withCompressionType(TextIO.CompressionType.GZIP)` – Davin Apr 01 '16 at 00:57
  • What was the content encoding and content type for the files when it didn't work? – Lukasz Cwik Apr 01 '16 at 16:10
  • Also, were the files named yyy.csv or yyy.csv.gz or yyy.gz within GCS? – Lukasz Cwik Apr 01 '16 at 17:18
  • @LukaszCwik When the files did not work the Content-Type was application/octet-stream and the Content-Encoding was empty. Within GCS the files are named xyz.gz – Davin Apr 03 '16 at 21:03

1 Answer


We identified a problem (BEAM-167) with reading concatenated gzip files. It has been fixed in the Apache Beam GitHub repository by PR 114 and in the Dataflow SDK GitHub repository by PR 180. It will be part of the next release.

Until then, a workaround is to use the SDK built from GitHub, or to compress the entire file as a single gzip part rather than concatenating separately compressed parts.
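
If you go with the second workaround, here is a minimal sketch (file names are placeholders) of re-compressing a CSV as one gzip stream with java.util.zip; files built by concatenating separately gzipped shards (e.g. `cat part1.gz part2.gz > xyz.gz`) are the ones affected by BEAM-167:

```java
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.zip.GZIPOutputStream;

public class SingleMemberGzip {
  public static void main(String[] args) throws IOException {
    // Compress the whole CSV in one pass so the output contains a single gzip member.
    try (FileInputStream in = new FileInputStream("logs.csv");
         GZIPOutputStream out = new GZIPOutputStream(new FileOutputStream("logs.csv.gz"))) {
      byte[] buffer = new byte[8192];
      int n;
      while ((n = in.read(buffer)) != -1) {
        out.write(buffer, 0, n);
      }
    }
  }
}
```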

Ben Chambers