I am processing large JSON files (roughly 10 GB uncompressed / 2 GB gzip-compressed) and want to optimize my pipeline.
According to the official docs, Google Cloud Storage (GCS) can perform decompressive transcoding of gzip files, meaning the application receives them uncompressed, provided the objects are tagged with the correct metadata. Google Cloud Dataflow (GCDF) achieves better parallelism when reading uncompressed files, so I was wondering: does setting this metadata on GCS actually have a positive effect on Dataflow performance?
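For reference, this is how I am tagging the objects. This is a sketch using `gsutil`; the bucket and object names are placeholders for my actual paths:

```shell
# Tag a gzip-compressed object so GCS can serve it via decompressive
# transcoding: Content-Encoding marks the stored bytes as gzip, and
# Content-Type describes the underlying (decompressed) data.
# "my-bucket" and the object path are placeholders.
gsutil setmeta \
  -h "Content-Encoding:gzip" \
  -h "Content-Type:application/json" \
  gs://my-bucket/input/data.json.gz
```
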
Since my input files are relatively large, does it make sense to decompress them up front so that Dataflow can split them into smaller chunks?