I am processing large JSON files (roughly 10 GB uncompressed / 2 GB gzip-compressed) and want to optimize my pipeline.
According to the official docs, Google Cloud Storage (GCS) can perform decompressive transcoding of gzip files, meaning the application receives them uncompressed, provided the objects are tagged with the correct metadata. Google Cloud Dataflow (GCDF) achieves better parallelism when reading uncompressed files, so I was wondering: does setting this metadata on GCS actually have a positive effect on Dataflow performance?
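For reference, this is how I am tagging the objects. This is a sketch using `gsutil`; the bucket and object names are placeholders for my actual paths:

```shell
# Tag a gzip-compressed object so GCS can serve it via decompressive
# transcoding: Content-Encoding marks the stored bytes as gzip, and
# Content-Type describes the underlying (decompressed) data.
# "my-bucket" and the object path are placeholders.
gsutil setmeta \
  -h "Content-Encoding:gzip" \
  -h "Content-Type:application/json" \
  gs://my-bucket/input/data.json.gz
```
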
Since my input files are relatively large, does it make sense to decompress them up front so that Dataflow can split them into smaller chunks?