We have a large number of compressed files stored in a GCS bucket, and I am attempting to bulk decompress them with the provided Bulk Decompress GCS Files Dataflow template. The data is laid out in a timestamped directory hierarchy: YEAR/MONTH/DAY/HOUR/files.txt.gz. Dataflow accepts a wildcard input pattern, e.g. inputFilePattern=gs://source-data/raw/nginx/2019/01/01/*/*.txt.gz. However, the directory structure is flattened on output: all of the files are decompressed into a single directory. Is it possible to maintain the directory hierarchy using the bulk decompressor? If not, is there another possible solution?

Here is the command I'm running:
gcloud dataflow jobs run gregstest \
--gcs-location gs://dataflow-templates/latest/Bulk_Decompress_GCS_Files \
--service-account-email greg@gmeow.com \
--project shopify-data-kernel \
--parameters \
inputFilePattern=gs://source-data/raw/nginx/2019/01/01/*/*.txt.gz,\
outputDirectory=gs://uncompressed-data/uncompressed,\
outputFailureFile=gs://uncompressed-data/failed
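
The only fallback I can think of is to sidestep the template and decompress one HOUR directory at a time, rebuilding the same path under the output bucket. Below is a rough local sketch of that idea (it assumes gsutil and gunzip are available on the machine and reuses the bucket paths from the command above); it preserves the hierarchy but gives up Dataflow's parallelism:

# Non-Dataflow workaround sketch: decompress per HOUR directory so the output
# mirrors the YEAR/MONTH/DAY/HOUR hierarchy. Assumes gsutil and gunzip are
# installed locally; buckets and paths are the same ones used above.
for hour_dir in $(gsutil ls gs://source-data/raw/nginx/2019/01/01/); do
  for gz in $(gsutil ls "${hour_dir}*.txt.gz"); do
    # Strip the source prefix to recover 2019/01/01/HH/files.txt.gz,
    # then drop the .gz suffix for the destination object name.
    rel="${gz#gs://source-data/raw/nginx/}"
    out="gs://uncompressed-data/uncompressed/${rel%.gz}"
    # Stream: download -> decompress -> upload, without touching local disk.
    gsutil cat "$gz" | gunzip | gsutil cp - "$out"
  done
done

I'd much prefer a Dataflow-native approach if one exists, since this runs serially on a single machine.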