We have a large number of compressed files stored in a GCS bucket, and I am attempting to bulk decompress them with the provided Bulk Decompress GCS Files Dataflow template. The data is laid out in a timestamped directory hierarchy: YEAR/MONTH/DAY/HOUR/files.txt.gz. Dataflow accepts a wildcard input pattern, e.g. inputFilePattern=gs://source-data/raw/nginx/2019/01/01/*/*.txt.gz. However, the directory structure is flattened on output: all of the files are decompressed into a single directory. Is it possible to maintain the directory hierarchy using the bulk decompressor? If not, is there another possible solution?

Here is the command I'm running:
gcloud dataflow jobs run gregstest \
--gcs-location gs://dataflow-templates/latest/Bulk_Decompress_GCS_Files \
--service-account-email greg@gmeow.com \
--project shopify-data-kernel \
--parameters \
inputFilePattern=gs://source-data/raw/nginx/2019/01/01/*/*.txt.gz,\
outputDirectory=gs://uncompressed-data/uncompressed,\
outputFailureFile=gs://uncompressed-data/failed
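
The only fallback I can think of is to sidestep the template and decompress one HOUR directory at a time, rebuilding the same path under the output bucket. Below is a rough local sketch of that idea (it assumes gsutil and gunzip are available on the machine and reuses the bucket paths from the command above); it preserves the hierarchy but gives up Dataflow's parallelism:

# Non-Dataflow workaround sketch: decompress per HOUR directory so the output
# mirrors the YEAR/MONTH/DAY/HOUR hierarchy. Assumes gsutil and gunzip are
# installed locally; buckets and paths are the same ones used above.
for hour_dir in $(gsutil ls gs://source-data/raw/nginx/2019/01/01/); do
  for gz in $(gsutil ls "${hour_dir}*.txt.gz"); do
    # Strip the source prefix to recover 2019/01/01/HH/files.txt.gz,
    # then drop the .gz suffix for the destination object name.
    rel="${gz#gs://source-data/raw/nginx/}"
    out="gs://uncompressed-data/uncompressed/${rel%.gz}"
    # Stream: download -> decompress -> upload, without touching local disk.
    gsutil cat "$gz" | gunzip | gsutil cp - "$out"
  done
done

I'd much prefer a Dataflow-native approach if one exists, since this runs serially on a single machine.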