Is there a way (or any kind of hack) to read input data from compressed files? My input consists of a few hundred files, which are produced compressed with gzip, and decompressing them is somewhat tedious.

G B

4 Answers


Reading from compressed text sources is now supported in Dataflow (as of this commit). Specifically, files compressed with gzip and bzip2 can be read by specifying the compression type:

TextIO.Read.from(myFileName).withCompressionType(TextIO.CompressionType.GZIP)

However, if the file has a .gz or .bz2 extension, you don't have to do anything: the default compression type is AUTO, which examines file extensions to determine the correct compression type for a file. This even works with globs, where the files that result from the glob may be a mix of .gz, .bz2, and uncompressed files.
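
For illustration, a minimal pipeline using the pre-Beam Dataflow Java SDK might look like the following sketch; the bucket path is a hypothetical placeholder, and AUTO is spelled out even though it is the default:

import com.google.cloud.dataflow.sdk.Pipeline;
import com.google.cloud.dataflow.sdk.io.TextIO;
import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;
import com.google.cloud.dataflow.sdk.values.PCollection;

public class ReadCompressedText {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // AUTO (the default) picks the decompressor per file from its extension,
    // so a glob matching a mix of .gz, .bz2 and plain files is handled.
    PCollection<String> lines = p.apply(
        TextIO.Read.from("gs://my-bucket/input/*")
                   .withCompressionType(TextIO.CompressionType.AUTO));

    // ... apply the rest of the pipeline to 'lines' ...

    p.run();
  }
}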

MattL
  • Thanks, how does it interact with the metadata headers? Do I need to set all the files to be binary, or can I keep them as text/plain? – G B Feb 06 '15 at 22:04
  • It looks like there is an issue when the Content-Encoding header is set, but not with the Content-Type header. If you clear the Content-Encoding header, the read will succeed: `gsutil -m setmeta -h "Content-Encoding:" ` – MattL Feb 06 '15 at 22:44
  • That doesn't seem to work yet (without specifying compression). Should I be waiting for the new SDK, or should it be working already? – G B Feb 08 '15 at 21:00
  • As of yesterday, you should not have to change the Content-Encoding header. Dataflow will read from files that have the Content-Encoding metadata set as well as files that don't. However, if files do not have a .gz extension, you currently need to explicitly set the compression type to gzip. You'll need to grab the latest version of the SDK. – MattL Feb 13 '15 at 19:15
  • I'm getting "Not in GZIP format" errors if the Content-Encoding header is still "gzip" and the content type is text/plain. It seems to work fine if the encoding is cleared and the content type is binary. – G B Feb 23 '15 at 14:36

The slower performance with my workaround was most likely because Dataflow was putting most of the files in the same split, so they weren't being processed in parallel. You can try the following to speed things up (a sketch of the resulting pipeline follows the list).

  • Create a PCollection for each file by applying the Create transform multiple times (each time to a single file).
  • Use the Flatten transform to merge the per-file PCollections into a single PCollection containing all the files.
  • Apply your pipeline to this PCollection.
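
A sketch of this fan-out pattern against the pre-Beam Dataflow Java SDK is below. The file list is a placeholder, and ReadAndDecompressFn is a hypothetical stand-in for whatever per-file read/decompress logic the original workaround used:

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

import com.google.cloud.dataflow.sdk.Pipeline;
import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;
import com.google.cloud.dataflow.sdk.transforms.Create;
import com.google.cloud.dataflow.sdk.transforms.DoFn;
import com.google.cloud.dataflow.sdk.transforms.Flatten;
import com.google.cloud.dataflow.sdk.transforms.ParDo;
import com.google.cloud.dataflow.sdk.values.PCollection;
import com.google.cloud.dataflow.sdk.values.PCollectionList;

public class PerFileFanOut {

  // Hypothetical DoFn: opens the given path, decompresses it, emits lines.
  static class ReadAndDecompressFn extends DoFn<String, String> {
    @Override
    public void processElement(ProcessContext c) throws Exception {
      String path = c.element();
      // Open 'path', wrap the stream in a GZIPInputStream,
      // and c.output() each decompressed line.
    }
  }

  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    List<String> files = Arrays.asList(
        "gs://my-bucket/input/part-0001.gz",
        "gs://my-bucket/input/part-0002.gz");

    // One Create per file, so each file becomes its own PCollection
    // (and therefore its own unit of work).
    List<PCollection<String>> perFile = new ArrayList<>();
    for (String file : files) {
      perFile.add(p.apply(Create.of(file))
                   .apply(ParDo.of(new ReadAndDecompressFn())));
    }

    // Flatten the per-file PCollections back into a single PCollection.
    PCollection<String> lines =
        PCollectionList.of(perFile).apply(Flatten.<String>pCollections());

    // ... apply the rest of the pipeline to 'lines' ...
    p.run();
  }
}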
Jeremy Lewi

I also found that for files that reside in Google Cloud Storage, setting the content type and content encoding appears to "just work" without the need for a workaround.

Specifically, I run:

gsutil -m setmeta -h "Content-Encoding:gzip" -h "Content-Type:text/plain" <path>
G B
  • This makes me nervous because it will likely interfere with our logic for splitting files into work units. As I recall from another thread, you have a bunch of small files, which might be why it works. I don't think this is a good general solution. We're working on that right now though. – Frances Jan 06 '15 at 19:18
  • The largest files are around 7MB when compressed and ~60MB uncompressed. – G B Jan 06 '15 at 21:20
  • Yeah, given the number of files you have, we likely aren't trying to split those. (I can confirm if you send me a job id.) – Frances Jan 07 '15 at 22:04
  • As a note, using this meta tag might cause data loss, since GCS would report the size of the compressed file, and dataflow/beam would read uncompressed data. Please avoid it. – Pablo Feb 14 '17 at 17:39
  • @Pablo Is your comment still applicable? There is contradictory information in Google's official documentation: https://cloud.google.com/storage/docs/transcoding. I am actually seeing differing results in Apache Beam 2.0 between compressed data and the same data uncompressed when run on DataflowRunner... – Guille Jun 22 '17 at 10:20

I just noticed that specifying the compression type is now available in the latest version of the SDK (v0.3.150210). I've tested it, and was able to load my GZ files directly from GCS to BQ without any problems.
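
For reference, a rough sketch of that kind of GCS-to-BigQuery load with the pre-Beam Dataflow Java SDK is below; the bucket path, table name, and one-column schema are hypothetical placeholders:

import java.util.Collections;

import com.google.api.services.bigquery.model.TableFieldSchema;
import com.google.api.services.bigquery.model.TableRow;
import com.google.api.services.bigquery.model.TableSchema;
import com.google.cloud.dataflow.sdk.Pipeline;
import com.google.cloud.dataflow.sdk.io.BigQueryIO;
import com.google.cloud.dataflow.sdk.io.TextIO;
import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;
import com.google.cloud.dataflow.sdk.transforms.DoFn;
import com.google.cloud.dataflow.sdk.transforms.ParDo;

public class GzToBigQuery {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // Hypothetical single-column schema: one row per decompressed line.
    TableSchema schema = new TableSchema().setFields(Collections.singletonList(
        new TableFieldSchema().setName("line").setType("STRING")));

    p.apply(TextIO.Read
            .from("gs://my-bucket/input/*.gz")
            .withCompressionType(TextIO.CompressionType.GZIP))  // explicit; AUTO also works for .gz
     .apply(ParDo.of(new DoFn<String, TableRow>() {
        @Override
        public void processElement(ProcessContext c) {
          // Trivial mapping from a line of text to a BigQuery row.
          c.output(new TableRow().set("line", c.element()));
        }
     }))
     .apply(BigQueryIO.Write
            .to("my-project:my_dataset.my_table")
            .withSchema(schema));

    p.run();
  }
}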

Graham Polley