
I'm trying to use Google Cloud Dataflow to read data from GCS and load it into BigQuery tables; however, the files in GCS are compressed (gzip). Is there any class that can be used to read data from compressed/gzipped files?

  • Does this answer your question? [Reading from compressed files in Dataflow](https://stackoverflow.com/questions/27733741/reading-from-compressed-files-in-dataflow) – Mark Rotteveel Aug 27 '21 at 09:59

1 Answer


Reading from compressed text sources is now supported in Dataflow (as of this commit). Specifically, files compressed with gzip and bzip2 can be read by specifying the compression type:

TextIO.Read.from(myFileName).withCompressionType(TextIO.CompressionType.GZIP)

However, if the file has a .gz or .bz2 extension, you don't have to do anything: the default compression type is AUTO, which examines file extensions to determine the correct compression type for a file. This even works with globs, where the files that result from the glob may be a mix of .gz, .bz2, and uncompressed.
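
For context, here is a minimal end-to-end sketch of the GCS-to-BigQuery pipeline the question describes, written against the pre-Beam Dataflow SDK (1.x) that the `TextIO.Read.from(...).withCompressionType(...)` call above comes from. The bucket path, table name, schema, and the assumption that each line is a two-field CSV record are all hypothetical placeholders:

```java
import com.google.api.services.bigquery.model.TableFieldSchema;
import com.google.api.services.bigquery.model.TableRow;
import com.google.api.services.bigquery.model.TableSchema;
import com.google.cloud.dataflow.sdk.Pipeline;
import com.google.cloud.dataflow.sdk.io.BigQueryIO;
import com.google.cloud.dataflow.sdk.io.TextIO;
import com.google.cloud.dataflow.sdk.options.DataflowPipelineOptions;
import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;
import com.google.cloud.dataflow.sdk.transforms.DoFn;
import com.google.cloud.dataflow.sdk.transforms.ParDo;

import java.util.Arrays;

public class GzipGcsToBigQuery {
  public static void main(String[] args) {
    DataflowPipelineOptions options = PipelineOptionsFactory
        .fromArgs(args).withValidation().as(DataflowPipelineOptions.class);
    Pipeline p = Pipeline.create(options);

    // Hypothetical schema: one STRING and one INTEGER column.
    TableSchema schema = new TableSchema().setFields(Arrays.asList(
        new TableFieldSchema().setName("name").setType("STRING"),
        new TableFieldSchema().setName("value").setType("INTEGER")));

    p.apply(TextIO.Read
            .from("gs://my-bucket/input/*.gz")                  // glob over gzipped files
            .withCompressionType(TextIO.CompressionType.AUTO))  // AUTO is the default
     .apply(ParDo.of(new DoFn<String, TableRow>() {
       @Override
       public void processElement(ProcessContext c) {
         // Hypothetical parsing: each line is "name,value".
         String[] fields = c.element().split(",", 2);
         c.output(new TableRow()
             .set("name", fields[0])
             .set("value", Integer.parseInt(fields[1])));
       }
     }))
     .apply(BigQueryIO.Write
         .to("my-project:my_dataset.my_table")
         .withSchema(schema)
         .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
         .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));

    p.run();
  }
}
```

Run it with your usual pipeline flags, e.g. `--runner=DataflowPipelineRunner --project=<your-project> --stagingLocation=gs://<your-bucket>/staging` (or the default DirectPipelineRunner for a local test).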

MattL
  • Great, thanks for the update! What about performance: is there any performance impact when reading from compressed data? – Echo Feb 09 '15 at 19:32
  • No worries! The largest performance impact is that a compressed text file will not be automatically split and read by multiple workers in parallel. Reads from many files will be parallelized, but the smallest unit of work is a single file. Unfortunately, I don't have any benchmark data or numbers to share right now. Hope this helps! – MattL Feb 13 '15 at 19:12
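
If the input arrives as one large gzipped file, a workaround consistent with the comment above (not something stated in the answer itself) is to split it into many smaller gzipped shards before running the pipeline, since each matched file becomes its own unit of work:

```java
// Hypothetical sharded layout: many small .gz files matched by a glob.
// Each file is read by a separate worker, restoring read parallelism.
p.apply(TextIO.Read.from("gs://my-bucket/input/shard-*.csv.gz"));
```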