2

Is there any way to get the name of the file being processed when reading from GCS using:

p.apply("Read from GCS", TextIO.read().from("gs://path/*"))

I need the filename in the next ParDo so that I can store the output in the appropriate table.

This question is similar to "How to Get Filename when using file pattern match in google-cloud-dataflow", but that one was last updated more than a year ago, so I'm wondering whether new functionality now enables this.

Moy

1 Answer

1

You can't do this with TextIO itself, but Beam 2.2 includes transforms that let you customize almost any part of reading files: FileIO.match() and FileIO.readMatches(). See this answer. You'll need a DoFn<ReadableFile, String> that parses the text files using regular Java facilities (as demonstrated in that answer) and gets the filename from the ReadableFile's getMetadata().
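A minimal sketch of that approach, assuming Beam 2.2 or later with the GCP I/O module on the classpath so that gs:// paths resolve. The class name ReadWithFilename and the DoFn name LinesWithFilenameFn are made up for illustration, and the DoFn emits KV<String, String> (filename, line) rather than plain String so that the next ParDo can see which file each line came from:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.channels.Channels;
import java.nio.charset.StandardCharsets;

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.FileIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;

// Hypothetical class name, used only for this example.
public class ReadWithFilename {

  /** Emits one (filename, line) pair for every line of each matched file. */
  static class LinesWithFilenameFn extends DoFn<FileIO.ReadableFile, KV<String, String>> {
    @ProcessElement
    public void processElement(ProcessContext c) throws IOException {
      FileIO.ReadableFile file = c.element();
      // Full resource path, e.g. gs://bucket/dir/name; use
      // resourceId().getFilename() if only the last path component is needed.
      String filename = file.getMetadata().resourceId().toString();

      // Assumption: the files are UTF-8 text. open() returns a channel over
      // the file contents, which is read here line by line.
      try (BufferedReader reader = new BufferedReader(
          Channels.newReader(file.open(), StandardCharsets.UTF_8.name()))) {
        String line;
        while ((line = reader.readLine()) != null) {
          c.output(KV.of(filename, line));
        }
      }
    }
  }

  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    PCollection<KV<String, String>> linesWithFilename = p
        .apply("Match files", FileIO.match().filepattern("gs://path/*"))
        .apply("Read matches", FileIO.readMatches())
        .apply("Lines with filename", ParDo.of(new LinesWithFilenameFn()));

    // A downstream ParDo can now pick the target table from the KV key.
    p.run().waitUntilFinish();
  }
}
```

Each file is read sequentially inside a single processElement call here, so this sketch trades TextIO's handling of individual files for access to the file metadata.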

jkff
  • Hi, but in that case (reading the file in a custom DoFn using, e.g., an InputStream) there will be no parallelism and no file splitting, am I correct? – Markiza Nov 01 '22 at 11:44