
I have 20 CSV files in the same bucket. I am able to read all the files in one go and load them into BigQuery. But when there is a data type mismatch, I am able to route that row to invalidDataTag, whereas I am unable to find the name of the file that contains the error record.

inputFilePattern is gs://bucket-name/*, which picks up all the files present under the bucket, and I am reading the files as below:

PCollection<String> sourceData = pipeline.apply(Constants.READ_CSV_STAGE_NAME, TextIO.read().from(options.getInputFilePattern()));

Is there a way I can find the name of the file that contains the error row?

raj
  • As far as I know only using FileIO you can get the metadata of the file, which then you can access when you open a file in a ParDo. – Saransh Jun 11 '22 at 18:33
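As the comment suggests, one way to keep the filename is to replace TextIO.read() with FileIO.match() plus FileIO.readMatches(), then open each file in a ParDo where the metadata is available. The sketch below (class and transform names are illustrative, and it assumes a recent Beam 2.x SDK; note that readFullyAsUTF8String() loads each whole file into memory, which is fine for modest CSV files) emits each line paired with the file it came from:

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.FileIO;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;

public class ReadCsvWithFilename {

  /** Splits a file's contents into lines; pulled out as a plain helper. */
  static String[] splitLines(String contents) {
    return contents.split("\r?\n");
  }

  /** Opens each matched file and emits (filename, line) pairs. */
  static class ToLinesWithFilename extends DoFn<FileIO.ReadableFile, KV<String, String>> {
    @ProcessElement
    public void process(@Element FileIO.ReadableFile file,
                        OutputReceiver<KV<String, String>> out) throws Exception {
      // The resource id carries the full gs://bucket/path of the file.
      String filename = file.getMetadata().resourceId().toString();
      for (String line : splitLines(file.readFullyAsUTF8String())) {
        out.output(KV.of(filename, line));
      }
    }
  }

  static PCollection<KV<String, String>> read(Pipeline pipeline, String pattern) {
    return pipeline
        .apply("MatchFiles", FileIO.match().filepattern(pattern))
        .apply("ReadMatches", FileIO.readMatches())
        .apply("ToLines", ParDo.of(new ToLinesWithFilename()));
  }
}
```

Downstream, the key of each KV tells you which file any error row came from.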

1 Answer


My suggestion would be to add a column to the BigQuery table that indicates which file the record came from.

Kenn Knowles
  • Thanks for your response and could you please help us with sample code the approach ? – raj Jun 09 '22 at 05:54
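A rough sketch of that approach, assuming the lines arrive as (filename, line) pairs: parse each line into a TableRow, add the filename as an extra column, and send unparseable rows, still paired with their filename, to the invalid-data side output. The schema here (id, name, source_file) is purely hypothetical; adapt it to your actual table:

```java
import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.TupleTag;

public class TagRowsWithSource {
  static final TupleTag<TableRow> VALID_TAG = new TupleTag<TableRow>() {};
  static final TupleTag<KV<String, String>> INVALID_TAG =
      new TupleTag<KV<String, String>>() {};

  /** Builds a TableRow from a CSV line, adding the source filename as a column. */
  static TableRow toRow(String filename, String line) {
    String[] parts = line.split(",", -1);
    return new TableRow()
        .set("id", Long.parseLong(parts[0]))  // throws on a type mismatch
        .set("name", parts[1])
        .set("source_file", filename);        // the extra column suggested above
  }

  static class ParseFn extends DoFn<KV<String, String>, TableRow> {
    @ProcessElement
    public void process(@Element KV<String, String> fileAndLine,
                        MultiOutputReceiver out) {
      try {
        out.get(VALID_TAG).output(toRow(fileAndLine.getKey(), fileAndLine.getValue()));
      } catch (Exception e) {
        // The filename travels with the bad row into the side output.
        out.get(INVALID_TAG).output(fileAndLine);
      }
    }
  }
}
```

You would apply ParseFn with ParDo.of(...).withOutputTags(VALID_TAG, TupleTagList.of(INVALID_TAG)) and write the valid output to BigQuery as usual.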