
I have 20 CSV files in the same bucket. I am able to read all the files in one go and load them into BigQuery. But when there is a data type mismatch, I am able to route that row to invalidDataTag, whereas I am unable to find the name of the file that contains the error record.

inputFilePattern is gs://bucket-name/*, which picks up all the files present under the bucket, and I am reading the files as below:

PCollection<String> sourceData = pipeline.apply(Constants.READ_CSV_STAGE_NAME, TextIO.read().from(options.getInputFilePattern()));

Is there a way I can find the name of the file that contains the error row?

raj
  • As far as I know only using FileIO you can get the metadata of the file, which then you can access when you open a file in a ParDo. – Saransh Jun 11 '22 at 18:33
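As the comment suggests, one way to keep the filename is to replace TextIO.read() with FileIO.match() plus FileIO.readMatches(), then open each file in a ParDo where the metadata is available. The sketch below (class and transform names are illustrative, and it assumes a recent Beam 2.x SDK; note that readFullyAsUTF8String() loads each whole file into memory, which is fine for modest CSV files) emits each line paired with the file it came from:

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.FileIO;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;

public class ReadCsvWithFilename {

  /** Splits a file's contents into lines; pulled out as a plain helper. */
  static String[] splitLines(String contents) {
    return contents.split("\r?\n");
  }

  /** Opens each matched file and emits (filename, line) pairs. */
  static class ToLinesWithFilename extends DoFn<FileIO.ReadableFile, KV<String, String>> {
    @ProcessElement
    public void process(@Element FileIO.ReadableFile file,
                        OutputReceiver<KV<String, String>> out) throws Exception {
      // The resource id carries the full gs://bucket/path of the file.
      String filename = file.getMetadata().resourceId().toString();
      for (String line : splitLines(file.readFullyAsUTF8String())) {
        out.output(KV.of(filename, line));
      }
    }
  }

  static PCollection<KV<String, String>> read(Pipeline pipeline, String pattern) {
    return pipeline
        .apply("MatchFiles", FileIO.match().filepattern(pattern))
        .apply("ReadMatches", FileIO.readMatches())
        .apply("ToLines", ParDo.of(new ToLinesWithFilename()));
  }
}
```

Downstream, the key of each KV tells you which file any error row came from.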

1 Answer


My suggestion would be to add a column to the BigQuery table that indicates which file the record came from.

Kenn Knowles
  • Thanks for your response and could you please help us with sample code the approach ? – raj Jun 09 '22 at 05:54
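A rough sketch of that approach, assuming the lines arrive as (filename, line) pairs: parse each line into a TableRow, add the filename as an extra column, and send unparseable rows, still paired with their filename, to the invalid-data side output. The schema here (id, name, source_file) is purely hypothetical; adapt it to your actual table:

```java
import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.TupleTag;

public class TagRowsWithSource {
  static final TupleTag<TableRow> VALID_TAG = new TupleTag<TableRow>() {};
  static final TupleTag<KV<String, String>> INVALID_TAG =
      new TupleTag<KV<String, String>>() {};

  /** Builds a TableRow from a CSV line, adding the source filename as a column. */
  static TableRow toRow(String filename, String line) {
    String[] parts = line.split(",", -1);
    return new TableRow()
        .set("id", Long.parseLong(parts[0]))  // throws on a type mismatch
        .set("name", parts[1])
        .set("source_file", filename);        // the extra column suggested above
  }

  static class ParseFn extends DoFn<KV<String, String>, TableRow> {
    @ProcessElement
    public void process(@Element KV<String, String> fileAndLine,
                        MultiOutputReceiver out) {
      try {
        out.get(VALID_TAG).output(toRow(fileAndLine.getKey(), fileAndLine.getValue()));
      } catch (Exception e) {
        // The filename travels with the bad row into the side output.
        out.get(INVALID_TAG).output(fileAndLine);
      }
    }
  }
}
```

You would apply ParseFn with ParDo.of(...).withOutputTags(VALID_TAG, TupleTagList.of(INVALID_TAG)) and write the valid output to BigQuery as usual.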