
My use case is that I am batch-processing files in a bucket that is constantly being updated with new files. I don't want to process CSV files that have already been processed.

Is there a way to do that?

One potential solution I thought of is to have a text file that maintains a list of processed files, and then read all CSV files excluding the files in the processed list. Is that possible?

Or is it possible to read a list of specific files?
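The processed-list idea above can be sketched in plain Python. This is just an illustration of the bookkeeping, not a pipeline: a local directory stands in for the bucket, and `processed_files.txt` is a hypothetical name for the tracking file.

```python
from pathlib import Path


def load_processed(list_path: Path) -> set:
    """Return the set of filenames already processed (empty if no list exists yet)."""
    if not list_path.exists():
        return set()
    return set(list_path.read_text().splitlines())


def unprocessed_csvs(bucket_dir: Path, processed: set):
    """Yield CSV files in bucket_dir that are not in the processed set."""
    for p in sorted(bucket_dir.glob("*.csv")):
        if p.name not in processed:
            yield p


def mark_processed(list_path: Path, names):
    """Append newly processed filenames to the tracking file."""
    with list_path.open("a") as f:
        for name in names:
            f.write(name + "\n")
```

Each batch run would load the list, read only the files it yields, and append their names afterwards, so reruns skip anything already handled.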

agsolid
    This is supported in Beam Java starting with 2.2 - see https://stackoverflow.com/questions/47896488/watching-for-new-files-matching-a-filepattern-in-apache-beam/47896489#47896489 – jkff Dec 19 '17 at 23:09

1 Answer


There's not a good built-in way to do this, but you can have one stage of your pipeline compute the list of files to read, as you suggested, then use a DoFn that maps a filename to the contents of the file. See Reading multiple .gz file and identifying which row belongs to which file for information about how to write this DoFn.
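The DoFn described here essentially maps a filename to that file's rows. Below is a minimal sketch of that core logic in plain Python; in a real pipeline this would live inside a Beam `DoFn`'s `process` method, and you would open the file through Beam's filesystem APIs rather than `open` so it works against a bucket.

```python
import csv


def read_csv_rows(filename: str):
    """Core of the filename -> contents mapping: yield (filename, row) pairs
    so downstream stages can tell which file each row came from."""
    with open(filename, newline="") as f:
        for row in csv.reader(f):
            yield (filename, row)
```

Emitting the filename alongside each row is what lets a later stage attribute rows to their source file, mirroring the linked answer's approach.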

danielm