
My use case is that I am batch-processing files in a bucket that is constantly being updated with new files. I don't want to process CSV files that have already been processed.

Is there a way to do that?

One potential solution I thought of is to have a text file that maintains a list of processed files, and then read all CSV files excluding the files in the processed list. Is that possible?

Or is it possible to read a list of specific files?
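The processed-list idea above can be sketched in plain Python. This is just an illustration of the bookkeeping, not a pipeline: a local directory stands in for the bucket, and `processed_files.txt` is a hypothetical name for the tracking file.

```python
from pathlib import Path


def load_processed(list_path: Path) -> set:
    """Return the set of filenames already processed (empty if no list exists yet)."""
    if not list_path.exists():
        return set()
    return set(list_path.read_text().splitlines())


def unprocessed_csvs(bucket_dir: Path, processed: set):
    """Yield CSV files in bucket_dir that are not in the processed set."""
    for p in sorted(bucket_dir.glob("*.csv")):
        if p.name not in processed:
            yield p


def mark_processed(list_path: Path, names):
    """Append newly processed filenames to the tracking file."""
    with list_path.open("a") as f:
        for name in names:
            f.write(name + "\n")
```

Each batch run would load the list, read only the files it yields, and append their names afterwards, so reruns skip anything already handled.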

agsolid
    This is supported in Beam Java starting with 2.2 - see https://stackoverflow.com/questions/47896488/watching-for-new-files-matching-a-filepattern-in-apache-beam/47896489#47896489 – jkff Dec 19 '17 at 23:09

1 Answer


There's not a good built-in way to do this, but you can have one stage of your pipeline compute the list of files to read, as you suggested, then use a DoFn that maps a filename to the contents of the file. See Reading multiple .gz file and identifying which row belongs to which file for information about how to write this DoFn.
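The DoFn described here essentially maps a filename to that file's rows. Below is a minimal sketch of that core logic in plain Python; in a real pipeline this would live inside a Beam `DoFn`'s `process` method, and you would open the file through Beam's filesystem APIs rather than `open` so it works against a bucket.

```python
import csv


def read_csv_rows(filename: str):
    """Core of the filename -> contents mapping: yield (filename, row) pairs
    so downstream stages can tell which file each row came from."""
    with open(filename, newline="") as f:
        for row in csv.reader(f):
            yield (filename, row)
```

Emitting the filename alongside each row is what lets a later stage attribute rows to their source file, mirroring the linked answer's approach.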

danielm