Is there any way, in Apache Beam/Dataflow streaming mode (Python SDK), to achieve something like this:

  • Read from Pub/Sub. The message will contain the path to a GCS file [easy to do].
  • Read this file in a parallel manner (ReadAllFromText) [easy to do].
  • Some DoFn transformation operating on single rows [easy to do].
  • Something that will tell me that this file is finished. If the file had 100 rows, then after 100 rows I would like to see e.g. logging.info("File X is over, had 100 rows"). (A skeleton of the first three steps follows this list.)
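
For reference, a minimal sketch of the first three steps as I picture them (the subscription path and the ProcessRow logic are placeholders of mine):

    import apache_beam as beam
    from apache_beam.io import ReadFromPubSub
    from apache_beam.io.textio import ReadAllFromText
    from apache_beam.options.pipeline_options import PipelineOptions, StandardOptions


    class ProcessRow(beam.DoFn):
        """Placeholder for the real per-row logic."""
        def process(self, row):
            yield row.strip()


    options = PipelineOptions()
    options.view_as(StandardOptions).streaming = True

    with beam.Pipeline(options=options) as p:
        (p
         | "ReadPubSub" >> ReadFromPubSub(subscription="projects/my-project/subscriptions/my-sub")
         | "DecodePath" >> beam.Map(lambda msg: msg.decode("utf-8"))  # message body is the GCS path
         | "ReadFile" >> ReadAllFromText()
         | "ProcessRows" >> beam.ParDo(ProcessRow()))
        # step 4 would go here: emit "File X is over, had N rows" once per file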

I am not able to achieve this final point. A poor alternative is to read the whole GCS file with GcsIO.open inside a single DoFn (sketched below), but that approach cannot be parallelized.
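
The non-parallel workaround would look roughly like this (the DoFn knows when the file ends, so it can count rows, but one worker walks the whole file):

    import logging

    import apache_beam as beam
    from apache_beam.io.gcp.gcsio import GcsIO


    class ReadWholeFile(beam.DoFn):
        """Read an entire GCS file inside one DoFn call.

        This knows when the file ends, but a single worker walks the
        whole file, so the per-row work is not parallelized.
        """
        def process(self, gcs_path):
            rows = 0
            with GcsIO().open(gcs_path) as f:
                for line in f:
                    rows += 1
                    yield line.decode("utf-8")
            logging.info("File %s is over, had %d rows", gcs_path, rows)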

Any ideas appreciated.

Edit: The reason for this is to achieve some kind of transactionality. I would like to have control in the flow, for example to:

  • count valid and invalid rows in each file;
  • once all rows of the file have been uploaded somewhere, send a Pub/Sub message to another pipeline (a sketch of the publishing step follows the example), for example:
    {
        "filename": "x",
        "rows_valid": 100,
        "rows_invalid": 44,
        "rows_total": 144
    }
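
Publishing such a message would be the easy part once a per-file summary element exists; a sketch (the topic name is a placeholder, and the Create is a stand-in for the summary I cannot yet produce):

    import json

    import apache_beam as beam
    from apache_beam.io import WriteToPubSub
    from apache_beam.options.pipeline_options import PipelineOptions, StandardOptions

    options = PipelineOptions()
    options.view_as(StandardOptions).streaming = True

    with beam.Pipeline(options=options) as p:
        # stand-in: in the real pipeline this would be one summary dict
        # per finished file, which is exactly the part I cannot produce
        summaries = p | beam.Create([
            {"filename": "x", "rows_valid": 100, "rows_invalid": 44, "rows_total": 144}
        ])
        (summaries
         | "ToJson" >> beam.Map(lambda s: json.dumps(s).encode("utf-8"))
         | "Publish" >> WriteToPubSub(topic="projects/my-project/topics/file-summaries"))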

Is this achievable?
