
I've tried running a pipeline on Google Cloud Dataflow (i.e. with DataflowRunner), as well as with the DirectRunner on a Unix machine, and it has a 100% success rate.

However, when running the same pipeline on Windows with the DirectRunner, it occasionally gets completely stuck. If I press Ctrl + C in the Windows CMD, execution continues perfectly fine.

The freezes can seemingly occur at any step of the pipeline, but they happen much more frequently during a ParDo that uploads to an API, similar to this example. When the freeze happens at this step, pressing Ctrl + C prints the upload responses, meaning the uploads had already completed and the pipeline was stuck for no apparent reason. The problem also occurs when uploading to a different API. Most of the uploads are successful.

I've tried setting network timeouts and limiting the execution to a single worker, with no success.
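For reference, this is roughly how I limited execution to a single worker. The `direct_num_workers` and `direct_running_mode` flags are standard DirectRunner options; the surrounding wiring is just a sketch:

```python
# Sketch: pinning the DirectRunner to a single in-memory worker.
# These flag names are real Beam DirectRunner options; the list is
# normally passed to PipelineOptions(pipeline_args).
pipeline_args = [
    "--runner=DirectRunner",
    "--direct_num_workers=1",
    "--direct_running_mode=in_memory",
]
```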

For reference, the pipeline is:

data = (
        pipeline
        | 'Read CSV File' >>
            fileio.MatchFiles(dataflow_options.input_file)
        | fileio.ReadMatches()
        | beam.Reshuffle()
        | beam.FlatMap(
            lambda rf: csv.DictReader(io.TextIOWrapper(rf.open(), encoding='utf-8')))   
)

batches = (
        data
        | 'Batch Data' >>
            beam.util.BatchElements()
    )

transformed = (
        data
        | 'Transform Data' >>
            beam.Map(transformFn)
    )

uploaded = (
        transformed
        | 'Upload Data' >>
            beam.ParDo(UploadDoFn())
    )
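The actual DoFn is specific to our API, but the upload step inside `UploadDoFn.process` looks roughly like this sketch. `API_URL` and the helper names are placeholders, and the explicit `timeout` is one of the network timeouts I mentioned trying:

```python
import json
import urllib.request

API_URL = "https://example.com/upload"  # hypothetical endpoint

def build_upload_request(record):
    # Serialize one element to JSON for the (hypothetical) API.
    return urllib.request.Request(
        API_URL,
        data=json.dumps(record).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def upload_record(record, timeout=30):
    # The explicit timeout makes a stalled socket raise
    # instead of hanging the bundle indefinitely.
    with urllib.request.urlopen(build_upload_request(record), timeout=timeout) as resp:
        return resp.status
```

Inside the DoFn, `process` simply calls `upload_record(element)` and yields the response status.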

What could be the cause of the freezing? Could it be a library incompatibility on Windows? The logging library in debug mode wasn't particularly helpful, so I'm unsure how to proceed.

Any help would be appreciated.

Nivaldo T

1 Answer


I found a solution. It turns out the Windows CMD was at fault, not Beam: Why is my command prompt freezing on Windows 10?
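Specifically (if I understand the linked answer correctly), the culprit is the console's QuickEdit mode, which pauses a program's output whenever text is selected in the window. It can be turned off in the CMD window's properties, or programmatically. Here is an untested sketch using `ctypes`; the constants are the documented Win32 console-mode flags, and the function is a no-op on non-Windows platforms:

```python
import ctypes
import sys

# Documented Win32 console input-mode flags.
ENABLE_QUICK_EDIT_MODE = 0x0040
ENABLE_EXTENDED_FLAGS = 0x0080
STD_INPUT_HANDLE = -10

def disable_quick_edit():
    """Clear QuickEdit mode on the current console.

    Returns True on success; False on failure or on non-Windows
    platforms, where this is a no-op.
    """
    if sys.platform != "win32":
        return False
    kernel32 = ctypes.windll.kernel32
    handle = kernel32.GetStdHandle(STD_INPUT_HANDLE)
    mode = ctypes.c_uint32()
    if not kernel32.GetConsoleMode(handle, ctypes.byref(mode)):
        return False
    # Extended flags must be set for the QuickEdit bit to take effect.
    new_mode = (mode.value | ENABLE_EXTENDED_FLAGS) & ~ENABLE_QUICK_EDIT_MODE
    return bool(kernel32.SetConsoleMode(handle, new_mode))
```

Calling `disable_quick_edit()` at pipeline start-up should prevent an accidental click from pausing the run.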

Nivaldo T