I've run a pipeline on Google Cloud Dataflow (i.e. with the DataflowRunner), as well as with the DirectRunner on a Unix machine, and it appears to have a 100% success rate.
However, when the same pipeline runs on Windows with the DirectRunner, it occasionally gets completely stuck. If I press Ctrl + C in the Windows CMD, execution then continues perfectly fine.
The freezes can seemingly occur at any step of the pipeline, but they happen much more frequently during a ParDo step that performs an upload to an API, similar to this example. When the freeze happens at this step, pressing Ctrl + C prints the upload responses, meaning the uploads had already completed and the pipeline was stuck for no apparent reason. The problem also occurs when uploading data to a different API, and most of the uploads are successful.
I've tried setting network timeouts and limiting execution to a single worker, with no success.
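For reference, the timeout attempt looked roughly like this (a minimal sketch; applying a process-wide socket default timeout is an assumption about the mechanism used, and the 60-second value is illustrative):

```python
import socket

# Apply a process-wide default timeout so that any socket opened without an
# explicit timeout (e.g. by an HTTP client inside a DoFn) cannot block forever.
socket.setdefaulttimeout(60)  # seconds; illustrative value
```

The single-worker restriction was applied through the DirectRunner's `--direct_num_workers=1` pipeline option.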
For reference, the pipeline is:
data = (
    pipeline
    | 'Read CSV File' >> fileio.MatchFiles(dataflow_options.input_file)
    | fileio.ReadMatches()
    | beam.Reshuffle()
    | beam.FlatMap(
        lambda rf: csv.DictReader(io.TextIOWrapper(rf.open(), encoding='utf-8')))
)
batches = (
    data
    | 'Batch Data' >> beam.transforms.util.BatchElements()
)
transformed = (
    data
    | 'Transform Data' >> beam.Map(transformFn)
)
uploaded = (
    transformed
    | 'Upload Data' >> beam.ParDo(UploadDoFn())
)
What could be the cause of the freezing? Could it be a library incompatibility on Windows? Debug-level logging wasn't particularly helpful either, so I'm unsure how to proceed.
Any help would be appreciated.