
My Dataflow job has to download a file from a remote server. I want to cache the file on the worker machine so the job doesn't keep re-downloading it.

I tried to do this in the setup method, but it seems setup is called for each thread, and multiple threads can call setup in parallel (I can't find documentation on this, but in my experience the job writes the file data concurrently, producing malformed data).

Is there a way to perform one-time setup when a worker machine is launched?

I also checked Apache Beam: DoFn.Setup equivalent in Python SDK, but I believe that focuses on per-thread setup.

Kazuki

1 Answer


The Beam model doesn't include a callback for when a VM is created, because the model doesn't guarantee the runtime environment. However, because Dataflow runs your code in containers, you have two options.

The first gives you direct control over the container image and works for all SDK languages. The second works only for Python.
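Until a per-VM hook exists, another workaround is to make setup itself safe to call from many threads: guard the download with a process-wide lock so only the first caller actually fetches the file, and publish it atomically so a half-written file is never visible. A minimal sketch using only the standard library; `ensure_file_downloaded`, the `fetch` callable, and the cache path are illustrative names, not Beam APIs:

```python
import os
import tempfile
import threading

# Module-level lock: shared by every DoFn thread in this worker process.
_download_lock = threading.Lock()
_cache_path = os.path.join(tempfile.gettempdir(), "remote_file.bin")

def ensure_file_downloaded(fetch, cache_path=_cache_path):
    """Run the download at most once per process, even if many
    threads call it concurrently (e.g. from DoFn.setup)."""
    with _download_lock:                 # serialize the check-and-write
        if not os.path.exists(cache_path):
            data = fetch()               # caller-supplied download function
            tmp = cache_path + ".tmp"
            with open(tmp, "wb") as f:   # write to a temp file first...
                f.write(data)
            os.replace(tmp, cache_path)  # ...then publish atomically
    return cache_path
```

A DoFn's `setup` would then just call `ensure_file_downloaded(self._fetch)`; later threads take the lock, see the file already exists, and return without downloading.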

Cubez
  • Just to confirm: for a custom image on Dataflow, I would have to use a Flex Template, right? https://cloud.google.com/dataflow/docs/guides/templates/using-flex-templates It has its own quirks (I posted some other questions about this) and I haven't had success yet, unfortunately – Kazuki Nov 24 '20 at 01:11