My Dataflow job has to download a file from a remote server. I want to save the file on the worker machine so the job doesn't have to keep downloading the same file.
I tried to do this with the `setup` method. However, it seems `setup` is called for each thread, and multiple threads can call `setup` in parallel (I cannot find documentation on this, but in my experience the job tries to write the file data in parallel, which results in malformed data).
Is there a way to perform a one-time setup whenever a worker machine is launched?
I also checked "Apache Beam: DoFn.Setup equivalent in Python SDK", but I believe it focuses on per-thread setup.
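
To make the goal concrete, here is a minimal sketch of the kind of guard that could make a concurrent `setup` safe. It is not a Beam feature, just plain Python: a lock file serializes callers, and the download is written to a temporary file and renamed into place so that no reader ever sees a partial file. The names `ensure_local_copy` and `fetch_bytes` are hypothetical, and the sketch assumes a Unix worker (`fcntl` is not available on Windows) and that all callers share the same local filesystem.

```python
import fcntl
import os
import tempfile
import threading

# Serializes threads inside one process; the lock file below covers other processes.
_process_lock = threading.Lock()


def ensure_local_copy(dest_path, fetch_bytes):
    """Create dest_path at most once per machine and return its path.

    fetch_bytes is a hypothetical zero-argument callable that returns the
    file content as bytes (for example, a wrapper around the real download).
    """
    # Fast path: the file only appears under dest_path after an atomic rename,
    # so if it exists it is complete.
    if os.path.exists(dest_path):
        return dest_path

    with _process_lock:
        with open(dest_path + ".lock", "w") as lock_file:
            # Exclusive advisory lock; released when lock_file is closed.
            fcntl.flock(lock_file, fcntl.LOCK_EX)
            # Re-check: another caller may have finished while we waited.
            if not os.path.exists(dest_path):
                dir_name = os.path.dirname(dest_path) or "."
                fd, tmp_path = tempfile.mkstemp(dir=dir_name)
                with os.fdopen(fd, "wb") as tmp:
                    tmp.write(fetch_bytes())
                # Atomic rename within the same directory.
                os.replace(tmp_path, dest_path)
    return dest_path
```

Under these assumptions, each `setup` call would invoke `ensure_local_copy(path, download_fn)`; only the first caller downloads, and the rest wait and then reuse the file. Whether there is a cleaner, built-in per-worker hook is exactly what I am asking about.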