Let's say I have a structure like the following: runner.py is my main entry point that I provide to the Dataproc job, and it in turn imports and calls gcp_pyspark_test2.py, which lives in a subfolder (test_inner) of a top-level folder (gcp_test). Currently I provide gcp_pyspark_test2.py as an "Additional python file" in the job submission, but my actual use case has multiple folder hierarchies. Is there a way to bundle it all up and only reference the top-level directory (the source root)? You can assume the structure on the Google Storage bucket is the same as it is locally. I am having a little difficulty understanding how Dataproc resolves paths. In a local setup you would add the source root to your Python path and then reference everything underneath it from there, but I don't know how to set that up for Dataproc. Any help is deeply appreciated.
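For context, runner.py is just a thin wrapper around the inner module; simplified, it looks roughly like this (run() is a stand-in for whatever gcp_pyspark_test2 actually exposes):

```python
# runner.py -- simplified sketch; run() is a placeholder for the real entry function
from pyspark.sql import SparkSession

# This works in the current setup because gcp_pyspark_test2.py is shipped on its
# own as an "Additional python file", so it lands directly on the driver's path.
import gcp_pyspark_test2

if __name__ == "__main__":
    spark = SparkSession.builder.appName("gcp_pyspark_test").getOrCreate()
    gcp_pyspark_test2.run(spark)
    spark.stop()
```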
UPDATE:
I am also OK with passing the files as python_file_uris. The main issue is that Dataproc does not recognize the structure. In this scenario I am passing gs://my-test-bucket/gcp_test/test_inner/gcp_pyspark_test2.py as a Python file, but I cannot import it in the runner either as gcp_test.test_inner.gcp_pyspark_test2 or as test_inner.gcp_pyspark_test2.
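Concretely, this is roughly how I'm probing the imports at the top of runner.py (both spellings fail on the cluster):

```python
# Simplified debug snippet from runner.py: try both import spellings for the
# single file passed via python_file_uris and show what is on the Python path.
import importlib
import sys

print(sys.path)  # see what Dataproc actually puts on the path

for name in ("gcp_test.test_inner.gcp_pyspark_test2", "test_inner.gcp_pyspark_test2"):
    try:
        importlib.import_module(name)
        print("imported:", name)
    except ImportError as exc:
        print("failed:", name, "->", exc)
```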
UPDATE2:
I was able to get it to work by passing gcp_test.zip as pyfiles and then referencing gcp_test.test_inner.gcp_pyspark_test2 in my runner file. This is what's throwing me off: from other answers online and from how Python works, the zip file should be my source root folder, so I should be able to reference test_inner.gcp_pyspark_test2, but that import does not work. I can also see that the zip file is added to the Python path on the cluster: ['/tmp/7c09ec5d-d6cf-4834-b6aa-6a26814e386x', '/tmp/7c09ec5d-d6cf-4834-b6aa-6a26814e386x/gcp_test.zip'].
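For what it's worth, I can reproduce the same behaviour locally with plain Python, assuming the archive ends up with the gcp_test/ folder itself at its top level and that each folder has an __init__.py:

```python
# Local reproduction sketch -- assumes gcp_test/ (with __init__.py files) sits in
# the current directory, mirroring what I zipped and uploaded to the bucket.
import os
import sys
import zipfile

# Build a zip whose top level contains the gcp_test/ folder itself.
with zipfile.ZipFile("gcp_test.zip", "w") as zf:
    for root, _dirs, files in os.walk("gcp_test"):
        for name in files:
            path = os.path.join(root, name)
            zf.write(path, arcname=path)  # keeps the gcp_test/ prefix inside the archive

sys.path.insert(0, "gcp_test.zip")

import gcp_test.test_inner.gcp_pyspark_test2   # resolves: gcp_test/ is at the zip root
# import test_inner.gcp_pyspark_test2          # fails: test_inner/ is not at the zip root
```

If that is really what's in the archive, the zip root (not the gcp_test folder inside it) acts as the source root and gcp_test becomes the top-level package, which matches what I'm seeing.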
Can anyone explain what's going on? Thanks.