
Let's say I have a structure like the following:

(directory structure omitted: a gcp_test/ source root containing test_inner/gcp_pyspark_test2.py, plus runner.py as the entry point)

runner.py is my main entry point that I provide to the Dataproc job. It in turn imports and calls gcp_pyspark_test2.py, which I currently provide under "Additional Python files" when submitting the job. But my actual use case has multiple folder hierarchies. Is there a way to bundle it all up and reference only the top-level directory (the source root)? You can assume that the structure on the Google Storage bucket mirrors the local one. I am having a little difficulty understanding how Dataproc resolves paths. In a local setup you would add the source root to your Python path and then reference everything underneath it from there, but I don't know how to set that up for Dataproc. Any help is deeply appreciated.
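For reference, this is roughly the kind of submission I have in mind, sketched with the google-cloud-dataproc Python client (the project, cluster, region and exact GCS paths below are placeholders, not my real values): pass runner.py as the main file and a single archive of the source root instead of listing every module.

```python
# Sketch only: submit runner.py with the whole source tree bundled as one zip.
# Project, cluster, region and GCS paths are placeholders.
from google.cloud import dataproc_v1

region = "us-central1"
client = dataproc_v1.JobControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

job = {
    "placement": {"cluster_name": "my-cluster"},
    "pyspark_job": {
        "main_python_file_uri": "gs://my-test-bucket/runner.py",
        # Instead of listing every module, pass one archive of the source root:
        "python_file_uris": ["gs://my-test-bucket/gcp_test.zip"],
    },
}

operation = client.submit_job_as_operation(
    request={"project_id": "my-project", "region": region, "job": job}
)
print(operation.result())
```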

UPDATE:

I am also OK with passing the files as python_file_uris. The main issue is that Dataproc does not recognize the directory structure. In this scenario I pass gs://my-test-bucket/gcp_test/test_inner/gcp_pyspark_test2.py as a Python file, but I cannot import it in the runner either as gcp_test.test_inner.gcp_pyspark_test2 or as test_inner.gcp_pyspark_test2.
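As far as I can tell from poking around inside runner.py (a throwaway diagnostic sketch, nothing more), individually passed .py files get staged flat into the job's working directory, so only the bare module name resolves:

```python
# Throwaway diagnostic in runner.py: see where files passed via
# python_file_uris actually end up on the driver.
import os
import sys

print(sys.path)         # the job's staging directory, plus any distributed archives
print(os.listdir("."))  # the .py files appear here flat, without their gs:// directory structure

# So the package-style imports have nothing to resolve against...
# import gcp_test.test_inner.gcp_pyspark_test2   # ModuleNotFoundError
# import test_inner.gcp_pyspark_test2            # ModuleNotFoundError
# ...while the bare module name works:
import gcp_pyspark_test2
```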

UPDATE2:

I was able to get it to work by passing gcp_test.zip as pyfiles and then importing gcp_test.test_inner.gcp_pyspark_test2 in my runner file. This is what's throwing me off. From other answers online, and from how Python imports work, the zip file should be my source root folder, so I should be able to reference test_inner.gcp_pyspark_test2, but that import does not work. I can also see that the zip file is added to the Python path on the cluster: ['/tmp/7c09ec5d-d6cf-4834-b6aa-6a26814e386x', '/tmp/7c09ec5d-d6cf-4834-b6aa-6a26814e386x/gcp_test.zip']
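To be concrete, this is what runner.py currently does with gcp_test.zip on the path (the zip was built by zipping the gcp_test directory itself; the job id in the paths is just whatever Dataproc assigned):

```python
# Current state of runner.py with gcp_test.zip passed as a pyfile.
import sys

print(sys.path)
# ['/tmp/7c09ec5d-d6cf-4834-b6aa-6a26814e386x',
#  '/tmp/7c09ec5d-d6cf-4834-b6aa-6a26814e386x/gcp_test.zip']

import gcp_test.test_inner.gcp_pyspark_test2   # works
# import test_inner.gcp_pyspark_test2          # ModuleNotFoundError -- this is the part I don't get
```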

Can anyone explain what's going on? Thanks.

Fizi
  • How about installing your code as a Python package on all your workers? Your runner.py would then only need to import the package and run your code. – ma3oun Apr 05 '20 at 14:57
  • So build up the whole thing as a wheel and install it on the cluster? Is there no easier way? I was hoping that I wouldn't have to fiddle around with my cluster or SSH into it. – Fizi Apr 05 '20 at 15:17
  • It's actually considered "best practice". Another way involves using Docker files, but the idea is essentially the same: prepare your environment on the cluster workers and have a very basic runner script. – ma3oun Apr 05 '20 at 15:36
  • Could you point me to some resources online that detail how to do this? I am still new to the cloud and Spark world. Thanks again for your help. – Fizi Apr 05 '20 at 16:23
  • This question appears to have been answered before here: https://stackoverflow.com/questions/53863576 – tix Apr 06 '20 at 00:07
  • @tix Thanks, it did help, but I still have an anomaly that I have updated my question with. If you can shed some light on it that would be great. Thanks. – Fizi Apr 06 '20 at 14:26
  • Do you zip gcp_test or its contents? When you open the zip file, do you see gcp_test? – tix Apr 06 '20 at 16:18
  • I zipped the directory itself, as in when I unzip it I see the directory gcp_test with the content underneath it. Do I need to zip the contents instead? – Fizi Apr 06 '20 at 16:30
  • That seemed to work and it makes sense. The zip file is the root. Thanks!! – Fizi Apr 06 '20 at 16:41
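To recap the resolution from the comments: the zip file itself is the sys.path entry, so whatever sits at the root of the archive plays the role of the source root. Below is a minimal local sketch of the two layouts, using the directory names from the question; it assumes __init__.py files in gcp_test/ and test_inner/, since zipimport works with regular packages. Nothing here is Dataproc-specific.

```python
# Local reproduction of the two ways of building gcp_test.zip.
# Assumed layout on disk:
#   gcp_test/
#       __init__.py
#       test_inner/
#           __init__.py
#           gcp_pyspark_test2.py
import shutil
import sys

# Variant 1: zip the directory itself -> the archive root contains gcp_test/.
shutil.make_archive("with_dir", "zip", root_dir=".", base_dir="gcp_test")
sys.path.insert(0, "with_dir.zip")
import gcp_test.test_inner.gcp_pyspark_test2   # works
# import test_inner.gcp_pyspark_test2          # ModuleNotFoundError

# Variant 2: zip the *contents* of gcp_test -> the archive root contains test_inner/.
shutil.make_archive("contents_only", "zip", root_dir="gcp_test")
sys.path.insert(0, "contents_only.zip")
import test_inner.gcp_pyspark_test2            # now this resolves
```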

0 Answers