I have some custom JDBC drivers that I want to use in an application. I include them via --py-files when I spark-submit to a Kubernetes Spark cluster:
spark-submit --py-files s3a://bucket/pyfiles/pyspark_jdbc.zip my_application.py
This gives me:
java.io.FileNotFoundException: File file:/opt/spark/work-dir/pyspark_jdbc.zip does not exist
As other answers have told me, I need to actually add that zip file to the PYTHONPATH. I find this is no longer strictly true as of Spark 2.3+, but let's do it anyway with:
spark.sparkContext.addPyFile("pyspark_jdbc.zip")
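For context, a minimal sketch of where that call sits in my_application.py (the session setup here is illustrative; only the addPyFile call is the part in question):

from pyspark.sql import SparkSession

# Illustrative driver-side setup; the relevant line is the addPyFile call.
spark = SparkSession.builder.appName("my_application").getOrCreate()

# Ask Spark to distribute the zip and (in theory) make it importable.
spark.sparkContext.addPyFile("pyspark_jdbc.zip")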
Looking into the cluster logs, I see:
19/10/21 22:40:56 INFO Utils: Fetching s3a://bucket/pyfiles/pyspark_jdbc.zip to
/var/data/spark-52e390f5-85f4-41c4-9957-ff79f1433f64/spark-402e0a00-6806-40a7-a17d-5adf39a5c2d4/userFiles-680c1bce-ad5f-4a0b-9160-2c3037eefc29/fetchFileTemp5609787392859819321.tmp
So the py-files were definitely fetched, but into /var/data/... rather than into my working directory. Therefore, when I go to add the location of my .zip file to my Python path, I don't know where it is. Some diagnostics on the cluster right before attempting to add the Python files:
> print(sys.path)
[...,
'/var/data/spark-52e390f5-85f4-41c4-9957-ff79f1433f64/spark-402e0a00-6806-40a7-a17d-5adf39a5c2d4/userFiles-680c1bce-ad5f-4a0b-9160-2c3037eefc29',
'/opt/spark/work-dir/s3a',
'//bucket/pyfiles/pyspark_jdbc.zip'
...]
> print(os.getcwd())
/opt/spark/work-dir
> subprocess.run(["ls", "-l"])
total 0
So we can see that PySpark did attempt to add the s3a:// file I passed via --py-files to the PYTHONPATH, but it misinterpreted the : and split the URI into two separate entries ('/opt/spark/work-dir/s3a' and '//bucket/pyfiles/pyspark_jdbc.zip'), so the path was not added correctly. The /var/data/... directory is on the PYTHONPATH, but the specific .zip file inside it is not, so I cannot import from it.
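To illustrate why having only the parent directory on the path doesn't help: Python only imports from a .zip archive when the archive itself is a sys.path entry (zipimport); a directory that merely contains the zip is not enough. A small local reproduction, with made-up module names:

import sys, zipfile, tempfile, os

# Build a throwaway zip containing a trivial module (names are made up).
tmpdir = tempfile.mkdtemp()
zip_path = os.path.join(tmpdir, "pyspark_jdbc.zip")
with zipfile.ZipFile(zip_path, "w") as zf:
    zf.writestr("mymodule.py", "VALUE = 42\n")

sys.path.append(tmpdir)      # only the parent directory on the path:
# import mymodule            # -> ModuleNotFoundError

sys.path.append(zip_path)    # the zip itself on the path: import works
import mymodule
print(mymodule.VALUE)        # 42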
How can I solve this problem going forward? The .zip file has not been correctly added to the path, and within my program I do not know either:
a. the s3a:// path that PySpark attempted to add to the PYTHONPATH, or
b. the local /var/data/.../ location of the .zip file. I know it is on the path somewhere, and I suppose I could parse it out of sys.path, but that would be messy (roughly the sketch below).
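For illustration, this is roughly the brittle workaround I would rather avoid. It is only a sketch, and it assumes the fetched zip ends up under its original filename inside Spark's userFiles-* staging directory:

import os, sys

# Brittle: scan sys.path for Spark's userFiles-* staging directory and
# hope the fetched zip is sitting in it under its original name.
def find_staged_zip(zip_name="pyspark_jdbc.zip"):
    for entry in sys.path:
        if "userFiles-" in entry and os.path.isdir(entry):
            candidate = os.path.join(entry, zip_name)
            if os.path.exists(candidate):
                return candidate
    return None

staged = find_staged_zip()
if staged:
    sys.path.insert(0, staged)  # put the zip itself on the path so it is importable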
What is an elegant solution to this?