3

I have some custom JDBC drivers that I want to use in an application. I include them via --py-files when I spark-submit to a Kubernetes Spark cluster:

spark-submit --py-files s3a://bucket/pyfiles/pyspark_jdbc.zip my_application.py

This gives me:

java.io.FileNotFoundException: File file:/opt/spark/work-dir/pyspark_jdbc.zip does not exist

As other answers have told me, I need to actually add that zip file to the PYTHONPATH. I find that this is no longer true with at least Spark 2.3+, but let's do it anyway with:

spark.sparkContext.addPyFile("pyspark_jdbc.zip")

Looking into the cluster logs, I see:

19/10/21 22:40:56 INFO Utils: Fetching s3a://bucket/pyfiles/pyspark_jdbc.zip to 
/var/data/spark-52e390f5-85f4-41c4-9957-ff79f1433f64/spark-402e0a00-6806-40a7-a17d-5adf39a5c2d4/userFiles-680c1bce-ad5f-4a0b-9160-2c3037eefc29/fetchFileTemp5609787392859819321.tmp

So, the py-files definitely got fetched, but into /var/data/... and not into my working directory. Therefore, when I go to add the location of my .zip file to my Python path, I don't know where it is. Some diagnostics on the cluster right before attempting to add the Python files:

> print(sys.path)
[..., 
 '/var/data/spark-52e390f5-85f4-41c4-9957-ff79f1433f64/spark-402e0a00-6806-40a7-a17d-5adf39a5c2d4/userFiles-680c1bce-ad5f-4a0b-9160-2c3037eefc29', 
 '/opt/spark/work-dir/s3a', 
 '//bucket/pyfiles/pyspark_jdbc.zip'
...]
> print(os.getcwd())
/opt/spark/work-dir
> subprocess.run(["ls", "-l"])
total 0

So we see that PySpark did attempt to add the s3a:// file I passed via --py-files to the PYTHONPATH, except that it mis-interpreted the : and did not add the path correctly. The /var/data/... directory is in the PYTHONPATH, but the specific .zip file is not, so I cannot import from it.
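
Purely as an illustration of my guess at the mechanics (not verified against the Spark source): PYTHONPATH is a colon-separated list, so an s3a:// URL appended verbatim gets split at the colon, and the leftover relative piece "s3a" then gets resolved against the working directory:

import os

url = "s3a://bucket/pyfiles/pyspark_jdbc.zip"
print(url.split(":"))
# ['s3a', '//bucket/pyfiles/pyspark_jdbc.zip']
print(os.path.join(os.getcwd(), url.split(":")[0]))
# e.g. /opt/spark/work-dir/s3a when run from the work dir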

How can I solve this problem going forward? The .zip file has not been correctly added to the path, and within my program I do not know either

a. the s3a:// path that PySpark attempted to add to the PYTHONPATH, or

b. the local /var/data/.../ location of the .zip file. I know it is in the path somewhere, and I suppose I could parse it out, but that would be messy.

What is an elegant solution to this?

kingledion

2 Answers

6

A (better) solution is to use the SparkFiles object in pyspark to locate your imports.

from pyspark import SparkFiles

spark.sparkContext.addPyFile(SparkFiles.get("pyspark_jdbc.zip"))
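
For completeness, here is roughly how this fits into the application (the SparkSession setup and the import at the end are just placeholders for whatever your zip actually contains):

from pyspark import SparkFiles
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# The --py-files zip has already been fetched into Spark's scratch directory;
# SparkFiles.get resolves that local path so addPyFile can put it on sys.path.
spark.sparkContext.addPyFile(SparkFiles.get("pyspark_jdbc.zip"))

# Placeholder import: use whatever module your zip actually provides.
import pyspark_jdbc
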
kingledion
  • great answer!.. I was looking for this only. – stack0114106 Aug 03 '20 at 21:52
  • pyspark.sql.utils.IllegalArgumentException: requirement failed: File pyrepos.zip was already registered with a different path (old path = /tmp/spark-4d162bb8-66cb-469e-b7ee-d843c370598c/pyrepos.zip, new path = /var/data/spark-4d7d6bee-dbd0-4715-8509-22d584e4facd/spark-270dbe70-a0c7-4f5f-a645-bfb0120c6cef/userFiles-60d2ca4c-7da1-407e-ab6b-b814b164b9f7/pyrepos.zip – Dariusz Krynicki Aug 23 '22 at 16:03
  • it seems to me it no longer works. any idea why? – Dariusz Krynicki Aug 23 '22 at 16:03
1

A (bad) solution is to simply parse out the paths that look like they might hold the .zip file, and add them to sys.path.

for pth in [p for p in sys.path if p.startswith("/var/data/spark-")]:
    try:
        # Guess that the fetched .zip sits in Spark's userFiles directory.
        sys.path.append("{}/pyspark_jdbc.zip".format(pth))
    except Exception:
        pass

This solution worked, allowing us to go on to testing our actual Spark application, but I don't consider it a production-ready solution.

kingledion