I have written some Scala code that operates upon Spark DataFrames. I want my company's data scientists to be able to call it from PySpark (which they primarily use within Jupyter notebooks), hence I have written a thin Python wrapper around it that calls the Scala code (via py4j); the Scala code has been compiled into a JAR (foo.jar). I have packaged the JAR and the wrapper (foo.py) into a Python wheel (foo.whl).
When the wheel is pip installed it is available at /path/to/site-packages/foo and the JAR is at /path/to/site-packages/foo/jars/foo.jar.
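For reference, the JAR is bundled into the wheel as package data. A minimal sketch of the packaging, assuming a setup.py-based build (the names and options below are illustrative rather than my exact build file):

from setuptools import setup, find_packages

# Illustrative packaging sketch: ship the compiled JAR as package data
# alongside the Python wrapper so that pip installs both together.
setup(
    name="foo",
    version="0.1.0",
    packages=find_packages(),
    package_data={"foo": ["jars/*.jar"]},  # ends up under site-packages/foo/jars/
    include_package_data=True,
)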
In foo.py I have the following code, which installs the JAR (via a symlink) into the ${SPARK_HOME}/jars directory:
import os

# Locate the JAR that ships inside the installed package
package_dir = os.path.dirname(os.path.realpath(__file__))
jar_file_path = os.path.join(package_dir, "foo/jars/foo.jar")

# Symlink it into the directory Spark loads JARs from
tgt = f"{os.environ.get('SPARK_HOME')}/jars/foo.jar"
if os.path.islink(tgt):
    print(f"Removing existing symlink {tgt}")
    os.unlink(tgt)
os.symlink(jar_file_path, tgt)
When I, or anyone wishing to use this, runs import foo, the JAR gets linked into the location where Spark expects to find it and the functionality can then be called from PySpark code. All works great.
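For context, the wrapper itself is thin. Roughly speaking it does something like the following (the Scala class and method names are placeholders for the real ones; recent PySpark versions accept a SparkSession in the DataFrame constructor, older versions expect a SQLContext):

from pyspark.sql import DataFrame, SparkSession

def transform(df: DataFrame) -> DataFrame:
    """Call the Scala implementation inside foo.jar via the py4j gateway."""
    spark = SparkSession.builder.getOrCreate()
    # com.example.foo.Transformer is a placeholder for the real Scala entry point
    jdf = spark._jvm.com.example.foo.Transformer.transform(df._jdf)
    # Wrap the returned Java DataFrame back into a Python DataFrame
    return DataFrame(jdf, spark)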
Unfortunately our production environments are constrained: end users (rightfully) do not have sufficient permissions to modify the filesystem, so when the code above attempts to create the symlink it fails with a permissions error.
Is this solvable? I want to:

- make it really, really easy for our data scientists to pip install foo and have the functionality of the package available to them
- but also make the JAR available to Spark without having to move it into ${SPARK_HOME} (a sketch of the kind of thing I mean follows below)

Can anyone suggest a fix?
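To illustrate the second point: ideally the wrapper could point Spark at the JAR wherever pip has installed it, rather than symlinking it into ${SPARK_HOME}. A rough sketch of the kind of thing I have in mind (spark.jars is just one configuration mechanism I am aware of, and it only takes effect if it is set before the SparkSession is created, so treat this as an assumption rather than a working solution):

import os
from pyspark.sql import SparkSession

# Rough sketch: point Spark at the JAR inside the installed package instead of
# symlinking it into ${SPARK_HOME}/jars. Only effective if no session exists yet.
package_dir = os.path.dirname(os.path.realpath(__file__))
jar_file_path = os.path.join(package_dir, "foo/jars/foo.jar")

spark = (
    SparkSession.builder
    .config("spark.jars", jar_file_path)
    .getOrCreate()
)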
Some extra information requested by a commenter: our Spark clusters are in fact GCP Dataproc clusters (i.e. Google's managed service for Hadoop/Spark). The data is stored in Google Cloud Storage buckets (GCS, Google's equivalent of S3) and the end users (who are using PySpark in Jupyter) do have access to those storage buckets.