
I have written some Scala code that operates on Spark DataFrames. I want my company's data scientists to be able to call it from PySpark (which they primarily use within Jupyter notebooks), so I have written a thin Python wrapper (foo.py) that calls the Scala code via py4j. The Scala code is compiled into a JAR (foo.jar), and I have packaged the JAR and the wrapper into a Python wheel (foo.whl).
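
For illustration, the wrapper is roughly the following shape. This is only a sketch, not the real code: the Scala object name (`com.example.foo.Foo`) and its `transform` method are made-up placeholders.

from pyspark.sql import DataFrame, SparkSession

def transform(df: DataFrame) -> DataFrame:
    """Thin wrapper: hand the underlying Java DataFrame to the Scala code via py4j."""
    spark = SparkSession.builder.getOrCreate()
    # py4j gateway into the JVM; the Scala object is provided by foo.jar
    jdf = spark._jvm.com.example.foo.Foo.transform(df._jdf)
    # Wrap the returned Java DataFrame back into a PySpark DataFrame
    # (second argument is the SQLContext; newer Spark versions also accept the session)
    return DataFrame(jdf, df.sql_ctx)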

When the wheel is pip installed, the package is available at /path/to/site-packages/foo and the JAR is at /path/to/site-packages/foo/jars/foo.jar.

In foo.py I have the following code which symlinks the JAR into the ${SPARK_HOME}/jars directory:

import os

# Locate the bundled JAR relative to this file and symlink it into ${SPARK_HOME}/jars
package_dir = os.path.dirname(os.path.realpath(__file__))
jar_file_path = os.path.join(package_dir, "foo/jars/foo.jar")
tgt = f"{os.environ.get('SPARK_HOME')}/jars/foo.jar"
if os.path.islink(tgt):
    print(f"Removing existing symlink {tgt}")
    os.unlink(tgt)
os.symlink(jar_file_path, tgt)

When anyone wishing to use this runs `import foo`, the JAR gets symlinked into the location where Spark expects to find it and the Scala code can then be called from PySpark. All works great.

Unfortunately our production environments are constrained: end users (rightfully) do not have sufficient permissions to affect the filesystem, hence when the code above attempts to create the symlink it fails with a permissions error.

Is this solvable? I want to:

  • make it really really easy for our data scientists to pip install foo and have the functionality of the package available to them
  • but also make the JAR available to spark without having to move it into ${SPARK_HOME}

Can anyone suggest a fix?


Some extra information requested by a commenter: our Spark clusters are in fact GCP Dataproc clusters (i.e. Google's managed service for Hadoop/Spark). The data is stored in Google Cloud Storage buckets (GCS, Google's equivalent of S3) and the end users (who are using PySpark in Jupyter) do have access to those storage buckets.

jamiet
  • You have developed a Spark feature, with a Python wrapper -- so why do you focus on "the Pythonic way to do stuff" and not on "the Spark way to do stuff", or possibly "the Jupyter way to do stuff"? Please give some more context about how your users will run their PySpark jobs, or PySpark shell, or Jupyter notebook. – Samson Scharfrichter May 13 '20 at 19:33
  • including whether you have some kind of shared storage available to Spark (i.e. HDFS or S3 or the like) – Samson Scharfrichter May 13 '20 at 19:36
  • Recommended reading: `spark-submit` documentation about the `--jars` option (also available for `pyspark` and Jupyter kernels, also available to the Livy REST gateway with a different syntax and required whitelisting when referring to a local file) https://spark.apache.org/docs/latest/submitting-applications.html (see the sketch after this comment thread) – Samson Scharfrichter May 13 '20 at 19:39
  • Recommended reading: ScalaDoc for `SparkContext` about the `addJar` method https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.SparkContext -- unfortunately the Python API has an `addPyFile` method instead, so it's not clear whether you can use it also for a JAR or must go through Py4J instead https://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.SparkContext – Samson Scharfrichter May 13 '20 at 19:47
  • Also, that post deals with a very similar problem, but there is a smorgasbord of answers which make the whole topic very fuzzy https://stackoverflow.com/questions/27698111/how-to-add-third-party-java-jars-for-use-in-pyspark – Samson Scharfrichter May 13 '20 at 19:50
  • Thx. I have edited the question to clarify that the users will be using Jupyter and the nature of shared storage. The SO question that you provided a link to does provide some interesting avenues for investigation so I shall look at that. Thank you. – jamiet May 14 '20 at 07:10
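
Following up on the `--jars` / `spark.jars` suggestions above, the approach I am now looking at is to point the SparkSession at the JAR inside the installed package (or at a copy in a GCS bucket) rather than symlinking it into `${SPARK_HOME}/jars`. The sketch below is illustrative only: it assumes the symlink-on-import code is removed from foo.py, and the bucket name is made up.

import os

from pyspark.sql import SparkSession

import foo  # assumes foo.py no longer tries to create the symlink on import

# The JAR ships inside the installed package (adjust if the package layout differs)
jar_path = os.path.join(os.path.dirname(foo.__file__), "jars", "foo.jar")

spark = (
    SparkSession.builder
    # spark.jars is only read when the JVM starts, so this must run before the
    # first SparkSession/SparkContext is created in the notebook
    .config("spark.jars", jar_path)
    # On Dataproc a GCS path should also work here, e.g.
    # .config("spark.jars", "gs://some-bucket/jars/foo.jar")
    .getOrCreate()
)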

1 Answer


I believe this is what you are looking for.

Post-install script with Python setuptools

It looks like what you are trying to do is have an installation step that symlinks the JAR file into Spark's path when the user imports foo. The issue is that if the JVM is already started this won't work, and the user won't have the permissions to do this anyway.

What you should do instead is add a post-install hook to your setup.py file so that when the user runs pip install it will automatically do the symlinking:

import os

from setuptools import setup
from setuptools.command.install import install

class PostInstallCommand(install):
    """Post-installation for installation mode."""
    def run(self):
        install.run(self)
        # Symlink the bundled JAR into ${SPARK_HOME}/jars at install time
        package_dir = os.path.dirname(os.path.realpath(__file__))
        jar_file_path = os.path.join(package_dir, "foo/jars/foo.jar")
        tgt = f"{os.environ.get('SPARK_HOME')}/jars/foo.jar"
        if os.path.islink(tgt):
            print(f"Removing existing symlink {tgt}")
            os.unlink(tgt)
        os.symlink(jar_file_path, tgt)

Then pass a cmdclass argument to the setup() function in setup.py:

setup(
    ...

    cmdclass={
        'install': PostInstallCommand,
    },

    ...
)

If you have administrators set up the Python environments for the data scientists, this should solve the permissions issue.

Liam385
  • This is an interesting solution - one challenge I see with this is if the Python environment has been created in advance and is distributed to executors, e.g. using spark.yarn.dist.archives to distribute a Python env that is a .tar.gz or .zip file. Because the pip install is happening on a different machine than the one Spark will run on, it seems as though this would be incompatible with this approach, if I understand correctly. – Brendan Apr 07 '23 at 17:41