I am trying to run a Python script on Spark in client mode (i.e. a single node). The script has some dependencies (e.g. pandas) installed via Conda. There are various resources that cover this use case, for example:
- https://conda.github.io/conda-pack/spark.html
- https://databricks.com/blog/2020/12/22/how-to-manage-python-dependencies-in-pyspark.html
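For context, the Conda environment was built and packed roughly along these lines (the environment name and package versions here are placeholders, not the exact ones I used):

```
# conda-pack itself comes from conda-forge (or pip)
conda install -y -c conda-forge conda-pack

# Environment containing the script's dependencies
conda create -y -n pyspark_env python=3.9 pandas

# Pack the environment into the archive passed to spark-submit below
conda pack -n pyspark_env -o /tmp/env.tar
```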
Following those examples, I run Spark from the Spark bin directory with the following command, where /tmp/env.tar is the Conda environment packed by conda-pack:
```
export PYSPARK_PYTHON=./environment/bin/python
./spark-submit --archives=/tmp/env.tar#environment script.py
```
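script.py itself does nothing exotic; a minimal stand-in along these lines (the contents are illustrative, my real script just uses pandas similarly) captures what it does:

```python
# script.py - minimal sketch: report the pandas version seen by the
# driver and by an executor task, to confirm the packed Conda
# environment is actually being used.
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("conda-pack-test").getOrCreate()
sc = spark.sparkContext

print("driver pandas:", pd.__version__)

def executor_pandas_version(_):
    import pandas
    return pandas.__version__

print("executor pandas:",
      sc.parallelize([0], 1).map(executor_pandas_version).first())

spark.stop()
```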
When I run this, Spark throws the following exception:

```
java.io.IOException: Cannot run program "./environment/bin/python": error=2, No such file or directory
```
Why does this not work? I am also curious about the ./ in the Python path, since it is not clear where Spark unpacks the tar file. I assumed I did not need to load the tar file into HDFS, since everything is running on a single node (but perhaps I do for cluster mode?).