I'm trying to run a PySpark job that depends on certain python3 libraries.
I know I can install these libraries on the Spark cluster, but since I'm reusing the cluster for multiple jobs, I'd rather bundle all dependencies and pass them to each job via the --py-files directive.
To do this I use:
pip3 install -r requirements.txt --target ./build/dependencies
cd ./build/dependencies
zip -qrm ../dependencies.zip .
This effectively zips all code from the required packages so that it sits at the root level of the archive.
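As an optional sanity check, you can list the top-level names in the archive to confirm the packages really sit at the root; the path below is an assumption based on the build layout above:

import zipfile

with zipfile.ZipFile('./build/dependencies.zip') as zf:
    # Collect the first path component of every entry, e.g. 'pandas', 'numpy'
    top_level = sorted({name.split('/')[0] for name in zf.namelist()})
    print(top_level)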
In my main.py I can import the dependencies:
if os.path.exists('dependencies.zip'):
    sys.path.insert(0, 'dependencies.zip')
And I also add the .zip to my SparkContext:
sc.addPyFile('dependencies.zip')
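For reference, here is a minimal sketch of how these pieces fit together in main.py; the SparkContext setup and the app name are purely illustrative:

import os
import sys

from pyspark import SparkContext

# Make the bundled packages importable on the driver before anything imports them
if os.path.exists('dependencies.zip'):
    sys.path.insert(0, 'dependencies.zip')

sc = SparkContext(appName='enricher')  # appName is an assumption

# Ship the same archive to the executors so imports resolve there as well
sc.addPyFile('dependencies.zip')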
So far so good.
But for some reason this devolves into some kind of dependency hell on the Spark cluster.
E.g. running
spark-submit --py-files dependencies.zip main.py
where in main.py (or a class it imports) I want to use pandas (see the snippet below).
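For illustration, the pandas usage itself is nothing exotic, something like this (names and values are hypothetical):

import pandas as pd

# The import chain already fails before this line is ever reached
df = pd.DataFrame({'id': [1, 2, 3]})

Running this triggers the following error: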
Traceback (most recent call last):
  File "/Users/tomlous/Development/Python/enrichers/build/main.py", line 53, in <module>
    job_module = importlib.import_module('spark.jobs.%s' % args.job_name)
  ...
  File "<frozen importlib._bootstrap>", line 978, in _gcd_import
  File "<frozen importlib._bootstrap>", line 961, in _find_and_load
  File "<frozen importlib._bootstrap>", line 950, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 646, in _load_unlocked
  File "<frozen importlib._bootstrap>", line 616, in _load_backward_compatible
  File "dependencies.zip/spark/jobs/classify_existence.py", line 9, in <module>
  File "dependencies.zip/enrich/existence.py", line 3, in <module>
  File "dependencies.zip/pandas/__init__.py", line 19, in <module>
ImportError: Missing required dependencies ['numpy']
Looking at pandas' __init__.py I see something like __import__(numpy), so I assume numpy is not being loaded.
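That line is part of pandas' hard-dependency check at the top of its __init__.py; paraphrased from memory (treat it as an approximation, not the exact source), it does something like:

# Approximate shape of the check in pandas/__init__.py (paraphrased)
hard_dependencies = ("numpy", "pytz", "dateutil")
missing_dependencies = []

for dependency in hard_dependencies:
    try:
        __import__(dependency)  # this is the import that fails for numpy inside the zip
    except ImportError:
        missing_dependencies.append(dependency)

if missing_dependencies:
    raise ImportError(
        "Missing required dependencies {0}".format(missing_dependencies))

which matches the "Missing required dependencies ['numpy']" message above.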
But if I change my code to explicitly call numpy functions, it actually finds numpy, just not some of its own dependencies:
import numpy as np
a = np.array([1, 2, 3])
That code fails with:
Traceback (most recent call last):
  File "dependencies.zip/numpy/core/__init__.py", line 16, in <module>
ImportError: cannot import name 'multiarray'
So my question is:
How should I bundle python3 libraries with my Spark job so that I don't have to pip3 install every possible library on the Spark cluster?