
We are running a spark-submit command on a Python script that uses Spark to parallelize object detection with Caffe. The script runs perfectly fine as a plain Python program, but it raises an import error as soon as the same code goes through Spark. I know the Spark code is not the problem because it works perfectly on my home machine; it only misbehaves on AWS. I suspect this has something to do with the environment variables, as if the workers don't pick them up.
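
To give an idea of the pattern, the Spark side boils down to something like the sketch below. It is not the actual script, and the function and input names are placeholders, but the point is that `caffe` ends up being imported inside the task that the executors run:

```python
from pyspark import SparkContext

sc = SparkContext(appName="caffe-object-detection")

# placeholder inputs; the real script distributes actual image paths
image_paths = ["/data/img1.jpg", "/data/img2.jpg"]

def detect(path):
    # caffe is imported inside the task, so every executor has to be able to
    # resolve it; this is the import that fails with "No module named caffe"
    import caffe
    caffe.set_mode_cpu()
    # ... load the net and run detection on `path` here (omitted) ...
    return path

results = sc.parallelize(image_paths).map(detect).collect()
```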

These environment variables are set:

SPARK_HOME=/opt/spark/spark-2.0.0-bin-hadoop2.7
PATH=$SPARK_HOME/bin:$PATH
PYTHONPATH=$SPARK_HOME/python/:$PYTHONPATH
PYTHONPATH=/opt/caffe/python:${PYTHONPATH}

Error:

16/10/03 01:36:21 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, 172.31.50.167): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
 File "/opt/spark/spark-2.0.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/worker.py", line 161, in main
   func, profiler, deserializer, serializer = read_command(pickleSer, infile)
 File "/opt/spark/spark-2.0.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/worker.py", line 54, in read_command
   command = serializer._read_with_length(file)
 File "/opt/spark/spark-2.0.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/serializers.py", line 164, in _read_with_length
   return self.loads(obj)
 File "/opt/spark/spark-2.0.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/serializers.py", line 422, in loads
   return pickle.loads(obj)
 File "/opt/spark/spark-2.0.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/cloudpickle.py", line 664, in subimport
   __import__(name)
ImportError: ('No module named caffe', <function subimport at 0x7efc34a68b90>, ('caffe',))

Does anyone know why this would be an issue?

This package from Yahoo does what we're trying to do: it ships Caffe as a jar dependency and then uses it from Python. But I haven't found any resources on how to build it and import it ourselves.

https://github.com/yahoo/CaffeOnSpark

  • Maybe you could try to load the python package using a `.egg` as suggested [here](http://stackoverflow.com/questions/24686474/shipping-python-modules-in-pyspark-to-other-nodes) – Bernard Jesop Oct 03 '16 at 15:01
  • Tried that, didn't work. From what I've googled, it seems I have to export it as a jar package, but I don't know how to build it and then import it into Python. – alfredox Oct 13 '16 at 02:05
  • If you built caffe from source, did you try adding caffe libraries to `LD_LIBRARY_PATH` manually? Did you try importing caffe from a python terminal? – ar7 Oct 13 '16 at 05:45
  • This may happen for two possible reasons: either the `caffe` module isn't on PYTHONPATH when running under Spark (check `sys.path` before importing the module), or for some reason the Spark process doesn't have permission to read the `caffe` module files (you could check by, e.g., opening the corresponding file). – user3159253 Oct 15 '16 at 02:15
  • I have printed PYTHONPATH on each worker node and it has the caffe path "opt/caffe" set up properly, and so does PATH. I feel that your second point may be a clue; I do suspect that when the workers run tasks they somehow don't have permission or access to the rest of the file system. Could this be fixed by passing the path via LD_LIBRARY_PATH, or maybe some other Spark config that determines what user the tasks run under? – alfredox Oct 17 '16 at 22:20

1 Answer


You probably haven't compiled the caffe Python wrappers in your AWS environment. For reasons that completely escape me (and several others, https://github.com/BVLC/caffe/issues/2440), pycaffe is not available as a PyPI package, and you have to compile it yourself. You should follow the compilation/make instructions at http://caffe.berkeleyvision.org/installation.html#python, or automate them with ebextensions if you are in an AWS Elastic Beanstalk environment.
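
If pycaffe is in fact built on every node, the other thing worth ruling out is whether the executor Python processes actually see your PYTHONPATH and LD_LIBRARY_PATH; variables exported in an interactive shell are not necessarily passed through to the worker processes that Spark launches. Below is a minimal sketch (not tested on your cluster) that forces the paths through Spark's `spark.executorEnv.*` settings and then probes the import on the executors. `/opt/caffe/python` comes from your question; `/opt/caffe/build/lib` is an assumption about where a source build of Caffe puts its shared libraries, so adjust both as needed.

```python
from pyspark import SparkConf, SparkContext

conf = (
    SparkConf()
    .setAppName("caffe-import-probe")
    # push the caffe Python bindings onto every executor's PYTHONPATH
    # (path taken from the question; adjust if caffe lives elsewhere)
    .set("spark.executorEnv.PYTHONPATH", "/opt/caffe/python")
    # assumed location of the compiled caffe shared libraries
    .set("spark.executorEnv.LD_LIBRARY_PATH", "/opt/caffe/build/lib")
)
sc = SparkContext(conf=conf)

def probe(_):
    # try the import on the executor and report what actually happened there
    import socket
    import sys
    try:
        import caffe
        return (socket.gethostname(), "ok", caffe.__file__)
    except ImportError as exc:
        return (socket.gethostname(), "ImportError: " + str(exc),
                "sys.path=" + ":".join(sys.path))

print(sc.parallelize(range(sc.defaultParallelism), sc.defaultParallelism)
        .map(probe).collect())
```

If the probe reports an ImportError together with a `sys.path` that is missing `/opt/caffe/python`, the problem is environment propagation to the executors rather than a missing build.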

  • I have; I even ran ipython, did `import caffe`, and it worked just fine. In fact, I can run the pure Python script standalone on every node, and they all import caffe perfectly fine and do all the processing. It's only in cluster mode that, for some reason, it ignores all env vars and doesn't want to pick up caffe. – alfredox Oct 15 '16 at 21:23
  • @alfredox: where do you run ipython that you can import caffe? – 2ps Oct 16 '16 at 00:49
  • On the worker nodes as the "ubuntu" user. So you have your master, and then the workers connect to the master to form the cluster. I have run that on each worker node and it all works fine. It is almost as if, when the workers receive tasks from the master to run that script, they are running the task inside some kind of container that does not give them access to the rest of the file system, such as the opt/caffe path, so they can't properly import that library inside Python. – alfredox Oct 17 '16 at 22:18