I've recently set up a cluster (1 master & 2 workers) on Google Cloud Dataproc. I have managed to get a Jupyter notebook interface with a PySpark kernel running. Everything works as long as my workers do not have to execute code that needs external packages such as NumPy or sklearn. For example, when I try to use pairwise_distance from sklearn, I get this error (just a small piece of the huge error log):
ImportError: No module named 'sklearn'
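For context, here is a minimal sketch of the kind of job that triggers it. It is not my actual code: the data is a placeholder, I am using sklearn.metrics.pairwise_distances as the example call, and I assume the usual sc SparkContext that the PySpark kernel provides.
import numpy as np
from sklearn.metrics import pairwise_distances

# toy data; the failure happens when the lambda (and therefore sklearn) runs on a worker
rdd = sc.parallelize([np.random.rand(3) for _ in range(10)])
result = rdd.map(lambda v: pairwise_distances(v.reshape(1, -1),
                                              np.zeros((1, 3)))[0][0]).collect()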
When I SSH into the workers and type
python
>>> help('modules')
I can see that all the packages are properly installed, so that is not the problem.
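A more direct check in that same interpreter (a sketch of the idea, not my exact session) would be:
>>> import sklearn, numpy
>>> sklearn.__version__   # prints a version string, so the package is importable here
>>> numpy.__version__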
When I type which python
I get a path, let's say /opt/conda/bin/python.
And when I check PYSPARK_PYTHON with echo $PYSPARK_PYTHON
I get the same path. From this we can deduce that Spark uses the "right" version of Python, which has all the packages installed, so that is not the problem either.
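One extra check I could run (just a sketch, again assuming the sc SparkContext from the PySpark kernel) is to ask the executors themselves which interpreter they actually use, in case it differs from what I see over SSH:
import sys

# each task reports the Python binary it runs under; if anything other than
# /opt/conda/bin/python shows up, the workers use a different interpreter
print(sc.parallelize(range(4), 4).map(lambda _: sys.executable).distinct().collect())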
I don't understand why my workers are not able to use these packages, since they are properly installed and the PATH variables seem fine.
Any clues? I am a bit lost and hopeless, so I might be omitting information; don't hesitate to ask.
For those wondering, I followed this link up to step 4 to set up my PySpark environment on gcloud.