I've recently set up a cluster (1 master & 2 workers) on Google Cloud Dataproc. I have managed to get a Jupyter notebook interface with a PySpark kernel running. Everything works as long as my workers do not have to execute code that needs external packages such as NumPy or sklearn. For example, when I try to use pairwise_distance from sklearn, I get this error (just a small piece of the huge error log):
ImportError: No module named 'sklearn'
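For context, here is a minimal sketch of the kind of job that triggers it. It is not my actual code: the data is a placeholder, I am using sklearn.metrics.pairwise_distances as the example call, and I assume the usual sc SparkContext that the PySpark kernel provides.
import numpy as np
from sklearn.metrics import pairwise_distances

# toy data; the failure happens when the lambda (and therefore sklearn) runs on a worker
rdd = sc.parallelize([np.random.rand(3) for _ in range(10)])
result = rdd.map(lambda v: pairwise_distances(v.reshape(1, -1),
                                              np.zeros((1, 3)))[0][0]).collect()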
When I SSH into the workers and type
python
>>> help('modules')
I can see that all the packages are properly installed, so that is not the problem.
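A more direct check in that same interpreter (a sketch of the idea, not my exact session) would be:
>>> import sklearn, numpy
>>> sklearn.__version__   # prints a version string, so the package is importable here
>>> numpy.__version__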
When I type which python
I get a path, let's say /opt/conda/bin/python.
And when I check PYSPARK_PYTHON with echo $PYSPARK_PYTHON
I get the same path. From this we can deduce that Spark uses the "right" version of Python, which has all the packages installed, so that is not the problem either.
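One extra check I could run (just a sketch, again assuming the sc SparkContext from the PySpark kernel) is to ask the executors themselves which interpreter they actually use, in case it differs from what I see over SSH:
import sys

# each task reports the Python binary it runs under; if anything other than
# /opt/conda/bin/python shows up, the workers use a different interpreter
print(sc.parallelize(range(4), 4).map(lambda _: sys.executable).distinct().collect())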
I don't understand why my workers are not able to use these packages, since they are properly installed and the PATH variables seem fine.
Any clues? I am a bit lost and hopeless, so I might be omitting information; don't hesitate to ask.
For those wondering, I followed this link up to step 4 to set up my PySpark environment on gcloud.