This code runs perfectly when I set the master to localhost. The problem occurs when I submit it on a cluster with two worker nodes.
All the machines have the same version of Python and the same packages. I have also set the path to point to the desired Python version, i.e. 3.5.1, when I submit my Spark job from the master SSH session (roughly along the lines of the sketch after the traceback below). I get the following error:
```
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 2.0 failed 4 times, most recent failure: Lost task 0.3 in stage 2.0 (TID 5, .c..internal): org.apache.spark.api.python.PythonException:
Traceback (most recent call last):
  File "/hadoop/yarn/nm-local-dir/usercache//appcache/application_1469113139977_0011/container_1469113139977_0011_01_000004/pyspark.zip/pyspark/worker.py", line 98, in main
    command = pickleSer._read_with_length(infile)
  File "/hadoop/yarn/nm-local-dir/usercache//appcache/application_1469113139977_0011/container_1469113139977_0011_01_000004/pyspark.zip/pyspark/serializers.py", line 164, in _read_with_length
    return self.loads(obj)
  File "/hadoop/yarn/nm-local-dir/usercache//appcache/application_1469113139977_0011/container_1469113139977_0011_01_000004/pyspark.zip/pyspark/serializers.py", line 419, in loads
    return pickle.loads(obj, encoding=encoding)
  File "/hadoop/yarn/nm-local-dir/usercache//appcache/application_1469113139977_0011/container_1469113139977_0011_01_000004/pyspark.zip/pyspark/mllib/__init__.py", line 25, in <module>
    import numpy
ImportError: No module named 'numpy'
```
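For context, this is roughly how I understand the executor interpreter can be pinned. It's only a sketch, not my exact setup: the path `/usr/bin/python3.5`, the app name, and the reliance on `PYSPARK_PYTHON` plus `spark.executorEnv.PYSPARK_PYTHON` are assumptions on my part.

```python
import os
from pyspark import SparkConf, SparkContext

# Illustrative path only -- the same interpreter (with numpy installed)
# must exist at this location on every node.
PY35 = "/usr/bin/python3.5"

# PYSPARK_PYTHON tells Spark which Python binary to use for PySpark.
os.environ["PYSPARK_PYTHON"] = PY35

conf = (SparkConf()
        .setAppName("numpy-env-check")  # hypothetical app name
        # spark.executorEnv.* forwards an environment variable to each executor,
        # so the YARN containers should launch the same Python 3.5.
        .set("spark.executorEnv.PYSPARK_PYTHON", PY35))

sc = SparkContext(conf=conf)
```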
I have seen other posts where people did not have access to their worker nodes; I do have access to mine, and I get the same message for the other worker node. I'm not sure if I am missing some environment setting. Any help will be much appreciated.
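In case it helps to see what I mean by checking the environment, this is the kind of probe I was planning to run from the driver to see which interpreter each executor actually launches and whether numpy is importable there. It's a sketch only; it assumes an existing SparkContext `sc` and that four partitions are enough to land tasks on both workers.

```python
import socket
import sys

def probe(_partition):
    """Report the hostname, interpreter path, and numpy version seen by this task."""
    try:
        import numpy
        numpy_version = numpy.__version__
    except ImportError:
        numpy_version = None
    yield (socket.gethostname(), sys.executable, numpy_version)

# Four partitions to spread the tasks across the two worker nodes.
print(sc.parallelize(range(4), 4).mapPartitions(probe).collect())
```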