I am trying to run PySpark on YARN, but I get the following error whenever I type any command in the console.
I am able to run the Scala shell for Spark in both local and YARN mode. PySpark runs fine in local mode, but it does not work in YARN mode.
OS: RHEL 6.x
Hadoop distribution: IBM BigInsights 4.0
Spark version: 1.2.1
WARN scheduler.TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, work): org.apache.spark.SparkException:
Error from python worker:
  /usr/bin/python: No module named pyspark
PYTHONPATH was:
  /mnt/sdj1/hadoop/yarn/local/filecache/13/spark-assembly.jar
(My comment: this path is not present on the local Linux filesystem, but only on the data nodes.)
java.io.EOFException
        at java.io.DataInputStream.readInt(DataInputStream.java:392)
        at org.apache.spark.api.python.PythonWorkerFactory.startDaemon(PythonWorkerFactory.scala:163)
        at org.apache.spark.api.python.PythonWorkerFactory.createThroughDaemon(PythonWorkerFactory.scala:86)
        at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:62)
        at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:102)
        at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:70)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
        at org.apache.spark.scheduler.Task.run(Task.scala:56)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:200)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
I have set SPARK_HOME and PYTHONPATH via export commands, as follows:
export SPARK_HOME=/path/to/spark
export PYTHONPATH=/path/to/spark/python/:/path/to/spark/lib/spark-assembly.jar
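For context, this is roughly how I launch the shell and the kind of trivial command that triggers the error (the master string and the sample RDD below are just illustrative; any action fails the same way in YARN mode):

# launch PySpark against YARN (Spark 1.2.x client-mode syntax)
pyspark --master yarn-client

# inside the shell, even a trivial job fails with the error above
>>> sc.parallelize([1, 2, 3]).count()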
Can someone please help me out with this?