
While running a Python script on an EMR cluster with the spark-submit command, the process got stuck at 10% (as seen through yarn application --list), and when I examined the logs, all core executors showed the following type of message as their most recent error:

Could not find valid SPARK_HOME while searching ['/mnt1/yarn/usercache/hadoop/appcache/application_x_0001', '/mnt/yarn/usercache/hadoop/filecache/11/pyspark.zip/pyspark', '/mnt1/yarn/usercache/hadoop/appcache/application_x_0001/container_x_0001_01_000002/pyspark.zip/pyspark', '/mnt1/yarn/usercache/hadoop/appcache/application_x_0001/container_x_0001_01_000002']

The code ran well locally, and since Spark was installed on all core nodes, I couldn't figure out the cause of this issue or how to solve it. Apart from one post in Portuguese without a clear answer, I couldn't find any post with a solution to this issue.

YonGU

1 Answer


Finally, I discovered that the cause of this error is calling the Spark context object from functions that run on the core nodes and aren't part of the main script, while a context had already been created in the main script. Apparently the following command

from pyspark import SparkContext
sc = SparkContext.getOrCreate()

creates a new SparkContext object even if one was already created in the main script on the master node. Therefore, to prevent this issue, when the SparkContext has to be used in a script that isn't the main script, it has to be explicitly passed from the main script to the side script (e.g. as a parameter of a function).
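To illustrate the pattern described above, here is a minimal sketch (the function name load_lookup and the S3 path are hypothetical, and it needs a Spark installation to actually run):

```python
# Anti-pattern (sketch): a helper function, invoked from executor-side
# code, calls getOrCreate() itself. Inside a YARN container this can
# attempt to bootstrap a fresh context and fail with the
# "Could not find valid SPARK_HOME" error.
#
#   from pyspark import SparkContext
#   def load_lookup(path):
#       sc = SparkContext.getOrCreate()   # may create a NEW context
#       return sc.textFile(path).collect()

# Fix (sketch): create the context once in the main script on the
# driver, and pass it into the side-script function as a parameter.
from pyspark import SparkContext


def load_lookup(sc, path):
    # sc is the driver's existing context, received as a parameter;
    # no new SparkContext is ever constructed here.
    return sc.textFile(path).collect()


if __name__ == "__main__":
    sc = SparkContext.getOrCreate()  # created exactly once, on the driver
    lookup = load_lookup(sc, "s3://my-bucket/lookup.txt")  # hypothetical path
```

Note that the SparkContext itself can only live on the driver; functions shipped to executors (e.g. inside map or mapPartitions) should receive plain data, not the context.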

YonGU
  • I'm facing the same problem, but I'm not very clear about the solution you posted. Can you please explain more about how you fixed this issue? – user7343922 Oct 18 '21 at 22:29
  • If you could describe in more detail when and where you face this issue, I could help better (: – YonGU Oct 25 '21 at 11:17
  • I have posted a question, can you please have a look? https://stackoverflow.com/questions/69592183/aws-emr-pyspark-rdd-mappartitions-could-not-find-valid-spark-home-while-sear – user7343922 Nov 01 '21 at 21:05