
I have a PySpark script, stored both on the master node of an AWS EMR cluster and in an S3 bucket, that fetches over 140M rows from a MySQL database and writes the sum of a column back to the log files on S3.

When I spark-submit the copy of the script on the master node, the job completes successfully and the output is stored in the log files on the S3 bucket.

However, when I spark-submit the copy of the script stored in the S3 bucket using the commands below (run in the terminal after SSH-ing into the master node):

  1. spark-submit --master yarn --deploy-mode cluster --py-files s3://bucket_name/my_script.py
     This returns an Error: Missing application resource. error.

  2. spark-submit s3://bucket_name/my_script.py
     This shows:

20/07/02 11:26:23 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Exception in thread "main" java.lang.RuntimeException: java.lang.ClassNotFoundException: Class com.amazon.ws.emr.hadoop.fs.EmrFileSystem not found
    at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2369)
    at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2840)
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2857)
    at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:99)
    at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2896)
    at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2878)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:392)
    at org.apache.spark.util.Utils$.getHadoopFileSystem(Utils.scala:1911)
    at org.apache.spark.util.Utils$.doFetchFile(Utils.scala:766)
    at org.apache.spark.deploy.DependencyUtils$.downloadFile(DependencyUtils.scala:137)
    at org.apache.spark.deploy.SparkSubmit$$anonfun$prepareSubmitEnvironment$7.apply(SparkSubmit.scala:356)
    at org.apache.spark.deploy.SparkSubmit$$anonfun$prepareSubmitEnvironment$7.apply(SparkSubmit.scala:356)
    at scala.Option.map(Option.scala:146)
    at org.apache.spark.deploy.SparkSubmit.prepareSubmitEnvironment(SparkSubmit.scala:355)
    at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:782)
    at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161)
    at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184)
    at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
    at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:928)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:937)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.ClassNotFoundException: Class com.amazon.ws.emr.hadoop.fs.EmrFileSystem not found
    at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2273)
    at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2367)
    ... 20 more

I read about having to add a Spark step on the AWS EMR cluster in order to submit a PySpark script stored on S3.

Am I correct in saying that I would need to create a step in order to submit my PySpark job stored on S3?

In the 'Add Step' window that pops up on the AWS Console, the 'Application location' field says I have to type in the location of a JAR file. What JAR file are they referring to? Does my PySpark script have to be packaged into a JAR file (and if so, how do I do that), or do I just give the path to my PySpark script?

In the same 'Add Step' window, under the Spark-submit options, how do I know what to write for the --class parameter? Can I leave this field empty? If not, why not?

I have gone through the AWS EMR documentation. I have so many questions because I dove head-first into the problem and only researched things when an error popped up.

ouila

2 Answers


Your spark-submit command should be this:

spark-submit --master yarn --deploy-mode cluster s3://bucket_name/my_script.py

--py-files is used to pass Python dependency modules, not the application code itself.
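For example, a hedged sketch of that usage (the deps.zip archive name is just a placeholder for whatever dependency bundle you might have):

# the application script is the last positional argument;
# --py-files only ships extra Python modules alongside it
spark-submit --master yarn --deploy-mode cluster \
  --py-files s3://bucket_name/deps.zip \
  s3://bucket_name/my_script.py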

When you are adding a step in EMR to run the Spark job, the JAR location is your Python file path, i.e. s3://bucket_name/my_script.py.
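The same step can also be added from the command line. A rough sketch with the AWS CLI (the cluster id and step name below are placeholders):

# Type=Spark makes EMR invoke spark-submit for you; the script path
# goes where the console asks for the application/JAR location
aws emr add-steps --cluster-id j-XXXXXXXXXXXXX \
  --steps 'Type=Spark,Name=MyPySparkJob,ActionOnFailure=CONTINUE,Args=[--deploy-mode,cluster,s3://bucket_name/my_script.py]'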

srikanth holur
  • I used the spark-submit command you suggested and I'm still getting the EmrFileSystem error mentioned in point no. 2. So, if I just put the path to my script in the 'Application location' field, that's enough? Why do I see so many examples with a .jar file? – ouila Jul 02 '20 at 12:53
  • Yeah, select cluster mode and mention the path. .jar files are for Scala/Java code. If you are having errors with spark-submit, you will have the same errors with a step as well. Can you tell me your EMR configuration? – srikanth holur Jul 03 '20 at 02:16
  • What do I type under the spark-submit options while adding a step? I went through the documentation and similar questions on stackoverflow. – ouila Jul 03 '20 at 19:30
  • For now, just ignore that. Keep it blank – srikanth holur Jul 03 '20 at 21:28

No, it's not mandatory to use a step to submit a Spark job.
You can also use spark-submit.

To submit a PySpark script using a step, please refer to the AWS doc and Stack Overflow.


For problem 1:
By default Spark will use Python 2. You need to add two configs.

Go to $SPARK_HOME/conf/spark-env.sh and add

export PYSPARK_PYTHON=/usr/bin/python3
export PYSPARK_DRIVER_PYTHON=/usr/bin/python3
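If you prefer not to edit spark-env.sh, the same thing can, as far as I know, be set per job with spark-submit --conf flags; a sketch, assuming a YARN deployment:

# point the YARN application master and the executors at Python 3
spark-submit --master yarn --deploy-mode cluster \
  --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=/usr/bin/python3 \
  --conf spark.executorEnv.PYSPARK_PYTHON=/usr/bin/python3 \
  s3://bucket_name/my_script.py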

Note: If you have any custom bundle, add it using --py-files.


For problem 2:
A hadoop-assembly JAR exists in /usr/share/aws/emr/emrfs/lib/. It contains com.amazon.ws.emr.hadoop.fs.EmrFileSystem.

You need to add this to your classpath.

A better option, in my opinion, is to create a symbolic link to the hadoop-assembly JAR in HADOOP_HOME (/usr/lib/hadoop) in your bootstrap action.
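A minimal bootstrap-action sketch of that idea (the exact assembly JAR name varies by EMR release, and linking into /usr/lib/hadoop/lib is my assumption about where the classpath picks it up, so adapt the paths):

#!/bin/bash
# link the EMRFS hadoop-assembly jar into Hadoop's lib directory so
# com.amazon.ws.emr.hadoop.fs.EmrFileSystem is on the classpath
sudo ln -sf /usr/share/aws/emr/emrfs/lib/emrfs-hadoop-assembly-*.jar /usr/lib/hadoop/lib/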

Snigdhajyoti
  • What do I type under the spark-submit options while adding a step? I went through the documentation and similar questions on stackoverflow you shared. They don't seem to answer my question. – ouila Jul 03 '20 at 19:31