
I'm totally new to this, so I don't really understand how it works. I need to run Spark on my machine (logging in over SSH) and set it up with 60g of memory and 6 cores for execution. This is what I've tried:

spark-submit --master yarn --deploy-mode cluster --executor-memory 60g --executor-cores 6

And this is what I got:

SPARK_MAJOR_VERSION is set to 2, using Spark2
Exception in thread "main" java.lang.IllegalArgumentException: Missing application resource.
    at org.apache.spark.launcher.CommandBuilderUtils.checkArgument(CommandBuilderUtils.java:253)
    at org.apache.spark.launcher.SparkSubmitCommandBuilder.buildSparkSubmitArgs(SparkSubmitCommandBuilder.java:160)
    at org.apache.spark.launcher.SparkSubmitCommandBuilder.buildSparkSubmitCommand(SparkSubmitCommandBuilder.java:276)
    at org.apache.spark.launcher.SparkSubmitCommandBuilder.buildCommand(SparkSubmitCommandBuilder.java:151)
    at org.apache.spark.launcher.Main.main(Main.java:87)

So I guess something needs to be added to this command line for it to run, but I have no idea what.


2 Answers


Here:

spark-submit --master yarn --deploy-mode cluster --executor-memory 60g --executor-cores 6

you haven't specified the entry point or the application to run!

Check the spark-submit documentation, which states:

Some of the commonly used options are:

  • --class: The entry point for your application (e.g. org.apache.spark.examples.SparkPi)

  • --master: The master URL for the cluster (e.g. spark://23.195.26.187:7077)

  • --deploy-mode: Whether to deploy your driver on the worker nodes (cluster) or locally as an external client (client) (default: client)

  • --conf: Arbitrary Spark configuration property in key=value format. For values that contain spaces wrap “key=value” in quotes (as shown).

  • application-jar: Path to a bundled jar including your application and all dependencies. The URL must be globally visible inside of your cluster, for instance, an hdfs:// path or a file:// path that is present on all nodes.

  • application-arguments: Arguments passed to the main method of your main class, if any

For Python applications, simply pass a .py file in the place of <application-jar> instead of a JAR, and add Python .zip, .egg or .py files to the search path with --py-files.

Here is an example that takes some JARs and a Python file as the application (I didn't include your additional parameters for simplicity). Note that the .py file is passed in the application position, after the options:

./spark-submit --jars myjar1.jar,myjar2.jar path/to/my/main.py arg1 arg2
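For context, a minimal sketch of what such a main.py could contain (the file name, the assumption that arg1/arg2 are input and output paths, and the column name "value" are all hypothetical, just to make the example self-contained):

# main.py -- hypothetical minimal PySpark application run via spark-submit
import sys
from pyspark.sql import SparkSession

if __name__ == "__main__":
    spark = SparkSession.builder.appName("csv-filter").getOrCreate()

    # arg1 and arg2 from the spark-submit command arrive in sys.argv
    input_path, output_path = sys.argv[1], sys.argv[2]

    # read the CSVs, keep only the rows we care about, write the result back out
    df = spark.read.csv(input_path, header=True, inferSchema=True)
    df.filter(df["value"] > 0).write.csv(output_path, header=True)

    spark.stop()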

I hope I can enter the spark shell (with that much memory and cores) and type code in there

Then you need pyspark, not spark-submit! See: What is the difference between spark-submit and pyspark?

So what you really want to do is this (an interactive shell has to run in client deploy mode, so --deploy-mode cluster is left out):

pyspark --master yarn --executor-memory 60g --executor-cores 6
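Once the shell is up, a SparkSession called spark is already created for you, so you can read and filter your CSVs and hand the (now small) result over to pandas. A rough sketch, assuming a hypothetical input path and a hypothetical "status" column:

# inside the pyspark shell: `spark` already exists
df = spark.read.csv("/path/to/csvs/*.csv", header=True, inferSchema=True)

# keep only the rows you care about (the column and value are just examples)
filtered = df.filter(df["status"] == "OK")

# toPandas() collects everything to the driver, so only call it once the
# filtered data is small enough to fit in the driver's memory
pdf = filtered.toPandas()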
gsamaras
  • I'm not really sure if I understand. I have a lot of csv files on my machine and I need to filter them and save to pandas. So I don't have any file to run; I hope I can enter the spark shell (with that much memory and cores) and type code in there. – jovicbg Sep 20 '17 at 10:38

If I understand your question correctly, your total number of cores is 6 and your total memory is 60 GB. The parameters

--executor-memory
--executor-cores

are actually per executor inside Spark. You should probably try

--executor-memory 8G --executor-cores 1

This will create about 6 executors of 8 GB each (6 * 8 = 48 GB in total), leaving the remaining 12 GB for operating system processing and metadata.
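A sketch of how that sizing could look on the command line, whether for spark-submit or the interactive shell (--num-executors is the YARN flag that caps the executor count; the exact numbers are only an illustration of the idea above):

pyspark --master yarn --num-executors 6 --executor-memory 8G --executor-cores 1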

braj
  • No, no. I have about 12 cores, maybe more, and a lot of memory; 60 GB is about half of each. – jovicbg Sep 20 '17 at 11:40
  • In that case, can you just try spark-submit with just one parameter (the Python script) and see if you get any error. – braj Sep 20 '17 at 11:59
  • And yes, you haven't added the script that you want to run to spark-submit. – braj Sep 20 '17 at 12:02
  • I found the solution, thank you. I just needed to type pyspark instead of spark-submit. :) – jovicbg Sep 20 '17 at 12:07
  • spark-submit is for submitting a job, pyspark is for interactive Spark access – braj Sep 20 '17 at 12:28