
I'm totally new to this, so I don't really understand how it works. I need to run Spark on my machine (logging in over SSH) and set it up with 60g of memory and 6 cores for execution. This is what I've tried:

spark-submit --master yarn --deploy-mode cluster --executor-memory 60g --executor-cores 6

And this is what I got:

SPARK_MAJOR_VERSION is set to 2, using Spark2
Exception in thread "main" java.lang.IllegalArgumentException: Missing application resource.
    at org.apache.spark.launcher.CommandBuilderUtils.checkArgument(CommandBuilderUtils.java:253)
    at org.apache.spark.launcher.SparkSubmitCommandBuilder.buildSparkSubmitArgs(SparkSubmitCommandBuilder.java:160)
    at org.apache.spark.launcher.SparkSubmitCommandBuilder.buildSparkSubmitCommand(SparkSubmitCommandBuilder.java:276)
    at org.apache.spark.launcher.SparkSubmitCommandBuilder.buildCommand(SparkSubmitCommandBuilder.java:151)
    at org.apache.spark.launcher.Main.main(Main.java:87)

So I guess something needs to be added to this command line for it to run, but I have no idea what.


2 Answers


Here:

spark-submit --master yarn --deploy-mode cluster --executor-memory 60g --executor-cores 6

you haven't specified the entry point or the application to run!

Check the spark-submit documentation, which states:

Some of the commonly used options are:

  • --class: The entry point for your application (e.g. org.apache.spark.examples.SparkPi)

  • --master: The master URL for the cluster (e.g. spark://23.195.26.187:7077)

  • --deploy-mode: Whether to deploy your driver on the worker nodes (cluster) or locally as an external client (client) (default: client)

  • --conf: Arbitrary Spark configuration property in key=value format. For values that contain spaces wrap “key=value” in quotes (as shown).

  • application-jar: Path to a bundled jar including your application and all dependencies. The URL must be globally visible inside of your cluster, for instance, an hdfs:// path or a file:// path that is present on all nodes.

  • application-arguments: Arguments passed to the main method of your main class, if any

For Python applications, simply pass a .py file in the place of <application-jar> instead of a JAR, and add Python .zip, .egg or .py files to the search path with --py-files.

Here is an example that takes some JARs and a Python file as the application (I didn't include your additional parameters for simplicity). Note that the .py file is passed in the application position, after the options:

./spark-submit --jars myjar1.jar,myjar2.jar path/to/my/main.py arg1 arg2
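For context, a minimal sketch of what such a main.py could contain (the file name, the assumption that arg1/arg2 are input and output paths, and the column name "value" are all hypothetical, just to make the example self-contained):

# main.py -- hypothetical minimal PySpark application run via spark-submit
import sys
from pyspark.sql import SparkSession

if __name__ == "__main__":
    spark = SparkSession.builder.appName("csv-filter").getOrCreate()

    # arg1 and arg2 from the spark-submit command arrive in sys.argv
    input_path, output_path = sys.argv[1], sys.argv[2]

    # read the CSVs, keep only the rows we care about, write the result back out
    df = spark.read.csv(input_path, header=True, inferSchema=True)
    df.filter(df["value"] > 0).write.csv(output_path, header=True)

    spark.stop()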

I hope I can enter the spark shell (with that much memory and cores) and type code in there

Then you need pyspark, not spark-submit! See: What is the difference between spark-submit and pyspark?

So what you really want to do is this (an interactive shell has to run in client deploy mode, so --deploy-mode cluster is left out):

pyspark --master yarn --executor-memory 60g --executor-cores 6
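Once the shell is up, a SparkSession called spark is already created for you, so you can read and filter your CSVs and hand the (now small) result over to pandas. A rough sketch, assuming a hypothetical input path and a hypothetical "status" column:

# inside the pyspark shell: `spark` already exists
df = spark.read.csv("/path/to/csvs/*.csv", header=True, inferSchema=True)

# keep only the rows you care about (the column and value are just examples)
filtered = df.filter(df["status"] == "OK")

# toPandas() collects everything to the driver, so only call it once the
# filtered data is small enough to fit in the driver's memory
pdf = filtered.toPandas()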
gsamaras
  • I'm not really sure if I understand. I have a lot of csv files on my machine and I need to filter them and save to pandas. So I don't have any file to run; I hope I can enter the spark shell (with that much memory and cores) and type code in there. – jovicbg Sep 20 '17 at 10:38

If I understand your question correctly, your total number of cores is 6 and your total memory is 60 GB. The parameters

--executor-memory
--executor-cores

are actually per executor inside Spark. You should probably try

--executor-memory 8G --executor-cores 1

This will create about 6 executors of 8 GB each (6 * 8 = 48 GB in total), leaving the remaining 12 GB for operating system processing and metadata.
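A sketch of how that sizing could look on the command line, whether for spark-submit or the interactive shell (--num-executors is the YARN flag that caps the executor count; the exact numbers are only an illustration of the idea above):

pyspark --master yarn --num-executors 6 --executor-memory 8G --executor-cores 1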

braj
  • No, no. I have about 12 cores, maybe more, and a lot of memory; 60 GB is about half of each. – jovicbg Sep 20 '17 at 11:40
  • In that case, can you just try spark-submit with just one parameter (the Python script) and see if you get any error. – braj Sep 20 '17 at 11:59
  • And yes, you haven't added the script that you want to run to spark-submit. – braj Sep 20 '17 at 12:02
  • I found the solution, thank you. I just needed to type pyspark instead of spark-submit. :) – jovicbg Sep 20 '17 at 12:07
  • spark-submit is for submitting a job, pyspark is for interactive Spark access – braj Sep 20 '17 at 12:28