
I have a Spark job written in Scala. I use

spark-shell -i <file-name>

to run the job. I need to pass a command-line argument to it. Right now, I invoke the script through a Linux task, where I do

export INPUT_DATE=2015/04/27 

and access the value via the environment variable with:

System.getenv("INPUT_DATE")
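
For reference, a minimal sketch of this pattern as it could look inside the script (the fallback date below is only an illustration, not part of the real job):

// Read the date from the environment; fall back to a placeholder when it is not set.
// The placeholder value is illustrative only.
val inputDate = Option(System.getenv("INPUT_DATE")).getOrElse("1970/01/01")
println(s"Running job for $inputDate")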

Is there a better way to handle command-line arguments in spark-shell?

Jeevs
  • Why would you want to pass an argument in spark-shell? Why don't you use the spark-submit script to run the job normally? – eliasah Apr 29 '15 at 13:20
  • Still running 0.9.1 in CDH 4.6; spark-submit is not available yet. – Jeevs May 01 '15 at 03:55
  • Another reason why you'd want to do that is to avoid the hassle of building a project if you are only running two lines of Scala code. See my answer below for how I solved this. – Amir Jun 19 '15 at 23:39

3 Answers


My solution is to use a custom key to define the arguments instead of spark.driver.extraJavaOptions, in case you someday pass in a value that might interfere with the JVM's behavior.

spark-shell -i your_script.scala --conf spark.driver.args="arg1 arg2 arg3"

You can access the arguments from within your Scala code like this:

val args = sc.getConf.get("spark.driver.args").split("\\s+")
args: Array[String] = Array(arg1, arg2, arg3)
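
If the key might be missing (for example when the script is sometimes launched without --conf), a hedged variant using SparkConf.getOption could look like this (the empty default is only an illustration):

// Fall back to an empty argument list when spark.driver.args is not set.
// "spark.driver.args" is the custom key from above; the default is illustrative.
val args: Array[String] = sc.getConf.getOption("spark.driver.args")
  .map(_.split("\\s+"))
  .getOrElse(Array.empty[String])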
soulmachine
  • Nice. Slightly cleaner than `spark.driver.extraJavaOptions`. – Amir Nov 14 '16 at 20:01
  • Actually you can do something easier with --conf spark.driver.arg1 --conf spark.driver.arg2; it seems that all configs prefixed with spark.driver are passed to the driver. – Rolintocour Jun 07 '19 at 12:46
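
Following that comment, a sketch of the per-key style (the key names and values are illustrative; each --conf entry has to be passed as key=value):

// Per-key variant, e.g. launched with:
//   spark-shell -i your_script.scala --conf spark.driver.arg1=foo --conf spark.driver.arg2=bar
// Read the keys back with getOption so a missing entry does not throw.
val arg1 = sc.getConf.getOption("spark.driver.arg1").getOrElse("")
val arg2 = sc.getConf.getOption("spark.driver.arg2").getOrElse("")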

Short answer:

spark-shell -i <(echo val theDate = $INPUT_DATE ; cat <file-name>)

Long answer:

This solution causes the following line to be added at the beginning of the file before it is passed to spark-shell:

val theDate = ...,

thereby defining a new variable. The way this is done (the <( ... ) syntax) is called process substitution; it is available in Bash. See this question for more on it, and for alternatives (e.g. mkfifo) for non-Bash environments.

Making this more systematic:

Put the code below in a script (e.g. spark-script.sh), and then you can simply run:

./spark-script.sh your_file.scala first_arg second_arg third_arg

and have an Array[String] called args with your arguments.

The file spark-script.sh:

#!/bin/bash
scala_file=$1
shift 1
arguments="$@"

#set +o posix  # to enable process substitution when not running on bash

spark-shell --master yarn --deploy-mode client \
    --queue default \
    --driver-memory 2G --executor-memory 4G \
    --num-executors 10 \
    -i <(echo 'val args = "'$arguments'".split("\\s+")' ; cat $scala_file)
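
Inside your_file.scala the injected args value can then be used like any other array; a small sketch (the names are illustrative):

// args is the Array[String] injected by the wrapper script above.
// Print every argument and pick out the first one, if any.
args.zipWithIndex.foreach { case (value, i) => println(s"arg$i = $value") }
val firstArg = if (args.nonEmpty) args(0) else ""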
Amir
  • Is there a better way to do it? Can I pass the arguments like spark-shell -i script.scala args1 args2, so that in the Scala file I can retrieve the arguments like args(1), args(2)? This is a Scala solution from http://alvinalexander.com/scala/scala-shell-script-command-line-arguments-args; however, it doesn't work in spark-shell. Do you have any suggestions? – HappyCoding Jan 13 '16 at 02:20

I use extraJavaOptions when I have a Scala script that is too simple to go through the build process but I still need to pass arguments to it. It's not beautiful, but it works, and you can quickly pass multiple arguments:

spark-shell -i your_script.scala --conf spark.driver.extraJavaOptions="-Darg1,arg2,arg3"

Note that -D does not belong to the arguments, which are arg1, arg2, and arg3. You can then access the arguments from within your Scala code like this:

// needed if SparkConf is not already in scope
import org.apache.spark.SparkConf

val sconf = new SparkConf()

// load the raw option string
val paramsString = sconf.get("spark.driver.extraJavaOptions")

// cut off the leading `-D`
val paramsSlice = paramsString.slice(2, paramsString.length)

// split the string on `,` and save the result in an array
val paramsArray = paramsSlice.split(",")

// access the parameters
val arg1 = paramsArray(0)
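
If the option might not be set at all, a hedged variant with SparkConf.getOption avoids the NoSuchElementException that sconf.get would otherwise throw (the empty default is illustrative):

// Same parsing as above, but tolerate a missing conf entry.
val safeParams = sconf.getOption("spark.driver.extraJavaOptions")
  .map(_.stripPrefix("-D").split(","))
  .getOrElse(Array.empty[String])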
Nico