
At http://ec2-54-186-47-36.us-west-2.compute.amazonaws.com:8080/ I can see my Spark cluster: one master node and two worker nodes. Running jps on the two workers and the master shows that all services are up. I am using the following script to initialise the SparkR session:

    if (nchar(Sys.getenv("SPARK_HOME")) < 1) {
      Sys.setenv(SPARK_HOME = "/home/ubuntu/spark")
    }
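
For RStudio specifically, SparkR itself is typically loaded from this installation first, per the SparkR docs (a standard sketch, assuming the default layout of a Spark distribution):

    # load the SparkR package bundled with the Spark installation
    library(SparkR, lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib")))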

But whenever I try to initialize the session from RStudio with the call below, it fails with the following error. Please advise; without this I cannot get the real benefit of the cluster.

    sparkR.session(master = "yarn", deployMode = "cluster",
                   sparkConfig = list(spark.driver.memory = "2g"),
                   sparkPackages = "com.databricks:spark-csv_2.11:1.1.0")

    Launching java with spark-submit command /home/ubuntu/spark/bin/spark-submit  --packages com.databricks:spark-csv_2.11:1.1.0 --driver-memory "2g" "--packages" "com.databricks:spark-csv_2.11:1.1.0" "sparkr-shell" /tmp/RtmpkSWHWX/backend_port29310cbc7c6
    Ivy Default Cache set to: /home/rstudio/.ivy2/cache
    The jars for the packages stored in: /home/rstudio/.ivy2/jars
    :: loading settings :: url = jar:file:/home/ubuntu/spark/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
    com.databricks#spark-csv_2.11 added as a dependency
    :: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0
        confs: [default]
        found com.databricks#spark-csv_2.11;1.1.0 in central
        found org.apache.commons#commons-csv;1.1 in central
        found com.univocity#univocity-parsers;1.5.1 in central
    :: resolution report :: resolve 441ms :: artifacts dl 24ms
        :: modules in use:
        com.databricks#spark-csv_2.11;1.1.0 from central in [default]
        com.univocity#univocity-parsers;1.5.1 from central in [default]
        org.apache.commons#commons-csv;1.1 from central in [default]
        ---------------------------------------------------------------------
        |                  |            modules            ||   artifacts   |
        |       conf       | number| search|dwnlded|evicted|| number|dwnlded|
        ---------------------------------------------------------------------
        |      default     |   3   |   0   |   0   |   0   ||   3   |   0   |
        ---------------------------------------------------------------------
    :: retrieving :: org.apache.spark#spark-submit-parent
        confs: [default]
        0 artifacts copied, 3 already retrieved (0kB/18ms)


    Setting default log level to "WARN".
    To adjust logging level use sc.setLogLevel(newLevel).
    17/09/24 23:15:34 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    17/09/24 23:15:42 ERROR SparkContext: Error initializing SparkContext.
    org.apache.spark.SparkException: Yarn application has already ended! It might have been killed or unable to launch application master.
    at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.waitForApplication(YarnClientSchedulerBackend.scala:85)
    at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:62)
    at org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:149)
    at org.apache.spark.SparkContext.<init>(SparkContext.scala:500)
    at org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:58)
    at org.apache.spark.api.r.RRDD$.createSparkContext(RRDD.scala:129)
    at org.apache.spark.api.r.RRDD.createSparkContext(RRDD.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.spark.api.r.RBackendHandler.handleMethodCall(RBackendHandler.scala:141)
    at org.apache.spark.api.r.RBackendHandler.channelRead0(RBackendHandler.scala:86)
    at org.apache.spark.api.r.RBackendHandler.channelRead0(RBackendHandler.scala:38)
    at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
    at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
    at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:244)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
    at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:846)
    at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:131)
    at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
    at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
    at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:137)
    at java.lang.Thread.run(Thread.java:748)
    17/09/24 23:15:42 WARN YarnSchedulerBackend$YarnSchedulerEndpoint: Attempted to request executors before the AM has registered!
    17/09/24 23:15:42 WARN MetricsSystem: Stopping a MetricsSystem that is not running
    17/09/24 23:15:42 ERROR RBackendHandler: createSparkContext on org.apache.spark.api.r.RRDD failed
Error in invokeJava(isStatic = TRUE, className, methodName, ...) : 
  org.apache.spark.SparkException: Yarn application has already ended! It might have been killed or unable to launch application master.
    at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.waitForApplication(YarnClientSchedulerBackend.scala:85)
    at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:62)
    at org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:149)
    at org.apache.spark.SparkContext.<init>(SparkContext.scala:500)
    at org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:58)
    at org.apache.spark.api.r.RRDD$.createSparkContext(RRDD.scala:129)
    at org.apache.spark.api.r.RRDD.createSparkContext(RRDD.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)

1 Answer


Interactive Spark shells and sessions, such as those from RStudio (for R) or from Jupyter notebooks, cannot be run in cluster mode; you should change to deployMode = "client".
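
Applied to your call, keeping everything else the same, that would be (only deployMode changes):

    sparkR.session(master = "yarn", deployMode = "client",   # was "cluster"
                   sparkConfig = list(spark.driver.memory = "2g"),
                   sparkPackages = "com.databricks:spark-csv_2.11:1.1.0")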

Here is what happens when trying to run a SparkR shell with --deploy-mode cluster (the situation is practically the same with RStudio):

    $ ./sparkR --master yarn --deploy-mode cluster
    R version 3.3.2 (2016-10-31) -- "Sincere Pumpkin Patch"
    [...]
    Error: Cluster deploy mode is not applicable to Spark shells.

See also this answer for the PySpark case.

This does not mean that you lose the distributed benefits of Spark (i.e. cluster computations) in such sessions; from the docs:

There are two deploy modes that can be used to launch Spark applications on YARN. In cluster mode, the Spark driver runs inside an application master process which is managed by YARN on the cluster, and the client can go away after initiating the application. In client mode, the driver runs in the client process, and the application master is only used for requesting resources from YARN.
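
To see for yourself that a client-mode session still runs on the cluster, you can submit a trivial SparkR job and watch it appear in the YARN ResourceManager UI; a minimal check, adapted from the SparkR programming guide (faithful is a built-in R dataset):

    df <- as.DataFrame(faithful)   # ships a local R data frame to the cluster
    head(summarize(groupBy(df, df$waiting), count = n(df$waiting)))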
