5

Is the sparklyr R package able to connect to YARN-managed Hadoop clusters? This doesn't seem to be documented in the cluster deployment documentation. Using the SparkR package that ships with Spark, it is possible by doing:

# set R environment variables
Sys.setenv(YARN_CONF_DIR=...)
Sys.setenv(SPARK_CONF_DIR=...)
Sys.setenv(LD_LIBRARY_PATH=...)
Sys.setenv(SPARKR_SUBMIT_ARGS=...)

sparkr_lib_dir <- ... # installation-specific
library(SparkR, lib.loc = c(sparkr_lib_dir, .libPaths()))
sc <- sparkR.init(master = "yarn-client")

However, when I swapped the last two lines above with

library(sparklyr)
sc <- spark_connect(master = "yarn-client")

I get errors:

Error in start_shell(scon, list(), jars, packages) : 
  Failed to launch Spark shell. Ports file does not exist.
    Path: /usr/hdp/2.4.2.0-258/spark/bin/spark-submit
    Parameters: '--packages' 'com.databricks:spark-csv_2.11:1.3.0,com.amazonaws:aws-java-sdk-pom:1.10.34' '--jars' '<path to R lib>/3.2/sparklyr/java/rspark_utils.jar'  sparkr-shell /tmp/RtmpT31OQT/filecfb07d7f8bfd.out

Ivy Default Cache set to: /home/mpollock/.ivy2/cache
The jars for the packages stored in: /home/mpollock/.ivy2/jars
:: loading settings :: url = jar:file:<path to spark install>/lib/spark-assembly-1.6.1.2.4.2.0-258-hadoop2.7.1.2.4.2.0-258.jar!/org/apache/ivy/core/settings/ivysettings.xml
com.databricks#spark-csv_2.11 added as a dependency
com.amazonaws#aws-java-sdk-pom added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0
    confs: [default]
:: resolution report :: resolve 480ms :: artifacts dl 0ms
    :: modules in use:
    -----------------------------------------

Is sparklyr an alternative to SparkR or is it built on top of the SparkR package?

Matt Pollock
  • Looking at the [sparkapi](https://github.com/rstudio/sparkapi) readme the answer to the last question is clearly "it is an alternative to SparkR". Still not sure how to use `master='yarn-client'` though – Matt Pollock Jun 29 '16 at 15:09
  • Related question: http://stackoverflow.com/questions/38486163/sparklyr-ports-file-and-java-error-mac-os - seems that the issue keeps popping up in different OS & configurations – desertnaut Jul 21 '16 at 18:25

4 Answers

5

Yes, sparklyr can be used against a YARN-managed cluster. In order to connect to YARN-managed clusters one needs to:

  1. Set the SPARK_HOME environment variable to point to the right Spark home directory.
  2. Connect to the Spark cluster using the appropriate master location, for instance: sc <- spark_connect(master = "yarn-client") (see the sketch below).

See also: http://spark.rstudio.com/deployment.html
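For illustration, here is a minimal sketch of those two steps (the SPARK_HOME path below is only a placeholder, not taken from the answer; use your cluster's actual Spark installation):

library(sparklyr)

# 1. Point SPARK_HOME at the cluster's Spark installation
#    (illustrative path -- replace with your own)
Sys.setenv(SPARK_HOME = "/usr/hdp/current/spark-client")

# 2. Connect through YARN in client mode
sc <- spark_connect(master = "yarn-client")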

Javier Luraschi
  • I tried setting SPARK_HOME which took, but the ports file issue remains. It is not clear to me exactly what `spark_connect` is looking for or where it is looking. Is it necessary to pull out names and ports from `yarn-site.xml`? – Matt Pollock Jun 30 '16 at 12:50
  • Currently, `sparklyr` is an alternative to `SparkR`; I have not tried using them both side-by-side since this is currently unsupported. Could you confirm that you are running your script without the `SparkR` library loaded? If that still does not work, could you dump your system information: OS, version, x86/x64, Spark redistribution, etc. for us to take a look and reproduce this? It would also be appreciated to open this issue under https://github.com/rstudio/sparklyr to have more people helping unblock this. – Javier Luraschi Jun 30 '16 at 16:33
  • I finally got things working by adding `config=list()` to the inputs of `spark_connect()`. Seems that the error message is a bit misleading. Is the real issue around getting the spark packages installed? – Matt Pollock Aug 18 '16 at 20:24
  • In older versions of `sparklyr` we specified a CSV package that during `spark_connect()`, Spark would download from Spark's online package repo and therefore, `spark_connect()` required internet connectivity unless `config = list()` was specified to override adding this CSV package. Newer versions of `sparklyr` embed the CSV package to avoid requiring internet connectivity and the `config=list()` is no longer required for offline clusters. – Javier Luraschi Sep 13 '17 at 17:17
2

Yes it can, but there is one catch to everything else that has been written, one that is rarely made explicit in the blog posts on the subject: configuring the resources.

The key is this: when you execute in local mode you do not have to configure the resources declaratively, but when you execute on a YARN cluster, you absolutely do have to declare those resources. It took me a long time to find the article that shed some light on this issue, but once I tried it, it worked.

Here's an (arbitrary) example with the key reference:

library(sparklyr)

Sys.setenv(SPARK_HOME = "/usr/local/spark")
Sys.setenv(HADOOP_CONF_DIR = '/usr/local/hadoop/etc/hadoop/conf')
Sys.setenv(YARN_CONF_DIR = '/usr/local/hadoop/etc/hadoop/conf')

# Declare the resources explicitly -- this is what YARN requires
config <- spark_config()
config$spark.executor.instances <- 4
config$spark.executor.cores <- 4
config$spark.executor.memory <- "4G"

# On a larger cluster you might instead set, for example:
# config$spark.driver.cores    <- 32
# config$spark.executor.cores  <- 32
# config$spark.executor.memory <- "40g"

sc <- spark_connect(master = "yarn-client", config = config, version = '2.1.0')

R Bloggers Link to Article

ProfVersaggi
0

Are you possibly using Cloudera Hadoop (CDH)?

I am asking as I had the same issue when using the CDH-provided Spark distro:

Sys.getenv('SPARK_HOME')
[1] "/usr/lib/spark"  # CDH-provided Spark
library(sparklyr)
sc <- spark_connect(master = "yarn-client")
Error in sparkapi::start_shell(master = master, spark_home = spark_home,  : 
      Failed to launch Spark shell. Ports file does not exist.
        Path: /usr/lib/spark/bin/spark-submit
        Parameters: --jars, '/u01/app/oracle/product/12.1.0.2/dbhome_1/R/library/sparklyr/java/sparklyr.jar', --packages, 'com.databricks:spark-csv_2.11:1.3.0','com.amazonaws:aws-java-sdk-pom:1.10.34', sparkr-shell, /tmp/Rtmp6RwEnV/file307975dc1ea0.out

Ivy Default Cache set to: /home/oracle/.ivy2/cache
The jars for the packages stored in: /home/oracle/.ivy2/jars
:: loading settings :: url = jar:file:/usr/lib/spark/lib/spark-assembly-1.6.0-cdh5.7.0-hadoop2.6.0-cdh5.7.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
com.databricks#spark-csv_2.11 added as a dependency
com.amazonaws#aws-java-sdk-pom added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0
    confs: [default]
    found com.databricks#spark-csv_2.11;1.3.0 in central
    found org.apache.commons#commons-csv;1.1 in central
    found com.univocity#univocity-parsers;1.5.1 in central
    found com.

However, after I downloaded a pre-built version from Databricks (Spark 1.6.1, Hadoop 2.6) and pointed SPARK_HOME there, I was able to connect successfully:

Sys.setenv(SPARK_HOME = '/home/oracle/spark-1.6.1-bin-hadoop2.6') 
sc <- spark_connect(master = "yarn-client") # OK
library(dplyr)
iris_tbl <- copy_to(sc, iris)
src_tbls(sc)
[1] "iris"

Cloudera does not yet include SparkR in its distribution, and I suspect that sparklyr may still have some subtle dependency on SparkR. Here are the results when trying to work with the CDH-provided Spark, but using the config=list() argument, as suggested in this thread from the sparklyr issues on GitHub:

sc <- spark_connect(master='yarn-client', config=list()) # with CDH-provided Spark
Error in sparkapi::start_shell(master = master, spark_home = spark_home,  : 
  Failed to launch Spark shell. Ports file does not exist.
    Path: /usr/lib/spark/bin/spark-submit
    Parameters: --jars, '/u01/app/oracle/product/12.1.0.2/dbhome_1/R/library/sparklyr/java/sparklyr.jar', sparkr-shell, /tmp/Rtmpi9KWFt/file22276cf51d90.out

Error: sparkr.zip does not exist for R application in YARN mode.

Also, if you check the rightmost part of the Parameters part of the error (both yours and mine), you'll see a reference to sparkr-shell...

(Tested with sparklyr 0.2.28, sparkapi 0.3.15, R session from RStudio Server, Oracle Linux)

desertnaut
  • Thanks much. I am however on an HDP cluster with Spark 1.6.1 – so the under-the-hood R methods should be available in Spark. The issue seems to be that I lack a certain port config file that is not apparently needed for anything else. – Matt Pollock Jul 20 '16 at 20:54
0

An upgrade to sparklyr version 0.2.30 or newer is recommended for this issue. Upgrade using devtools::install_github("rstudio/sparklyr") followed by restarting the R session.
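A minimal sketch of the upgrade (assuming the devtools package is available):

# install the development version of sparklyr from GitHub
devtools::install_github("rstudio/sparklyr")

# restart the R session, then reload the package
library(sparklyr)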

Javier Luraschi
  • Thanks for following up, but updating (to 0.2.31) did not resolve the port file issue. The Spark installation on my cluster does not seem to have the config file that is expected. `sparklyr` tried to call `.../spark/bin/spark-submit` but the config files are in `.../spark/conf`, which has things like `hive-site.xml` and `spark-defaults.conf` but no "ports" file. – Matt Pollock Jul 26 '16 at 18:57
  • I should note that this spark installation has been heavily used with both `pyspark` and `SparkR` without issue. – Matt Pollock Jul 26 '16 at 18:59