
Similar questions have been asked, but I'm unable to make this work using any of the provided solutions. I want to install SparkR on an EMR cluster, and specifically want SparkR rather than sparklyr because of its user-defined function capabilities. I also eventually want to use RStudio, meaning I don't want to open an EMR notebook and use the SparkR kernel there (which errors on every cell).

Here are the specs for the cluster's Linux environment:

lsb_release -a
LSB Version:    :core-4.1-amd64:core-4.1-noarch:cxx-4.1-amd64:cxx-4.1-noarch:desktop-4.1-amd64:desktop-4.1-noarch:languages-4.1-amd64:languages-4.1-noarch:printing-4.1-amd64:printing-4.1-noarch
Distributor ID: Amazon
Description:    Amazon Linux release 2 (Karoo)
Release:        2
Codename:       Karoo

My Spark version is 2.4.7, and the cluster has a pre-installed R from the amzn2-core repo. To permit installation of devtools, I use these commands from a co-worker:

sudo yum -y install libcurl libcurl-devel  1>&2
sudo yum -y install R-devel

and then fire up R from the command line and try to install SparkR using devtools, as suggested in Installing of SparkR:

install.packages("devtools", Ncpus=12)
devtools::install_github('apache/spark@v2.4.7', subdir='R/pkg')

ERROR: this R is version 3.4.3, package 'SparkR' requires R >= 3.5
Warning message:
In i.p(...) :

So I need to update the R version. I fire up a fresh cluster and update R using the command suggested in How to install R language version 4 in AWS EMR - Amazon linux 2:

sudo amazon-linux-extras install R4
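
For reference, a quick way to sanity-check which R the shell picks up after that upgrade (nothing EMR-specific here, just base R):

R.version.string                # should now report a 4.x version in the session
Sys.which(c("R", "Rscript"))    # which R/Rscript binaries are first on the PATH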

But I can't get SparkR to work in this installation either. After opening R from the command line and installing devtools and SparkR as above, I try to start a session and get a warning:

sparkR.session(master="yarn",sparkHome="/usr/lib/spark")
Warning message:
In sparkR.session(master = "yarn", sparkHome = "/usr/lib/spark") :
  Version mismatch between Spark JVM and SparkR package. JVM version was 2.4.7-amzn-0 , while R package version was 2.4.7
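
As an aside on that warning: I wonder whether loading the SparkR build that ships with EMR's Spark, rather than the GitHub build, would make the versions agree. A rough, untested sketch — the library path is my assumption based on EMR's default SPARK_HOME of /usr/lib/spark:

# Untested: prefer the SparkR bundled with EMR's Spark (the 2.4.7-amzn-0 build)
# over the copy that devtools installed from GitHub.
.libPaths(c("/usr/lib/spark/R/lib", .libPaths()))
library(SparkR)
sparkR.session(master = "yarn", sparkHome = "/usr/lib/spark")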

Warning aside, elementary Spark operations then produce errors:

df <- as.DataFrame(list(1,2,3),"foo")
head(df)
Lost task 0.3 in stage 1.0 (TID 4, ip-10-130-27-104.columbuschildrens.net, executor 4): org.apache.spark.SparkException: R computation failed with
 Error in unserialize(SparkR:::readRaw(inputCon)) :
  cannot read workspace version 3 written by R 4.0.2; need R 3.5.0 or newer
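
If I'm reading that right, the driver (R 4.0.2) is serializing data that the executors then try to deserialize with the old pre-3.5 R. One idea I haven't been able to verify is forcing the executors onto the same R 4 binary via spark.r.command — a rough sketch, assuming R 4 is actually installed on every core/task node (e.g. through a bootstrap action running sudo amazon-linux-extras install R4) and that its Rscript lands at /usr/bin/Rscript:

# Untested: tell Spark which Rscript the executors should run, so driver and
# workers (de)serialize with the same R version. Assumes /usr/bin/Rscript is
# the R 4 build on every node, which I have not confirmed.
sparkR.session(
  master      = "yarn",
  sparkHome   = "/usr/lib/spark",
  sparkConfig = list(spark.r.command = "/usr/bin/Rscript")
)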

The same error message is addressed in these posts: "Cannot read workspace version 3 written by R 4.0.2; need R 3.5.0 or newer" and "Error during install.package: cannot read unreleased workspace version 3 written by experimental R 3.5.0". It looks like SparkR is writing partitions under one workspace version and then trying to read them under another, hence the error. At this point, though, I've tried many combinations of updating and re-installing and am at my wits' end. Any help would be appreciated.

Jeff
  • I resolved this issue by ditching SparkR in favor of sparklyr. Contrary to what I stated above, sparklyr actually has quite serviceable UDF functionality. – Jeff Nov 18 '21 at 18:45
