
We will be hosting an EMR cluster (with spot instances) on AWS, running on top of an S3 bucket. Data will be stored in this bucket in ORC format. However, we also want to use R, as a kind of sandbox environment, to read that same data.

I've got the aws.s3 package (cloudyr) running correctly: I can read CSV files without a problem, but it does not seem to let me convert the ORC files into something readable.
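
For context, the working CSV path looks roughly like this (the bucket and object names are placeholders, and credentials are assumed to come from the usual AWS_ environment variables):

library(aws.s3)

# Reading a CSV object straight from S3 into a data.frame works fine
csv_df <- s3read_using(FUN = read.csv,
                       object = "path/to/file.csv",
                       bucket = "bucketname")

# There is no equivalent FUN for ORC, which is what this question is about.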

The two options I found online were:

- SparkR
- dataconnector (Vertica)

Since installing dataconnector on a Windows machine was problematic, I installed SparkR instead, and I can now read a local ORC file (R running locally on my machine, ORC file also on my machine). However, if I call read.orc, it normalizes my path to a local path by default. Digging into the source code, I ran the following:

# Reproduce the internals of read.orc, skipping the normalizePath step
sparkSession <- SparkR:::getSparkSession()
options <- SparkR:::varargsToStrEnv()
read <- SparkR:::callJMethod(sparkSession, "read")
read <- SparkR:::callJMethod(read, "options", options)
sdf <- SparkR:::handledCallJMethod(read, "orc", my_path)

But I obtained the following error:

Error: Error in orc : java.io.IOException: No FileSystem for scheme: https

Could someone help me with this problem, or point me to an alternative way to load ORC files from S3?

Wannes Rosiers
  • You've tagged this [vertica]. Are you already using R to read data in Vertica and you're stuck on the ORC/S3 part? – Monica Cellio Mar 22 '17 at 16:58
  • I tagged vertica since the R package dataconnector is a Vertica product. Actually I am using R to read data into R itself, and I'm stuck on the ORC part (reading from S3 works, but not in a readable format). – Wannes Rosiers Mar 23 '17 at 06:59
  • What version of Vertica? (ORC integration has been under active development in the last few releases.) I can help with ORC -> Vertica, but I don't know anything about the R part. Does that help you? – Monica Cellio Mar 23 '17 at 14:18
  • For now it would not be a solution, since it's an R package from Vertica, but not really Vertica. If I don't find the solution, I might shift to Vertica, and then it would turn out to be handy. – Wannes Rosiers Mar 23 '17 at 14:35
  • @WannesRosiers Any update on this? I am having the same challenge at the moment (trying to read ORC files from S3 directly in R). Thanks! – michalrudko Apr 11 '17 at 12:54

1 Answer


Edited answer: now you can read directly from S3 instead of first downloading and reading from the local file system

At the request of mrjoseph: a possible solution via SparkR (which I did not want to do in the first place).

# Set the System environment variable to where Spark is installed
Sys.setenv(SPARK_HOME="pathToSpark")
Sys.setenv('SPARKR_SUBMIT_ARGS'='"--packages" "org.apache.hadoop:hadoop-aws:2.7.1" "sparkr-shell"')

# Set the library path to include path to SparkR
.libPaths(c(file.path(Sys.getenv("SPARK_HOME"),"R","lib"), .libPaths()))

# Set system environments to be able to load from S3
Sys.setenv("AWS_ACCESS_KEY_ID" = "myKeyID", "AWS_SECRET_ACCESS_KEY" = "myKey", "AWS_DEFAULT_REGION" = "myRegion")

# load required packages
library(aws.s3)
library(SparkR)

## Create a Spark context and a SQL context
sc <- sparkR.init(master = "local")
sqlContext <- sparkRSQL.init(sc)

# Set path to file
path <- "s3n://bucketname/filename.orc"

# Set the Hadoop configuration to use the s3n filesystem with your credentials
hConf <- SparkR:::callJMethod(sc, "hadoopConfiguration")
SparkR:::callJMethod(hConf, "set", "fs.s3n.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
SparkR:::callJMethod(hConf, "set", "fs.s3n.awsAccessKeyId", "myAccessKey")
SparkR:::callJMethod(hConf, "set", "fs.s3n.awsSecretAccessKey", "mySecretKey")

# Slight adaptation to read.orc function
sparkSession <- SparkR:::getSparkSession()
options <- SparkR:::varargsToStrEnv()
# Not required: path <- normalizePath(path)
read <- SparkR:::callJMethod(sparkSession, "read")
read <- SparkR:::callJMethod(read, "options", options)
sdf <- SparkR:::handledCallJMethod(read, "orc", path)
temp <- SparkR:::dataFrame(sdf)

# Read first lines
head(temp)
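
If you just need an ordinary local data.frame rather than a SparkDataFrame (and the result fits in memory), collecting it should work:

# Pull the result into a plain R data.frame
local_df <- collect(temp)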
Wannes Rosiers
  • If your version of Spark is built against Hadoop 2.7.x or later, use s3a: URLs, including switching to their auth (which is a bit more than just replacing s3n with s3a in the config keys). S3A will automatically pick up the AWS_ environment variable secrets and, on EC2, the VM's own credentials, so your life may actually be easier. – stevel Apr 13 '17 at 09:30
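
Following that suggestion, a rough s3a variant of the configuration above might look like this (untested here; the fs.s3a.* keys come from the hadoop-aws documentation, and the credential placeholders are hypothetical):

# Sketch: same flow as above, but using the s3a filesystem (Hadoop 2.7.x+)
path <- "s3a://bucketname/filename.orc"

hConf <- SparkR:::callJMethod(sc, "hadoopConfiguration")
SparkR:::callJMethod(hConf, "set", "fs.s3a.access.key", "myAccessKey")
SparkR:::callJMethod(hConf, "set", "fs.s3a.secret.key", "mySecretKey")

# With AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY set in the environment
# (or an EC2 instance profile), the explicit keys above can usually be dropped.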