We will be hosting an EMR cluster (with spot instances) on AWS, running on top of an S3 bucket. Data will be stored in this bucket in ORC format. However, we also want to read the same data from R, as a kind of sandbox environment.
I've got the package aws.s3 (cloudyr) running correctly: I can read CSV files without a problem, but it doesn't seem to offer a way to read the ORC files into something usable.
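For reference, the working CSV path looks roughly like this (the bucket and object names below are placeholders, not my real ones):

library(aws.s3)
# works: fetch a CSV object from S3 and parse it into a data.frame
df <- s3read_using(FUN = read.csv, object = "some-file.csv", bucket = "my-bucket")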
The two options I found online were:
- SparkR
- dataconnector (Vertica)
Since installing dataconnector on a Windows machine was problematic, I installed SparkR, and I am now able to read a local ORC file (R running locally on my machine, ORC file local on my machine). However, if I try read.orc with the S3 location, by default it normalizes my path to a local path.
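A minimal sketch of the failing call (the bucket and key below are placeholders, not my real path):

library(SparkR)
sparkR.session()
# read.orc runs the path through normalizePath(), so a remote URL
# gets rewritten into a local filesystem path before the read
df <- read.orc("https://s3.amazonaws.com/my-bucket/path/to/file.orc")

Digging into the source code of read.orc, I ran its underlying calls directly, skipping the path normalization: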
# the internals of read.orc, minus the normalizePath() call
sparkSession <- SparkR:::getSparkSession()
options <- SparkR:::varargsToStrEnv()
read <- SparkR:::callJMethod(sparkSession, "read")        # sparkSession.read
read <- SparkR:::callJMethod(read, "options", options)    # apply (empty) reader options
sdf <- SparkR:::handledCallJMethod(read, "orc", my_path)  # DataFrameReader.orc(my_path)
But I obtained the following error:
Error: Error in orc : java.io.IOException: No FileSystem for scheme: https
Could someone help me either with this problem or point me to an alternative way of loading ORC files from S3?