
I would like some explanation about how to combine the following R packages:

- odbc: used to connect to an existing Oracle data source

- sparklyr: used to process this data on a standalone Spark cluster

Here is what I have done:

- On my client computer, I used the dbConnect() function from the odbc R package to connect to an existing Oracle database. This Oracle database is hosted on a Windows server.

- Separately, I implemented a standalone Spark cluster with some computers located on the same local network but isolated from the Windows server. With this Spark cluster, I would like to use the spark_connect() function of the sparklyr package to connect my client computer (which is connected to my Oracle database) to the Spark cluster. To sum up, my objective is to use the standalone Spark cluster to execute parallel processing (e.g. ml_regression_trees) on the data stored in my Oracle database.
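For reference, here is a rough sketch of the two connections described above (the ODBC driver name, hosts, ports and credentials are placeholders for my actual setup):

library(DBI)
library(odbc)
library(sparklyr)

### ODBC connection to the Oracle database (placeholder driver name and credentials)
con <- dbConnect(odbc::odbc(),
                 Driver = "Oracle ODBC Driver",   # name of the installed Oracle ODBC driver
                 DBQ    = "//<host>:1521/<service>",
                 UID    = "user",
                 PWD    = "password")

### connection to the standalone Spark cluster (placeholder master URL)
sc <- spark_connect(master     = "spark://<master-host>:7077",
                    spark_home = "/path/to/spark")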

Does anyone know whether there is a function in sparklyr to do all of this directly? (I mean: connecting to the Oracle database + big data processing with Spark.)

Thank you very much for your help (any advice is welcome!)

JeanBertin
  • I didn't try ODBC, but when I read data from Oracle into Spark I use `spark_read_jdbc()`. If you are interested in the JDBC solution, I can post it here. – nachti Mar 15 '18 at 10:01

1 Answer


Knowing that you asked for an ODBC way, here is a JDBC solution instead (probably useful for other users, and ODBC is not mentioned in the question title).

You need to have ojdbc7.jar somewhere (in this case in your working directory, but I recommend storing it in a central location and providing the full path here). Change the required values like spark_home etc. If you are running R on your client computer (and not on an edge node in the cluster), you might use Livy to connect to Spark (a sketch follows after the code below).

library(sparklyr)
library(RJDBC)

##### Spark
config <- spark_config()
### tell config location of oracle jar
config[["sparklyr.jars.default"]] <- "ojdbc7.jar"
### example spark_home
sc <- spark_connect(master = "yarn-client",
                    spark_home = "/usr/lib/spark",
                    version = "2.2.0",
                    config = config)

datspark <- spark_read_jdbc(sc, "table", options = list(
  url = "jdbc:oracle:thin:@//<ip>:1521/<schema>",
  driver = "oracle.jdbc.OracleDriver",
  user = "user",
  password = "password",
  dbtable = "table"),
  memory = FALSE # don't cache the whole (big) table
  )

### your R code here

spark_disconnect(sc)
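
As mentioned above, if R runs on your client computer rather than on an edge node, the same connection can go through Livy instead. A rough sketch, assuming a Livy server is reachable at the placeholder URL below:

### Livy-based connection (placeholder endpoint; reuse the jar config from above)
sc <- spark_connect(master = "http://<livy-host>:8998",
                    method = "livy",
                    config = config)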
nachti
  • Thanks ... worked nicely! Just had to change to `ojdbc6.jar` and was good to go. – Tarun Parmar Mar 19 '18 at 20:04
  • @TarunParmar: It's just the version. See: https://stackoverflow.com/questions/31931251/what-is-the-difference-between-ojdbc6-jar-and-ojdbc7-jar – nachti Mar 20 '18 at 09:16
  • Thanks! Worked for me too... Do you know if it is possible to save the Spark table (called datspark) to my computer in HDFS format using sparklyr? – JeanBertin Apr 12 '18 at 12:39
  • @John: HDFS is not a format but a file system. If you want to write e.g. csv or parquet files to HDFS, see http://spark.rstudio.com/dplyr/ (a short sketch follows below these comments). – nachti Apr 12 '18 at 15:10
  • Hi, I tried using this with a remote Oracle 12c database and a Spark setup on my local machine, but I keep getting an error. – Thisara Watawana Jan 17 '19 at 13:34
  • @ThisaraWatawana: Could you be a bit more specific -- which error? – nachti Jan 22 '19 at 10:02
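
Following up on the comment about writing to HDFS: a minimal sketch using sparklyr's file writers, with a placeholder target path:

### write the Spark DataFrame to HDFS as parquet (placeholder path; spark_write_csv works the same way)
spark_write_parquet(datspark, path = "hdfs:///user/<you>/datspark")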