I would like some explanation of how to combine the following R packages:
- `odbc`: used to connect to an existing Oracle data source
- `sparklyr`: used to process this data on a standalone Spark cluster
Here is what I have done:
- On my client computer, I used the `dbConnect()` function (from `DBI`, with the `odbc` package as the driver) to connect to an existing Oracle database. This Oracle database is hosted on a Windows server.
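For reference, the connection step looks roughly like this. This is only a sketch: the DSN name, credentials, and table name are placeholders, not values from my actual setup.

```r
# Connect to Oracle through an ODBC DSN configured on the client machine.
# "OracleODBC", the credentials, and MY_TABLE are all hypothetical.
library(DBI)
library(odbc)

con <- dbConnect(odbc::odbc(),
                 dsn = "OracleODBC",   # hypothetical DSN name
                 uid = "my_user",      # placeholder credentials
                 pwd = "my_password")

# Pull a table from Oracle into a local R data frame
df <- dbGetQuery(con, "SELECT * FROM MY_TABLE")
```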
- Separately, I set up a standalone Spark cluster on several computers located on the same local network, but isolated from the Windows server. Using this cluster, I would like to call the `spark_connect()` function from the `sparklyr` package to connect my client computer (which is already connected to my Oracle database) to the Spark cluster.
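The cluster connection I have in mind would look something like the following sketch; the master URL (host and port) is a placeholder for my cluster's actual address.

```r
# Connect from the client machine to the standalone Spark cluster.
# The master URL below is a placeholder, not my real cluster address.
library(sparklyr)

sc <- spark_connect(master = "spark://192.168.0.10:7077")
```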
To summarize, my objective is to use the standalone Spark cluster to run parallel processing (e.g. `ml_regression_trees`) on data stored in my Oracle database.
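The only workflow I can picture so far is a two-step one (I am not assuming any single built-in function exists for this): pull the data from Oracle with `DBI`/`odbc`, copy it to the cluster with sparklyr's `copy_to()`, then fit a model there. A minimal sketch, with all names, URLs, and columns as placeholders:

```r
# Two-step workflow: Oracle -> local R -> Spark cluster.
# DSN, master URL, table, and model columns are all hypothetical.
library(DBI)
library(odbc)
library(sparklyr)

con <- dbConnect(odbc::odbc(), dsn = "OracleODBC")           # hypothetical DSN
df  <- dbGetQuery(con, "SELECT * FROM MY_TABLE")             # local data frame

sc  <- spark_connect(master = "spark://192.168.0.10:7077")   # placeholder URL
tbl <- copy_to(sc, df, name = "my_table", overwrite = TRUE)  # ship data to Spark

# Fit a tree model on the cluster (formula is a placeholder)
fit <- ml_decision_tree(tbl, y ~ x1 + x2, type = "regression")
```

My concern with this approach is that the data transits through the client's memory, which is why I am asking whether a more direct route exists.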
Does anyone know whether `sparklyr` provides a function to do all of this directly? (I mean: connecting to the Oracle database and processing the data on Spark in one step.)
Thank you very much for your help (any advice is welcome!)