
I'm trying to assemble a big data infrastructure on my local machine: MongoDB > Apache Spark > RStudio with sparklyr. I can't find a way to connect sparklyr to MongoDB. There are a few old posts on the Internet, but no working solution yet. The MongoDB connectors list support for SparkR, but that package is no longer on CRAN.

With PySpark I could connect, and it works with the following configuration:

# import SparkSession from the pyspark package
from pyspark.sql import SparkSession

# initiate the connection
my_spark = SparkSession \
    .builder \
    .appName("Analysis") \
    .config("http://spark.mongodb.read.connection.uri", "mongodb://127.0.0.1/safricadb.vacancy") \
    .config("spark.mongodb.write.connection.uri", "mongodb://127.0.0.1/safricadb.vacancy") \
    .config('spark.jars.packages', 'org.mongodb.spark:mongo-spark-connector:10.0.2')\
    .getOrCreate()

# load
df = my_spark.read.format('com.mongodb.spark.sql.connector.MongoTableProvider').load()

Could someone offer me guidance on how I should approach this connection with sparklyr?

(Versions: MongoDB Community Server version 5.0.9; Apache Spark 3.3.0 with Hadoop 2.7; Mongo Spark Connector 10.0.2).

The SparkR package is already available in the [Spark bundle](https://spark.apache.org/downloads.html) and can be [manually installed](https://stackoverflow.com/a/31185202/8279585); check the [SparkR docs](https://spark.apache.org/docs/latest/sparkr.html#starting-up-from-rstudio) for info on starting it from RStudio. – samkart Jul 15 '22 at 07:56
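For reference, a minimal sketch of starting SparkR from RStudio as the comment describes, assuming Spark was unpacked under ~/spark/spark-3.3.0-bin-hadoop2 (the path is only an example):

# point R at the local Spark installation (path is an assumption)
Sys.setenv(SPARK_HOME = "~/spark/spark-3.3.0-bin-hadoop2")

# load the SparkR package bundled under SPARK_HOME/R/lib
library(SparkR, lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib")))

# start a local SparkR session
sparkR.session(master = "local[*]", appName = "Analysis")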

1 Answer


Problem solved: the same connection used with PySpark works in sparklyr with the following:

# loading packages
library(tidyverse)
library(sparklyr)

# configuring the connection 
conf <- spark_config()
conf$sparklyr.defaultPackages <- c("org.mongodb.spark:mongo-spark-connector_2.12:3.0.1")
conf$spark.mongodb.input.uri <- "mongodb://127.0.0.1/safricadb.vacancy"

# establishing the connection
sc <- spark_connect(master = "local", config = conf)

# retrieving data
dataset <- spark_read_source(
  sc,
  'dataset',
  source = "com.mongodb.spark.sql.DefaultSource",
  memory = FALSE
)
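Once connected, the MongoDB-backed table can be queried lazily with dplyr verbs; a short usage sketch (the field name `title` is only an assumed field in the vacancy documents):

# count the documents in the collection (computed in Spark, not in R)
dataset %>% count()

# filter on a field and bring a small sample into R (field name is hypothetical)
dataset %>%
  filter(!is.na(title)) %>%
  head(10) %>%
  collect()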