I'm working on a project where I need to load a large Amazon product dataset (126 GB when decompressed) into MongoDB using Apache Spark. The dataset is in line-delimited JSON format (one JSON object per line).
How can I optimize the data loading process while considering the schema of the dataset and the structure of the MongoDB collections?
import json
from pyspark.sql import SparkSession
# Set up the MongoDB connector
mongo_connector = "org.mongodb.spark:mongo-spark-connector_2.12:3.0.1"
# Initialize the Spark session
spark = SparkSession.builder \
    .appName("AmazonReviewsToMongoDB") \
    .config("spark.jars.packages", mongo_connector) \
    .config("spark.mongodb.input.uri", "mongodb://localhost:27017/amazon_reviews.reviews") \
    .config("spark.mongodb.output.uri", "mongodb://localhost:27017/amazon_reviews.reviews") \
    .getOrCreate()
# Read the compressed dataset
reviews_rdd = (
    spark.sparkContext.textFile(r"X:/reviews.json.gz")
    .map(json.loads)
)
# Convert the RDD of parsed dicts to a DataFrame (schema is inferred)
reviews_df = reviews_rdd.toDF()
# Save the DataFrame to MongoDB
reviews_df.write.format("mongo").mode("overwrite").save()
I don't think this code will run efficiently, i.e., read and load the data in the minimum amount of time.
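For comparison, here is a sketch of the more direct route I was considering: letting spark.read.json parse the gzipped line-delimited file itself instead of going through textFile and json.loads, and repartitioning before the write. The partition count of 64 is an arbitrary guess on my part, and I have not measured whether this is actually faster:

from pyspark.sql import SparkSession

mongo_connector = "org.mongodb.spark:mongo-spark-connector_2.12:3.0.1"

spark = SparkSession.builder \
    .appName("AmazonReviewsToMongoDB") \
    .config("spark.jars.packages", mongo_connector) \
    .config("spark.mongodb.output.uri", "mongodb://localhost:27017/amazon_reviews.reviews") \
    .getOrCreate()

# spark.read.json decompresses .gz transparently and parses line-delimited
# JSON natively, so the textFile + json.loads round trip is not needed.
# An explicit schema could be supplied via spark.read.schema(...) to skip
# the schema-inference pass over the file.
reviews_df = spark.read.json(r"X:/reviews.json.gz")

# A single .gz file is not splittable, so the read itself still runs in one
# task; repartitioning afterwards at least spreads the MongoDB inserts across
# executors. 64 partitions is a placeholder, not a tuned value.
reviews_df.repartition(64) \
    .write.format("mongo") \
    .mode("overwrite") \
    .save()

Is this the right direction, or is there a better way to structure the load for a dataset of this size?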