I'm working on a project where I need to load a large Amazon product dataset (126 GB when decompressed) into MongoDB using Apache Spark. The dataset is in line-delimited JSON format (one JSON object per line).
How can I optimize the data loading process while considering the schema of the dataset and the structure of the MongoDB collections?
import json
from pyspark.sql import SparkSession
# Set up the MongoDB connector
mongo_connector = "org.mongodb.spark:mongo-spark-connector_2.12:3.0.1"
# Initialize the Spark session
spark = SparkSession.builder \
    .appName("AmazonReviewsToMongoDB") \
    .config("spark.jars.packages", mongo_connector) \
    .config("spark.mongodb.input.uri", "mongodb://localhost:27017/amazon_reviews.reviews") \
    .config("spark.mongodb.output.uri", "mongodb://localhost:27017/amazon_reviews.reviews") \
    .getOrCreate()
# Read the compressed dataset
reviews_rdd = (
    spark.sparkContext.textFile(r"X:/reviews.json.gz")
    .map(json.loads)
)
# Convert the RDD of parsed dicts to a DataFrame (schema is inferred)
reviews_df = reviews_rdd.toDF()
# Save the DataFrame to MongoDB
reviews_df.write.format("mongo").mode("overwrite").save()
I don't think this code will run efficiently, i.e., read and load the data in the minimum amount of time.
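For comparison, here is a sketch of the more direct route I was considering: letting spark.read.json parse the gzipped line-delimited file itself instead of going through textFile and json.loads, and repartitioning before the write. The partition count of 64 is an arbitrary guess on my part, and I have not measured whether this is actually faster:

from pyspark.sql import SparkSession

mongo_connector = "org.mongodb.spark:mongo-spark-connector_2.12:3.0.1"

spark = SparkSession.builder \
    .appName("AmazonReviewsToMongoDB") \
    .config("spark.jars.packages", mongo_connector) \
    .config("spark.mongodb.output.uri", "mongodb://localhost:27017/amazon_reviews.reviews") \
    .getOrCreate()

# spark.read.json decompresses .gz transparently and parses line-delimited
# JSON natively, so the textFile + json.loads round trip is not needed.
# An explicit schema could be supplied via spark.read.schema(...) to skip
# the schema-inference pass over the file.
reviews_df = spark.read.json(r"X:/reviews.json.gz")

# A single .gz file is not splittable, so the read itself still runs in one
# task; repartitioning afterwards at least spreads the MongoDB inserts across
# executors. 64 partitions is a placeholder, not a tuned value.
reviews_df.repartition(64) \
    .write.format("mongo") \
    .mode("overwrite") \
    .save()

Is this the right direction, or is there a better way to structure the load for a dataset of this size?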