
I'm working on a project where I need to load a large Amazon product dataset (126 GB when decompressed) into MongoDB using Apache Spark. The dataset is in line-delimited JSON format.

How can I optimize the data loading process while considering the schema of the dataset and the structure of the MongoDB collections?

import json
from pyspark.sql import SparkSession

# Set up the MongoDB connector
mongo_connector = "org.mongodb.spark:mongo-spark-connector_2.12:3.0.1"

# Initialize the Spark session
spark = SparkSession.builder \
    .appName("AmazonReviewsToMongoDB") \
    .config("spark.jars.packages", mongo_connector) \
    .config("spark.mongodb.input.uri", "mongodb://localhost:27017/amazon_reviews.reviews") \
    .config("spark.mongodb.output.uri", "mongodb://localhost:27017/amazon_reviews.reviews") \
    .getOrCreate()

# Read the compressed dataset
reviews_rdd = (
    spark.sparkContext.textFile(r"X:/reviews.json.gz")
    .map(json.loads)
)

# Convert the RDD to a DataFrame
reviews_df = reviews_rdd.toDF()

# Save the DataFrame to MongoDB
reviews_df.write.format("mongo").mode("overwrite").save()

I don't think this code will work efficiently, i.e., read the data in minimum time.

  • I've had great success using [`mongoimport`](https://www.mongodb.com/docs/database-tools/mongoimport/). Perhaps it will work for you too. – rickhg12hs Apr 21 '23 at 01:25
  • The file is in json.gz format and gzip is not working with my mongoimport; also, I cannot decompress the file because of its huge size (126 GB after decompressing). – Hammad Javaid Apr 21 '23 at 11:56
  • You can pipe the output of the decompressor into `mongoimport`. – rickhg12hs Apr 21 '23 at 13:34
  • I have to load the dataset using Apache Spark (because of a uni project), and as of right now I'm not able to get anywhere (so mongoimport is out of the question). – Hammad Javaid Apr 21 '23 at 18:20

1 Answer


You can read it directly as a JSON file; there is no need to read it as a text-file RDD and then do the conversion. Have you tried the command below?

df = spark.read.json("path of the file")
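
For a file this size it can also help to supply an explicit schema, so Spark skips its schema-inference scan of the data. Below is a minimal sketch of that idea; the field names are assumptions based on the usual Amazon reviews layout and should be adjusted to the keys actually present in your JSON lines.

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, LongType

spark = SparkSession.builder.getOrCreate()  # reuses the session configured in the question

# Assumed field names -- replace them with the keys in your dataset
review_schema = StructType([
    StructField("reviewerID", StringType()),
    StructField("asin", StringType()),
    StructField("overall", DoubleType()),
    StructField("reviewText", StringType()),
    StructField("summary", StringType()),
    StructField("unixReviewTime", LongType()),
])

# With an explicit schema, Spark does not need an inference pass over the file.
# A single .json.gz file is not splittable, so it is read by one task;
# repartition afterwards if downstream steps need parallelism.
reviews_df = spark.read.schema(review_schema).json("X:/reviews.json.gz")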
  • Yes, it reads efficiently, but when I'm writing the resulting df to MongoDB it ends up taking all the space on C: and then MongoDB crashes (the script too). – Hammad Javaid Apr 28 '23 at 20:14
  • I wrote the data into MongoDB; it is huge (around 43 GB). Now I want to read it and convert it into a pandas df. What is the most efficient way? – Hammad Javaid May 01 '23 at 08:01
  • I have not used pandas DFs. Check out the Stack Overflow answer below; it might help: https://stackoverflow.com/questions/61800463/read-large-json-file-with-index-format-into-pandas-dataframe/61808370#61808370 – Meena Arumugam May 03 '23 at 16:51
  • @HammadJavaid, was your issue resolved? – Meena Arumugam May 11 '23 at 13:37
  • Yes. First I defined the schema, then I reduced the dataset size, repartitioned it, and used the following while writing to MongoDB: spark.mongodb.output.createCollectionOptions, spark.mongodb.output.batchSize, spark.mongodb.output.bulk.ordered (a sketch of that setup follows below). – Hammad Javaid May 12 '23 at 14:50
  • Oh OK, good to know how you resolved it. – Meena Arumugam May 12 '23 at 15:21
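
For reference, here is a rough sketch of the write setup described in the last comment. The spark.mongodb.output.* keys are taken from that comment, the values and partition count are placeholders, and the exact option names should be verified against the MongoDB Spark Connector 3.0.x documentation.

from pyspark.sql import SparkSession

# Option keys below are the ones reported in the comment above; values are placeholders.
spark = (
    SparkSession.builder
    .appName("AmazonReviewsToMongoDB")
    .config("spark.jars.packages", "org.mongodb.spark:mongo-spark-connector_2.12:3.0.1")
    .config("spark.mongodb.output.uri", "mongodb://localhost:27017/amazon_reviews.reviews")
    .config("spark.mongodb.output.batchSize", "1024")
    .config("spark.mongodb.output.bulk.ordered", "false")
    # The comment also mentions spark.mongodb.output.createCollectionOptions;
    # check the connector documentation for the value it expects.
    .getOrCreate()
)

# Repartition so the write is spread across many smaller tasks instead of one
# large one (a single .json.gz input arrives as a single partition).
reviews_df = spark.read.json("X:/reviews.json.gz").repartition(200)

# Append rather than overwrite, so a failed run does not drop data already loaded.
reviews_df.write.format("mongo").mode("append").save()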