
Problem statement: Transfer data from MongoDB to Spark optimally, with minimal latency.

Problem Description:

I have my data stored in MongoDB and want to process it (on the order of ~100-500 GB) using Apache Spark.

I used the MongoDB Spark connector and was able to read/write data from/to MongoDB (https://docs.mongodb.com/spark-connector/master/python-api/).
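For reference, this is roughly what my read/write code looks like (a minimal sketch; the URIs and the database/collection names `mydb.mycoll` / `mydb.outcoll` are placeholders for my actual setup, and the mongo-spark-connector package has to be on the classpath, e.g. via `--packages`):

```python
from pyspark.sql import SparkSession

# Placeholder URIs and database/collection names.
spark = (SparkSession.builder
         .appName("mongo-spark-read")
         .config("spark.mongodb.input.uri", "mongodb://127.0.0.1/mydb.mycoll")
         .config("spark.mongodb.output.uri", "mongodb://127.0.0.1/mydb.outcoll")
         .getOrCreate())

# Read the whole collection into a DataFrame (schema inferred by sampling).
df = spark.read.format("com.mongodb.spark.sql.DefaultSource").load()
df.printSchema()

# Write results back to the output collection.
df.write.format("com.mongodb.spark.sql.DefaultSource").mode("append").save()
```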

The problem is that the Spark DataFrame has to be created from MongoDB on the fly every time a job runs.

Is there a good solution for handling such large data transfers?

I looked into:

  1. Spark Streaming API
  2. Apache Kafka
  3. Amazon S3 and EMR

But I couldn't decide whether any of these is the optimal approach. What strategy would you recommend for transferring data at this scale?

Would keeping the data on the Spark cluster and syncing only the deltas (changes in the database) to that local copy be the way to go, or is reading from MongoDB each time the only (or the optimal) way to go about it?
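To make the first option concrete, here is a rough sketch of what I mean by keeping a local copy (the Parquet path and the `status` column are placeholders; `spark` is the session configured as in the sketch above):

```python
# One-time (or periodic) snapshot: pull the collection through the connector
# and persist it as Parquet on the cluster's storage.
mongo_df = spark.read.format("com.mongodb.spark.sql.DefaultSource").load()
mongo_df.write.mode("overwrite").parquet("hdfs:///snapshots/mycoll")

# Subsequent jobs read the columnar snapshot instead of hitting MongoDB again.
snapshot_df = spark.read.parquet("hdfs:///snapshots/mycoll")
snapshot_df.groupBy("status").count().show()
```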

EDIT 1:

The following suggests reading the data directly from MongoDB, since secondary indexes make data retrieval faster: https://www.mongodb.com/blog/post/tutorial-for-operationalizing-spark-with-mongodb
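As I understand it, the point is that filters applied to the DataFrame are pushed down to MongoDB as an aggregation pipeline, so secondary indexes can be used server-side and only matching documents are sent to Spark. A minimal sketch (the `status` field is a placeholder; `spark` is the session from the first sketch):

```python
# The connector translates DataFrame filters into a MongoDB aggregation
# pipeline, so the filtering happens in MongoDB before data reaches Spark.
df = spark.read.format("com.mongodb.spark.sql.DefaultSource").load()
active_df = df.filter(df["status"] == "active")
print(active_df.count())
```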

EDIT 2:

The advantages of using the Parquet format: What are the pros and cons of the Parquet format compared to other formats?

