This is a continuation of this question: Porting a multi-threaded compute intensive job to spark
I am using `forEachPartition`, as suggested there, to loop over a list of 10,000 IDs, and then I do a `repartition(20)`, because every partition creates a DB connection and if I create, say, 100 partitions, the job just dies from 100 open connections to Postgres and Mongo. I use the Postgres connections not only to store data but also to look up data from another table.
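For context, the current setup is roughly the following. This is a minimal sketch with placeholder connection details and a hypothetical `processId` helper, not my actual code; the point is that one connection is opened per partition, so the partition count directly determines the number of open database connections.

```scala
import java.sql.DriverManager
import org.apache.spark.sql.SparkSession

object CurrentJob {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("id-processing").getOrCreate()

    // 10,000 IDs to process, repartitioned down to 20 to cap open connections
    val ids = spark.sparkContext.parallelize(1 to 10000).repartition(20)

    ids.foreachPartition { partition =>
      // One Postgres connection per partition; 100 partitions would mean 100 open connections
      val conn = DriverManager.getConnection(
        "jdbc:postgresql://host:5432/db", "user", "pass") // placeholder connection details
      try {
        partition.foreach { id =>
          // processId(id, conn) // hypothetical stand-in for the per-ID lookups and writes
        }
      } finally {
        conn.close()
      }
    }
  }
}
```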
I could stop writing the data to Postgres directly from my tasks and instead do that as a post-processing step from a sequence file.
But ideally I need to massively parallelize my Spark job so that the work completes within a given time: currently it processes about 200 IDs in 20 hours, but I need to process 10,000 IDs in 20 hours. So `repartition(20)` is clearly not enough; I am bound by database I/O here.
So what are my options for efficiently sharing this data across all tasks? I want the data in Mongo and Postgres to be treated as in-memory lookup tables; the total size is about 500 GB.
My options are:
- RDD (I don't think RDD fits my use case)
- DataFrame (see the sketch after this list)
- Broadcast variables (not sure this will work, as creating the broadcast would need 500 GB available on the Spark driver)
- Move the data from Mongo to S3 and have the tasks look it up from S3.
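To make the DataFrame option concrete, something like the following is what I have in mind. This is a minimal sketch assuming a hypothetical `lookup_table` with a numeric `id` column, plus placeholder connection and output settings: the Postgres table is read once through Spark's JDBC source and joined to the IDs, instead of being queried row by row from each task. The Mongo data would need the MongoDB Spark connector in the same way.

```scala
import org.apache.spark.sql.SparkSession

object DataFrameLookup {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("df-lookup").getOrCreate()

    // The 10,000 IDs to process, as a single-column DataFrame
    val ids = spark.range(1, 10001).toDF("id")

    // Read the Postgres lookup table in parallel via Spark's JDBC source
    val lookup = spark.read
      .format("jdbc")
      .option("url", "jdbc:postgresql://host:5432/db") // placeholder URL
      .option("dbtable", "lookup_table")               // hypothetical table name
      .option("user", "user")
      .option("password", "pass")
      .option("numPartitions", "20")                   // parallel JDBC reads
      .option("partitionColumn", "id")                 // assumes a numeric key column
      .option("lowerBound", "1")
      .option("upperBound", "10000")
      .load()

    // Join instead of issuing per-row queries from each task
    val joined = ids.join(lookup, Seq("id"), "left")
    joined.write.parquet("s3://bucket/output")         // placeholder output path
  }
}
```

At ~500 GB the lookup side is far too big to broadcast to executors, so this would rely on an ordinary shuffle join rather than `broadcast()`.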