I have been searching for a solution for a long time but haven't found a correct algorithm.
Using Spark RDDs in Scala, how can I transform an RDD[(Key, Value)] into a Map[Key, RDD[Value]], given that I can't use collect or other methods that may load the data into memory?
In fact, my final goal is to loop over the Map[Key, RDD[Value]] by key and call saveAsNewAPIHadoopFile for each RDD[Value].
For example, if I have:
RDD[("A", 1), ("A", 2), ("A", 3), ("B", 4), ("B", 5), ("C", 6)]
I'd like:
Map[("A" -> RDD[1, 2, 3]), ("B" -> RDD[4, 5]), ("C" -> RDD[6])]
I wonder whether building it this way would cost too much: calling filter as many times as there are distinct keys means scanning the RDD[(Key, Value)] once per key (A, B, C), which of course is not efficient on its own, but maybe cache would help? The write loop I have in mind is sketched below.
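
Again only a sketch: the Writable types, the TextOutputFormat choice, and the output path are hypothetical placeholders for my real output format:

```scala
import org.apache.hadoop.io.{IntWritable, NullWritable}
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat

// Cache the source RDD so each per-key filter doesn't recompute it from scratch
pairs.cache()

for ((key, values) <- byKey) {
  values
    // saveAsNewAPIHadoopFile works on key/value pairs, so wrap each value
    .map(v => (NullWritable.get(), new IntWritable(v)))
    .saveAsNewAPIHadoopFile[TextOutputFormat[NullWritable, IntWritable]](
      s"/some/output/base/$key") // hypothetical path, one directory per key
}
```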
Thank you