If I do the basic groupByKey operation on a JavaPairRDD<String, String> (built from my starting JavaRDD<Tuple2<String, String>>), I get a JavaPairRDD<String, Iterable<String>>:
someStartRdd.groupByKey()
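For reference, since groupByKey is only defined on pair RDDs, this is roughly how I build someStartRdd (originalRdd is a placeholder name for my starting JavaRDD<Tuple2<String, String>>):

import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import scala.Tuple2;

// originalRdd is a JavaRDD<Tuple2<String, String>> built elsewhere;
// wrapping it as a pair RDD makes groupByKey available
JavaPairRDD<String, String> someStartRdd = JavaPairRDD.fromJavaRDD(originalRdd);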
Because the size of the iterable in every tuple is going to be quite big (millions of elements) and the number of keys is going to be big too, I'd like to process each iterable in a streaming, parallel fashion, the way I would process an RDD. Ideally I'd like one RDD per key.
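By streaming I mean iterating the values lazily where they live; the nearest I can get today is something like the sketch below, which streams each iterable on the executor but gives me no parallelism within a single key (process(...) is a placeholder for my real per-value work):

someStartRdd.groupByKey().foreach(t -> {
    // the iterable is consumed lazily on the executor, but a single
    // task handles the whole iterable for a key: no parallelism per key
    for (String value : t._2()) {
        process(value); // placeholder for the real per-value work
    }
});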
For the moment, the only thing I can think of is to collect, build a list per key, and then parallelize each one:
List<Tuple2<String, Iterable<String>>> r1 = someStartRdd.groupByKey().collect();
for (Tuple2<String, Iterable<String>> tuple : r1) {
    // materialize the key's Iterable into a List so it can be parallelized
    List<String> listForKey = MagicLibrary.iterableToList(tuple._2());
    JavaRDD<String> listRDD = sparkContext.parallelize(listForKey);
    // ...start job on listRDD...
}
but I don't want to pull everything into driver memory just to build those lists. Is there a better solution?
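To make the one-RDD-per-key idea concrete, the closest sketch I have that avoids collecting the values is to filter the pair RDD once per key, as below; but this rescans the whole dataset for every key, so I'm hoping there is something smarter:

// collect only the distinct keys to the driver, never the values
List<String> keys = someStartRdd.keys().distinct().collect();

for (String key : keys) {
    // a lazily defined RDD holding just this key's values; nothing is
    // materialized on the driver, but each job rescans the full dataset
    JavaRDD<String> perKeyRDD = someStartRdd
            .filter(t -> t._1().equals(key))
            .values();
    // ...start job on perKeyRDD...
}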