I have a large RDD that I need to split into two sets. The first set I need to store to disk; the second set I need to process further and then store to disk. Right now I'm achieving that with two filters. I thought about using a group by, but I'm not sure how that would help (in fact it may even cause a lot of unnecessary shuffling).
Let's assume I have a `JavaPairRDD<Boolean, SomeClass> rdd`, that records with a `true` key compose 99% of `rdd`, and furthermore that `rdd` is large.

- I need to store the records that have a `false` key on disk.
- I need to do further processing on the records that have a `true` key.
I could do the following (pseudo-code):
// Two full passes over rdd, one per filter
JavaPairRDD<Boolean, SomeClass> truthy = rdd.filter(selectTrueKeys());
JavaPairRDD<Boolean, SomeClass> falsy = rdd.filter(selectFalseKeys());

// Store the false-keyed records as-is
falsy
    .map(prepareForStorage())
    .saveAsNewAPIHadoopDataset(job.getConfiguration());

// Transform, aggregate, and store the true-keyed records
truthy
    .map(mapToSomethingElse())
    .combineByKey(create, combine, merge)
    .map(prepareForStorage())
    .saveAsNewAPIHadoopDataset(job.getConfiguration());
It seems wasteful to filter the entire set twice. I also thought about grouping the `rdd` by key, but I don't think that will help either.
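To make concrete what I mean by a one-pass split: in plain Java (outside Spark), `Collectors.partitioningBy` traverses a collection once and yields exactly the two-key `Map<Boolean, List<T>>` shape I'm after. This is only a local illustration with made-up sample data, not a Spark operation; RDDs have no built-in equivalent that returns two RDDs from one scan.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class PartitionOnce {
    public static void main(String[] args) {
        // Sample data standing in for the RDD's records (hypothetical)
        List<Integer> records = IntStream.rangeClosed(1, 10)
                .boxed()
                .collect(Collectors.toList());

        // One traversal produces a Map with exactly two keys: true and false
        Map<Boolean, List<Integer>> split = records.stream()
                .collect(Collectors.partitioningBy(n -> n % 2 == 0));

        List<Integer> truthy = split.get(true);   // records to process further
        List<Integer> falsy  = split.get(false);  // records to store as-is

        System.out.println(truthy); // prints [2, 4, 6, 8, 10]
        System.out.println(falsy);  // prints [1, 3, 5, 7, 9]
    }
}
```

In Spark, the closest I can see to this is calling `rdd.cache()` (or `persist`) before the two filters, so the second filter reads the materialized partitions instead of recomputing the whole lineage; the data is still scanned twice, but only from memory.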