I have a large RDD that I need to split into two sets. The first set I need to store to disk; the second set I need to process further and then store to disk. Right now I'm achieving that with two filters. I thought about using a group by, but I'm not sure how that would help (in fact it may even cause a lot of unnecessary shuffling).
Let's assume I have a `JavaPairRDD<Boolean, SomeClass> rdd`, that records with a `true` key compose 99% of `rdd`, and furthermore that `rdd` is large.

- I need to store the records that have a `false` key on disk.
- I need to do further processing on the records that have a `true` key.
I could do the following (pseudo-code):
// Two full passes over rdd, one per filter
JavaPairRDD<Boolean, SomeClass> truthy = rdd.filter(selectTrueKeys());
JavaPairRDD<Boolean, SomeClass> falsy = rdd.filter(selectFalseKeys());

// Store the false-keyed records as-is
falsy
    .map(prepareForStorage())
    .saveAsNewAPIHadoopDataset(job.getConfiguration());

// Transform, aggregate, and store the true-keyed records
truthy
    .map(mapToSomethingElse())
    .combineByKey(create, combine, merge)
    .map(prepareForStorage())
    .saveAsNewAPIHadoopDataset(job.getConfiguration());
It seems wasteful to filter the entire set twice. I also thought about grouping the `rdd` by key, but I don't think that will help either.
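To make concrete what I mean by a one-pass split: in plain Java (outside Spark), `Collectors.partitioningBy` traverses a collection once and yields exactly the two-key `Map<Boolean, List<T>>` shape I'm after. This is only a local illustration with made-up sample data, not a Spark operation; RDDs have no built-in equivalent that returns two RDDs from one scan.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class PartitionOnce {
    public static void main(String[] args) {
        // Sample data standing in for the RDD's records (hypothetical)
        List<Integer> records = IntStream.rangeClosed(1, 10)
                .boxed()
                .collect(Collectors.toList());

        // One traversal produces a Map with exactly two keys: true and false
        Map<Boolean, List<Integer>> split = records.stream()
                .collect(Collectors.partitioningBy(n -> n % 2 == 0));

        List<Integer> truthy = split.get(true);   // records to process further
        List<Integer> falsy  = split.get(false);  // records to store as-is

        System.out.println(truthy); // prints [2, 4, 6, 8, 10]
        System.out.println(falsy);  // prints [1, 3, 5, 7, 9]
    }
}
```

In Spark, the closest I can see to this is calling `rdd.cache()` (or `persist`) before the two filters, so the second filter reads the materialized partitions instead of recomputing the whole lineage; the data is still scanned twice, but only from memory.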