
I have a large RDD that I need to split into two sets. The first set I need to store to disk; the second set I need to process further and then store to disk. Right now I'm achieving that with two filters. I thought about using a groupBy, but I'm not sure how that would help (in fact it may even cause a lot of unnecessary shuffling).


Let's assume I have the following JavaPairRDD<Boolean, SomeClass> rdd, and that records with a true key make up 99% of it. Furthermore, let's assume that the RDD is large.

  • I need to store the records that have a false key to disk.
  • I need to do further processing on the records that have a true key.

I could do the following (pseudo-code):

JavaPairRDD<Boolean, SomeClass> truthy = rdd.filter(selectTrueKeys());
JavaPairRDD<Boolean, SomeClass> falsy = rdd.filter(selectFalseKeys());

// Store the false-keyed records to disk.
falsy
    .map(prepareForStorage())
    .saveAsNewAPIHadoopDataset(job.getConfiguration());

// Aggregate the true-keyed records, then store them.
truthy
    .map(mapToSomethingElse())
    .combineByKey(create, combine, merge)
    .map(prepareForStorage())
    .saveAsNewAPIHadoopDataset(job.getConfiguration());

It seems wasteful to filter the entire set twice. I thought about grouping the rdd by key. But I don't think that will help either.
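For context on why two filters is the usual pattern: a Spark transformation returns exactly one RDD, so there is no built-in single-pass "split into two RDDs" primitive. Over a local collection, plain Java can do the one-pass split with `Collectors.partitioningBy` — this is only a sketch of the idea using hypothetical sample data, not Spark API:

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class PartitionSketch {
    public static void main(String[] args) {
        // Hypothetical stand-in for the Boolean keys of the (Boolean, SomeClass) pairs.
        List<Boolean> keys = List.of(true, true, false, true, false);

        // One pass over the data: partitioningBy yields a Map with exactly
        // two entries, keyed true and false.
        Map<Boolean, List<Boolean>> split = keys.stream()
                .collect(Collectors.partitioningBy(k -> k));

        System.out.println(split.get(true).size());   // records to process further
        System.out.println(split.get(false).size());  // records to store
    }
}
```

On an actual RDD, the common way to avoid scanning the source twice is to call `rdd.cache()` (or `persist(...)`) before applying the two filters, so each filter reads the cached data instead of recomputing the lineage from the source.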

zero323
hba
  • Possible duplicate of [How to split a RDD into two or more RDDs?](http://stackoverflow.com/questions/32970709/how-to-split-a-rdd-into-two-or-more-rdds) – zero323 May 12 '16 at 10:12

0 Answers