I have an RDD whose elements are dictionaries. The value in each dictionary is a list of 4 elements, for example [1,2,3, No] or [3,5,7, Yes]. I want to filter out all the elements that have No and save them in one text file, and save all those with Yes in another. The RDD involves a lot of processing to get to this yes/no classification. If I use rdd.filter().saveAsTextFile() twice, will it take twice the time? How can I do it optimally?

Ravi Ranjan

2 Answers

Simply cache your RDD before applying the yes/no filter and save.

To my knowledge, it is not yet possible to fork an RDD into multiple RDDs in one pass.

An idea came to mind: you could use mapPartitions and, for each partition, filter twice to get two simple arrays and manually save them to two files. Obviously, these filenames would need to be unique, so you could generate a GUID at the start of mapPartitions or use mapPartitionsWithIndex.
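
A rough sketch of the per-partition idea, in case it helps. It assumes each record is a single-entry dict whose list ends in the string "Yes" or "No", and it routes records in one loop rather than filtering twice. The function name `split_partition` and the file naming are made up for illustration. Note that the files are written by the executors, so `out_dir` has to be a location every executor can write to (for example a shared mount), which is also an assumption.

```python
import os


def split_partition(index, records, out_dir):
    """Write one partition's records into a 'yes' file and a 'no' file.

    Assumes each record is a single-entry dict whose value is a list
    ending in the label "Yes" or "No" (a guess at the question's layout).
    """
    # The partition index keeps file names unique across partitions.
    yes_path = os.path.join(out_dir, "yes-part-%05d.txt" % index)
    no_path = os.path.join(out_dir, "no-part-%05d.txt" % index)
    yes_count = 0
    no_count = 0
    with open(yes_path, "w") as yes_file, open(no_path, "w") as no_file:
        for record in records:
            label = next(iter(record.values()))[-1]
            if label == "Yes":
                yes_file.write(str(record) + "\n")
                yes_count += 1
            else:
                no_file.write(str(record) + "\n")
                no_count += 1
    # mapPartitionsWithIndex expects an iterable back, so yield a summary.
    yield (index, yes_count, no_count)


# Possible use with Spark (the path is hypothetical); collect() is the
# action that actually triggers the writes:
# summary = rdd.mapPartitionsWithIndex(
#     lambda i, it: split_partition(i, it, "/shared/output")).collect()
```

Because the function is a generator, nothing is written until Spark (or a plain `list(...)` call) consumes it. A retried task would simply overwrite its own two files, since the names depend only on the partition index.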

Michel Lemay

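Calling cache() on the RDD before the filtering keeps the result of all the preceding transformations once the first action has computed it. The second filter then reads the cached data instead of recomputing everything, so it won't take twice as long, only slightly longer (the time needed to store and read the cached data plus the second filtering pass).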
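
A minimal sketch of the cache-then-filter-twice approach, assuming a PySpark RDD whose records are single-entry dicts with a list ending in the string "Yes" or "No". The function name `save_yes_and_no` and both output paths are placeholders, not anything from the question.

```python
def save_yes_and_no(rdd, yes_path, no_path):
    """Cache the classified RDD once, then write the two subsets."""

    def label(record):
        # Assumed layout: {"some_key": [x, y, z, "Yes" or "No"]}
        return next(iter(record.values()))[-1]

    # cache() only marks the RDD for caching; the first action below
    # computes it and stores the partitions for reuse.
    classified = rdd.cache()
    try:
        classified.filter(lambda r: label(r) == "Yes").saveAsTextFile(yes_path)
        # The second job reads the cached partitions instead of redoing
        # the expensive processing.
        classified.filter(lambda r: label(r) == "No").saveAsTextFile(no_path)
    finally:
        # Release the cached data once both outputs are written.
        classified.unpersist()


# Possible use (paths are hypothetical):
# save_yes_and_no(processed_rdd, "hdfs:///out/yes", "hdfs:///out/no")
```

Keep in mind that saveAsTextFile writes a directory of part files rather than a single file.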

Shaido