Saving different parts of an RDD to different text files optimally

Question

I have an RDD whose elements are dictionaries. The value in each dictionary is a list of four elements, for example [1, 2, 3, No] or [3, 5, 7, Yes]. I want to filter out all the elements that have No and save them in one text file, and save all those with Yes in another. The RDD involves a lot of processing to get to this Yes/No classification. If I use two rdd.filter().saveAsTextFile calls, will it take twice as long? How can I do this optimally?

Michel Lemay · Answer 1 · 2017-08-22T13:57:39.783

Simply cache your RDD before applying the yes/no filter and save.
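A minimal sketch of this cache-then-filter-twice approach in PySpark. It assumes each record is a one-key dictionary whose value is a four-element list ending in the string "Yes" or "No". The helper names `label_of` and `save_yes_no` and the output paths are hypothetical.

```python
def label_of(record):
    """Return the 4th list element of a record.

    Assumption: each record is a one-key dict whose value is a
    four-element list, e.g. {"a": [1, 2, 3, "No"]}.
    """
    (values,) = record.values()
    return values[3]


def save_yes_no(rdd, yes_path, no_path):
    """Save the "Yes" and "No" records to two separate output directories."""
    # Mark the RDD for caching. The expensive lineage runs during the first
    # action below; the second action reads the cached partitions instead
    # of recomputing them (as long as they still fit in memory).
    cached = rdd.cache()
    cached.filter(lambda r: label_of(r) == "Yes").saveAsTextFile(yes_path)
    cached.filter(lambda r: label_of(r) == "No").saveAsTextFile(no_path)
    # Release the cached partitions once both outputs are written.
    cached.unpersist()
```

If the RDD may not fit in memory, `persist(StorageLevel.MEMORY_AND_DISK)` could be used in place of `cache()`, so that partitions spill to disk rather than being recomputed.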

To my knowledge, it is not yet possible to fork an RDD into multiple RDDs in one pass.

An idea came to mind. You could use mapPartitions and, for each partition, filter twice to get two simple arrays and manually save them to two files. Obviously, these filenames would need to be unique, so you could generate a GUID at the start of mapPartitions or use mapPartitionsWithIndex.
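The mapPartitionsWithIndex idea could look roughly like the sketch below. It is a variation on what is described above: instead of filtering twice, it makes a single pass per partition and routes each record to one of two files, with the partition index keeping the filenames unique. The names `write_partition` and `save_split`, the `yes/` and `no/` layout, and the record shape (a one-key dict with a four-element list ending in "Yes" or "No") are assumptions. It writes with plain Python file I/O, so `out_dir` would have to be a path every executor can write to, such as a shared mount.

```python
import os


def write_partition(index, records, out_dir):
    """Write one partition's records into yes/part-N and no/part-N files."""
    yes_dir = os.path.join(out_dir, "yes")
    no_dir = os.path.join(out_dir, "no")
    os.makedirs(yes_dir, exist_ok=True)
    os.makedirs(no_dir, exist_ok=True)
    # The partition index makes the filename unique across partitions.
    name = "part-%05d" % index
    counts = {"Yes": 0, "No": 0}
    with open(os.path.join(yes_dir, name), "w") as yes_file, \
            open(os.path.join(no_dir, name), "w") as no_file:
        for record in records:
            (values,) = record.values()
            label = values[3]
            if label == "Yes":
                yes_file.write(str(record) + "\n")
                counts["Yes"] += 1
            elif label == "No":
                no_file.write(str(record) + "\n")
                counts["No"] += 1
            # Records with any other label are skipped.
    # Yield a small summary so the transformation returns an iterable.
    yield (index, counts["Yes"], counts["No"])


def save_split(rdd, out_dir):
    """Run the single-pass split; collect() is the action that triggers it."""
    return rdd.mapPartitionsWithIndex(
        lambda i, it: write_partition(i, it, out_dir)
    ).collect()
```

Because the files are written as a side effect of a task, a retried or speculatively executed task would write the same partition file again, so this is best treated as an illustration rather than a robust replacement for saveAsTextFile.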

score 0 · Answer 2 · answered Aug 22 '17 at 13:54

If you call cache() on the RDD before the filtering, the result of all the preceding transformations is kept once it has been computed. Hence it won't take twice as long, only slightly longer (the time needed to store and load the cached data, plus the second filtering).

Shaido
