Creating multiple RDDs out of one RDD

Question

I have one big flat file containing three types of data ("M", "C", "Q"). I create a single RDD from it using

val inputRDD=sc.textFile("/user/train");

I then filter the data by applying three transformations to inputRDD.

val metaRDD=inputRDD.filter(line=>line.contains("M"));
val clickRDD=inputRDD.filter(line=>line.contains("C"));
val QueryRDD=inputRDD.filter(line=>line.contains("Q"));

This reads the entire file three times when the three RDDs are used in actions. Is there a way to get three RDDs by applying a single transformation to inputRDD, so that the file is read only once?

I am aware that the file will be read only once if we persist the dataset, but the file is too large to be persisted.

I don't think there is any way to achieve this without persisting the RDD. — Pankaj Arora, Feb 22 '16 at 05:30
If the dataset is too large to stay in memory, you can persist it with .persist(StorageLevel.MEMORY_AND_DISK_SER), which keeps on disk the part of the RDD that doesn't fit in memory. — drstein, Feb 22 '16 at 08:57
Why not combine filter predicates? You can use something like `val combined = inputRDD.filter(line => (line.contains("M") || line.contains("C") || line.contains("Q")))`. — mehmetminanc, Feb 22 '16 at 11:42
@mehmetminanc The input dataset contains only three types of data, so your line of code will only create a copy of inputRDD. I am actually looking for a combined filter that creates three RDDs in one read. Is that possible? — Vinay Kumar, Feb 22 '16 at 14:36
You have a point. Not that I know of, no. But I wonder: if the original data set is too large to be persisted in memory, wouldn't the partitioned data sets be too large as well? — mehmetminanc, Feb 22 '16 at 15:02
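
For reference, the persist-based approach suggested in the comments could look roughly like the sketch below. It is only an illustration: the object name SplitByType, the application name, and the count() actions are made up for the example, and it assumes the job is launched with spark-submit so that the master is supplied externally. It also assumes the cluster has enough local disk to hold the partitions that do not fit in memory.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

// Hypothetical driver object; the name is not from the question.
object SplitByType {
  def main(args: Array[String]): Unit = {
    // Assumption: the master URL is provided by spark-submit.
    val conf = new SparkConf().setAppName("SplitByType")
    val sc = new SparkContext(conf)

    val inputRDD = sc.textFile("/user/train")

    // Keep serialized partitions in memory and spill the rest to local disk,
    // so later actions can reuse them instead of re-reading the source file.
    inputRDD.persist(StorageLevel.MEMORY_AND_DISK_SER)

    // Same three filters as in the question.
    val metaRDD  = inputRDD.filter(line => line.contains("M"))
    val clickRDD = inputRDD.filter(line => line.contains("C"))
    val queryRDD = inputRDD.filter(line => line.contains("Q"))

    // Illustrative actions: the first one reads the file and fills the cache,
    // the following ones should be served from the persisted partitions.
    val metaCount  = metaRDD.count()
    val clickCount = clickRDD.count()
    val queryCount = queryRDD.count()
    println(s"M=$metaCount C=$clickCount Q=$queryCount")

    // Release the cached partitions once the derived RDDs are no longer needed.
    inputRDD.unpersist()
    sc.stop()
  }
}
```

This still produces three RDDs from three filter transformations. What changes is that the input file itself should only be scanned during the first action, provided the persisted partitions are not lost or evicted in between.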
