I have one big flat file containing three types of data ("M", "C", "Q"). I am creating one RDD from it using

val inputRDD=sc.textFile("/user/train");

I am filtering the data by applying three transformations to inputRDD.

val metaRDD=inputRDD.filter(line=>line.contains("M"));
val clickRDD=inputRDD.filter(line=>line.contains("C"));
val QueryRDD=inputRDD.filter(line=>line.contains("Q"));

This will read the entire file three times when the three RDDs are used in an action. Is there a way to get three RDDs by applying one transformation to inputRDD, so that the file is read only once?

I am aware that the file will be read only once if we persist the dataset, but the file is too large to be persisted.

Vinay Kumar
  • I don't think there is any way to achieve this without persisting the RDD. – Pankaj Arora Feb 22 '16 at 05:30
  • If the dataset is too large to stay in memory, you can persist it with .persist(StorageLevel.MEMORY_AND_DISK_SER), which will keep on disk the part of the RDD that doesn't fit in memory. – drstein Feb 22 '16 at 08:57
  • Why not combine filter predicates? You can use something like `val combined = inputRDD.filter(line => (line.contains("M") || line.contains("C") || line.contains("Q")))`. – mehmetminanc Feb 22 '16 at 11:42
  • @mehmetminanc The input dataset contains only three types of data, so your line of code will only create a copy of inputRDD. I am actually looking for a combined filter that creates three RDDs in one read. Is that possible? – Vinay Kumar Feb 22 '16 at 14:36
  • You have a point. Not that I know of, no. But I wonder: if the original dataset is too large to be persisted in memory, wouldn't the partitioned datasets be too large as well? – mehmetminanc Feb 22 '16 at 15:02
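
For reference, here is a minimal sketch of what drstein's comment describes, assuming the same path and filters as in the question. The names `PersistOnceSketch` and `splitByType` are made up for illustration, and the local master is only a fallback assumption for running outside a cluster. With `MEMORY_AND_DISK_SER`, partitions that do not fit in memory are written to local disk, so the first action should read the file and later actions should reuse the persisted partitions rather than going back to the source.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

// Hypothetical object and helper names, used only for this sketch.
object PersistOnceSketch {

  // The same three filters as in the question, applied to one shared input RDD.
  def splitByType(input: RDD[String]): (RDD[String], RDD[String], RDD[String]) =
    (input.filter(_.contains("M")),
     input.filter(_.contains("C")),
     input.filter(_.contains("Q")))

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("persist-once-sketch")
      // Assumption: fall back to local mode if no master was supplied.
      .setIfMissing("spark.master", "local[*]")
    val sc = new SparkContext(conf)

    // Serialized storage; partitions that do not fit in memory spill to disk.
    val inputRDD = sc.textFile("/user/train")
      .persist(StorageLevel.MEMORY_AND_DISK_SER)

    val (metaRDD, clickRDD, queryRDD) = splitByType(inputRDD)

    // The first action computes inputRDD and persists its partitions;
    // the next two actions can read those partitions instead of the file.
    println(s"meta:  ${metaRDD.count()}")
    println(s"click: ${clickRDD.count()}")
    println(s"query: ${queryRDD.count()}")

    inputRDD.unpersist()
    sc.stop()
  }
}
```

This still produces three RDDs from three filters; it only avoids re-reading the source file. Whether the disk spill is acceptable depends on the local disk space available on the executors.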

0 Answers