
I'm using Spark to process the following data:

A121
Bword
A342
Bhello

That is, a line that starts with 'A' contains a number; otherwise it contains a word.

Now I need to output the lines that contain a word, and then calculate the maximum of the values on the number lines.

That is:

myrdd.filter(_.startsWith("B")).saveAsTextFile("mytxt")
val max_value = myrdd.filter(_.startsWith("A")).map(_.substring(1)).max()

But this makes two passes over myrdd, so I must cache it, and I don't want to cache it!

Is there a way to make only a single pass over the data?

For example, something like this:

val max_value = myrdd.flatMap { s =>
  if (s.startsWith("B")) {
    write_to_file(s)  // somehow write the word line to HDFS here
    Iterator.empty
  } else {
    Iterator.single(s.substring(1).toInt)
  }
}.max()
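
One way to make that sketch concrete is to do the writing yourself inside mapPartitions: open one HDFS file per partition, write the "B" lines as they stream past, and emit only each partition's maximum. This is a minimal sketch under my own assumptions; the mytxt output directory and the part-<id> file naming are made up here, and because it bypasses Spark's output committer, a retried or speculative task can leave a duplicate file behind.

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.TaskContext

val max_value = myrdd.mapPartitions { iter =>
  // One hand-rolled output file per partition; the directory and the
  // part-<id> naming are assumptions, not a Spark convention.
  val fs  = FileSystem.get(new Configuration())
  val out = fs.create(new Path(s"mytxt/part-${TaskContext.getPartitionId()}"))
  var localMax  = Int.MinValue
  var sawNumber = false
  try {
    iter.foreach { s =>
      if (s.startsWith("B")) {
        out.write((s + "\n").getBytes("UTF-8")) // word line: write it out
      } else {
        sawNumber = true                        // number line: track the max
        localMax = math.max(localMax, s.substring(1).toInt)
      }
    }
  } finally {
    out.close()
  }
  // Emit at most one value per partition; max() then reduces across partitions.
  if (sawNumber) Iterator.single(localMax) else Iterator.empty
}.max()

Every record is touched exactly once, at the cost of losing saveAsTextFile's committer semantics (temporary files, atomic rename), so this is a trade-off rather than a free win.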
– user2848932
• Possible duplicate of [Spark: Split RDD into two or more RDD](http://stackoverflow.com/questions/32970709/spark-split-rdd-into-two-or-more-rdd) – zero323 Oct 31 '15 at 12:16
• Also: http://stackoverflow.com/q/27231524/1560062 – zero323 Oct 31 '15 at 12:16
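
Another single-pass route is to piggyback the maximum onto the write job with a custom accumulator. The sketch below uses the Spark 1.x AccumulatorParam API and is my own illustration, not something taken from the linked threads:

import org.apache.spark.AccumulatorParam

// Accumulator that keeps a running maximum instead of a sum.
object MaxParam extends AccumulatorParam[Int] {
  def zero(initial: Int): Int = initial
  def addInPlace(a: Int, b: Int): Int = math.max(a, b)
}

val maxAcc = sc.accumulator(Int.MinValue)(MaxParam)

// One job: saveAsTextFile triggers the pass, and the filter updates the
// accumulator as a side effect while keeping only the "B" lines.
myrdd.filter { s =>
  if (s.startsWith("A")) {
    maxAcc += s.substring(1).toInt
    false
  } else true
}.saveAsTextFile("mytxt")

val max_value = maxAcc.value

Spark only guarantees exactly-once accumulator updates inside actions, not transformations, so a retried task may re-apply its updates; taking a maximum is idempotent under re-application, which is what makes the trick tolerable here.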

0 Answers