
I'm using Spark to process the following data:

A121
Bword
A342
Bhello

That is, a line that starts with 'A' contains a number; otherwise it contains a word.

Now I need to output the lines that contain a word, and then calculate the maximum of the values on the number lines.

That is:

myrdd.filter(_.startsWith("B")).saveAsTextFile("mytxt")
val max_value = myrdd.filter(_.startsWith("A")).map(_.substring(1)).max()

But this makes two passes over myrdd, so I must cache it, and I don't want to cache it!

Is there a way to make only a single pass over the data?

For example, something like this:

val max_value = myrdd.flatMap { s =>
  if (s.startsWith("B")) {
    write_to_file(s)  // somehow write the word line to HDFS here
    Iterator.empty
  } else {
    Iterator.single(s.substring(1).toInt)
  }
}.max()
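
One way to make that sketch concrete is to do the writing yourself inside mapPartitions: open one HDFS file per partition, write the "B" lines as they stream past, and emit only each partition's maximum. This is a minimal sketch under my own assumptions; the mytxt output directory and the part-<id> file naming are made up here, and because it bypasses Spark's output committer, a retried or speculative task can leave a duplicate file behind.

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.TaskContext

val max_value = myrdd.mapPartitions { iter =>
  // One hand-rolled output file per partition; the directory and the
  // part-<id> naming are assumptions, not a Spark convention.
  val fs  = FileSystem.get(new Configuration())
  val out = fs.create(new Path(s"mytxt/part-${TaskContext.getPartitionId()}"))
  var localMax  = Int.MinValue
  var sawNumber = false
  try {
    iter.foreach { s =>
      if (s.startsWith("B")) {
        out.write((s + "\n").getBytes("UTF-8")) // word line: write it out
      } else {
        sawNumber = true                        // number line: track the max
        localMax = math.max(localMax, s.substring(1).toInt)
      }
    }
  } finally {
    out.close()
  }
  // Emit at most one value per partition; max() then reduces across partitions.
  if (sawNumber) Iterator.single(localMax) else Iterator.empty
}.max()

Every record is touched exactly once, at the cost of losing saveAsTextFile's committer semantics (temporary files, atomic rename), so this is a trade-off rather than a free win.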
– user2848932
• Possible duplicate of [Spark: Split RDD into two or more RDD](http://stackoverflow.com/questions/32970709/spark-split-rdd-into-two-or-more-rdd) – zero323 Oct 31 '15 at 12:16
• Also: http://stackoverflow.com/q/27231524/1560062 – zero323 Oct 31 '15 at 12:16
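
Another single-pass route is to piggyback the maximum onto the write job with a custom accumulator. The sketch below uses the Spark 1.x AccumulatorParam API and is my own illustration, not something taken from the linked threads:

import org.apache.spark.AccumulatorParam

// Accumulator that keeps a running maximum instead of a sum.
object MaxParam extends AccumulatorParam[Int] {
  def zero(initial: Int): Int = initial
  def addInPlace(a: Int, b: Int): Int = math.max(a, b)
}

val maxAcc = sc.accumulator(Int.MinValue)(MaxParam)

// One job: saveAsTextFile triggers the pass, and the filter updates the
// accumulator as a side effect while keeping only the "B" lines.
myrdd.filter { s =>
  if (s.startsWith("A")) {
    maxAcc += s.substring(1).toInt
    false
  } else true
}.saveAsTextFile("mytxt")

val max_value = maxAcc.value

Spark only guarantees exactly-once accumulator updates inside actions, not transformations, so a retried task may re-apply its updates; taking a maximum is idempotent under re-application, which is what makes the trick tolerable here.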

0 Answers