I'm using Spark to process the following data:
A121
Bword
A342
Bhello
That is, a line starting with 'A' contains a number; otherwise it contains a word.
I need to output the lines that contain a word, and then compute the maximum value over the number lines. That is:
myrdd.filter(_.startsWith("B")).saveAsTextFile("mytxt")
val max_value = myrdd.filter(_.startsWith("A")).map(_.substring(1).toInt).max()
But this approach passes over myrdd twice, so I would have to cache it, and I don't want to cache it!
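To make that concrete, here is roughly what the two-pass version with caching would look like (just a sketch of what I want to avoid; the unpersist() call is only tidy-up):

myrdd.cache()  // keep the RDD in memory so the second pass doesn't recompute it
myrdd.filter(_.startsWith("B")).saveAsTextFile("mytxt")  // first pass: word lines
val max_value = myrdd.filter(_.startsWith("A")).map(_.substring(1).toInt).max()  // second pass: max number
myrdd.unpersist()  // release the cached blocks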
Is there a way to pass over the data only once? For example, something like this:
val max_value = myrdd.flatMap { s =>
  if (s.startsWith("B")) {
    write_to_file(s)  // somehow write the word line to HDFS
    Iterator.empty    // emit nothing for word lines
  } else {
    Iterator.single(s.substring(1).toInt)  // emit the number from an "A" line
  }
}.max()
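For what it's worth, the closest runnable thing I can imagine is writing each partition's "B" lines to its own HDFS file from inside mapPartitions, using the Hadoop FileSystem API directly. This is only a sketch of the intent, not a tested solution: the "mytxt/part-<id>" path layout is my own invention, and unlike saveAsTextFile there is no commit protocol here, so a retried or speculative task could leave duplicate or partial files.

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.TaskContext

val max_value = myrdd.mapPartitions { iter =>
  // One output file per partition; the "part-<id>" naming is just a convention I made up.
  val fs = FileSystem.get(new Configuration())
  val out = fs.create(new Path(s"mytxt/part-${TaskContext.getPartitionId()}"))
  try {
    // Force the lazy iterator inside the try, so every write happens before close().
    val nums = iter.flatMap { s =>
      if (s.startsWith("B")) {
        out.writeBytes(s + "\n")  // side effect: word line goes to HDFS
        Iterator.empty
      } else {
        Iterator.single(s.substring(1).toInt)
      }
    }.toArray
    nums.iterator
  } finally {
    out.close()
  }
}.max()

If there is a more idiomatic way to get two outputs from a single pass, I'd prefer that over this manual file handling.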