My task is to write a code that reads a big file (doesn't fit into memory) reverse it and output most five frequent words .
i have written the code below and it does the job .
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
object ReverseFile {
def main(args: Array[String]) {
val conf = new SparkConf().setAppName("Reverse File")
conf.set("spark.hadoop.validateOutputSpecs", "false")
val sc = new SparkContext(conf)
val txtFile = "path/README_mid.md"
val txtData = sc.textFile(txtFile)
txtData.cache()
val tmp = txtData.map(l => l.reverse).zipWithIndex().map{ case(x,y) => (y,x)}.sortByKey(ascending = false).map{ case(u,v) => v}
tmp.coalesce(1,true).saveAsTextFile("path/out.md")
val txtOut = "path/out.md"
val txtOutData = sc.textFile(txtOut)
txtOutData.cache()
val wcData = txtOutData.flatMap(l => l.split(" ")).map(word => (word, 1)).reduceByKey(_ + _).map(item => item.swap).sortByKey(ascending = false)
wcData.collect().take(5).foreach(println)
}
}
The problem is that i'm new to spark and scala, and as you can see in the code first i read the file reverse it save it then reads it reversed and output the five most frequent words .
- Is there a way to tell spark to save tmp and process wcData (without the need to save,open file) at the same time because otherwise its like reading the file twice .
- From now on i'm going to tackle with spark a lot, so if there is any part of the code (not like the absolute path name ... spark specific) that you might think could be written better i'de appreciate it.