I have written a spark word count program using below code:
package com.practice
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
object WordCount {
val sparkConf = new SparkConf()
def main(args: Array[String]): Unit = {
val spark = SparkSession.builder().config(sparkConf).master("local[2]").getOrCreate()
val input = args(0)
val output = args(1)
val text = spark.sparkContext.textFile(input)
val outPath = text.flatMap(line => line.split(" "))
val words = outPath.map(w => (w,1))
val wc = words.reduceByKey((x,y)=>(x+y))
wc.saveAsTextFile(output)
}
}
Using spark submit, I run the jar and got the output in the output dir:
SPARK_MAJOR_VERSION=2 spark-submit --master local[2] --class com.practice.WordCount sparkwordcount_2.11-0.1.jar file:///home/hmusr/ReconTest/inputdir/sample file:///home/hmusr/ReconTest/inputdir/output
Both the input and output files are on local and not on HFDS.
In the output dir, I see two files: part-00000.deflate _SUCCESS
.
The output file is present with .deflate extension. I understood that the output was saved in a compressed file after checking internet but is there any way I can read the file ?