I'm running Spark 2.0.2 in standalone cluster mode, with 2 workers and 1 master node.
Simple test: read a pipe-delimited file and write the data out as CSV. The commands below are executed in spark-shell with the master URL set.
// Read pipe-delimited files; the NUL quote character effectively disables quote handling
val df = spark.sqlContext.read.option("delimiter", "|").option("quote", "\u0000").csv("/home/input-files/")
// Keep only the email records
val emailDf = df.filter("_c3 = 'EML'")
// Write the result as CSV across 100 partitions
emailDf.repartition(100).write.csv("/opt/outputFile/")
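For reference, the same read can also be expressed through spark.read, the usual entry point in Spark 2.x; as far as I know spark.sqlContext.read resolves to the same underlying reader, so this is only a stylistic variant:

// Equivalent, more idiomatic Spark 2.x read (same options, same path)
val df2 = spark.read
  .option("delimiter", "|")
  .option("quote", "\u0000") // NUL quote char disables quote handling
  .csv("/home/input-files/")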
After executing the commands above, I observe the following:
On worker1: each part file is created under a temporary task directory, /opt/outputFile/_temporary/task-xxxxx-xxx/part-xxx-xxx.

On worker2: part files (/opt/outputFile/part-xxx) are generated directly under the output directory specified during the write.
The same thing happens with coalesce(100), or without specifying repartition/coalesce at all.
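To see what the driver's view of the output directory looks like, here is a quick check runnable from the same spark-shell; it uses the Hadoop FileSystem API that ships with Spark, and the path is the one used in the write above:

import org.apache.hadoop.fs.{FileSystem, Path}
// List whatever the driver-side (default) filesystem sees under the output directory;
// with no HDFS configured this resolves to the local filesystem of the driver machine
val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
fs.listStatus(new Path("/opt/outputFile/")).foreach(s => println(s.getPath))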
Questions
1) Why doesn't the /opt/outputFile/ output directory on worker1 contain part-xxxx files, just like on worker2? Why is a _temporary directory created there instead, with the part-xxx-xx files left sitting in the task-xxx directories?
2) Is this happening because I don't have HDFS installed on the cluster?
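In case it's relevant, the only workaround I can think of (assuming the cause is the lack of a filesystem shared by all nodes) is to pull the result back to the driver and write it there. This is just a sketch, and only viable when the filtered data is small enough to collect; the output path is hypothetical and the CSV formatting is naive:

import java.io.PrintWriter
// Hypothetical workaround: collect the (small) result to the driver and write
// a single CSV there, so all the data lands on one machine. Note: no escaping
// of commas or quotes inside field values.
val pw = new PrintWriter("/tmp/emails.csv")
try emailDf.collect().foreach(row => pw.println(row.mkString(",")))
finally pw.close()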