
I am running Spark 2.0.2 in standalone cluster mode, with 2 workers and 1 master node.

Simple test: reading a pipe-delimited file and writing the data out as CSV. The commands below are executed in spark-shell with the master URL set.

// Read the pipe-delimited input; quote is set to \u0000 to effectively disable quoting
val df = spark.sqlContext.read.option("delimiter","|").option("quote","\u0000").csv("/home/input-files/")
// Keep only the rows whose fourth column (_c3) is 'EML'
val emailDf = df.filter("_c3='EML'")
// Repartition to 100 partitions and write out as CSV
emailDf.repartition(100).write.csv("/opt/outputFile/")
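
Optionally, the read and filter can be sanity-checked in the same spark-shell session before triggering the write. A minimal sketch (these checks are illustrative and not part of the original commands):

// Peek at the first rows; with no header the columns are named _c0, _c1, _c2, ...
df.show(5)
// Number of rows that match the filter _c3='EML'
emailDf.count()
// Partition count of the filtered data before repartition(100)
emailDf.rdd.getNumPartitions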

After executing the commands above, the output on each worker looks like this:

On worker1 -> each part file is created under /opt/outputFile/_temporary/task-xxxxx-xxx/part-xxx-xxx
On worker2 -> /opt/outputFile/part-xxx => the part files are generated directly under the output directory specified in the write.

The same thing happens with coalesce(100), or without specifying repartition/coalesce at all.

Questions

1) Why doesn't worker1's /opt/outputFile/ output directory contain part-xxxx files directly, just like worker2's? Why is a _temporary directory created, and why do the part-xxx-xx files stay inside the task-xxx directories?

2) Is it because I don't have HDFS installed on the cluster?

Omkar Puttagunta
  • @user8371915 I don't understand the exact reason from looking at the answers to other questions. Could you please explain? – Omkar Puttagunta Aug 31 '18 at 11:30
  • The write is finalized by moving data from the `_temporary` directory to its final destination. This is done by the driver. Since you write to the local file system, this just won't work. You might see a worker correctly finishing the process if it is co-located with the driver. – Alper t. Turker Aug 31 '18 at 11:41
  • @user8371915 so if I want the `part` files under `outputDirectoryPath`, do I need to have HDFS running on the cluster? – Omkar Puttagunta Aug 31 '18 at 13:42
  • HDFS or other distributed / shared file system. – Alper t. Turker Aug 31 '18 at 15:54
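
Following the comments above, here is a minimal sketch of the same job writing to a location that the driver and both workers can all see, so that the commit step which moves files out of `_temporary` can complete. The `hdfs://namenode:8020` address and the `/mnt/shared` mount are placeholders, not paths from the original setup:

// Sketch only: assumes either an HDFS cluster reachable at hdfs://namenode:8020
// (placeholder address) or a shared mount such as /mnt/shared that is visible to the
// driver and every worker.
val df = spark.read
  .option("delimiter", "|")
  .option("quote", "\u0000")
  .csv("hdfs://namenode:8020/input-files/")        // or "file:///mnt/shared/input-files/"

val emailDf = df.filter("_c3='EML'")

emailDf
  .repartition(100)
  .write
  .csv("hdfs://namenode:8020/outputFile/")         // or "file:///mnt/shared/outputFile/"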
